Regex Cheatsheet

Regular expressions (regex for short) are very useful for any text-related task. From scraping to analysis, almost everybody encounters regex at some point.


this should be an image

That's actual syntax btw.

Some areas to use Regex for:

  • data validation
  • data scraping
  • data wrangling
  • string parsing
  • string replacement
  • syntax highlighting
  • packet sniffing
  • file renaming

Basics

Surround every expression with /, like this: /abc/
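
As a quick sketch, assuming JavaScript (one of the languages that uses this /.../ literal syntax; many others pass the pattern as a plain string), here is the same expression written both ways. The variable names and sample strings are made up for illustration.

```javascript
// Regex literal: the pattern sits between two slashes.
const literalAbc = /abc/;

// Equivalent constructor form: the pattern is passed as a string.
const constructedAbc = new RegExp("abc");

console.log(literalAbc.test("xxabcxx"));     // true
console.log(constructedAbc.test("xxabcxx")); // true
console.log(literalAbc.test("ab c"));        // false
```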


Anchors
Regex       Explanation
^The        Finds strings which start with The
end$        Finds strings which end with end
^The end$   Finds only the exact string The end
roar        Finds strings which have roar in them
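
To see the anchors in action, here is a minimal sketch assuming JavaScript. The constant names and the sample strings are invented for this example.

```javascript
// The four patterns from the table above.
const startsWithThe = /^The/;
const endsWithEnd = /end$/;
const exactTheEnd = /^The end$/;
const containsRoar = /roar/;

console.log(startsWithThe.test("The lion"));      // true
console.log(startsWithThe.test("See The lion"));  // false, The is not at the start
console.log(endsWithEnd.test("This is the end")); // true
console.log(exactTheEnd.test("The end"));         // true
console.log(exactTheEnd.test("The end."));        // false, extra character before the end
console.log(containsRoar.test("uproarious"));     // true, roar may appear anywhere
```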


Quantifiers
Regex        Explanation
abc*         Finds strings which contain ab followed by 0 or more c
abc+         Finds strings which contain ab followed by 1 or more c
abc?         Finds strings which contain ab followed by 0 or 1 c
abc{2}       Finds strings which contain ab followed by 2 c
abc{2,}      Finds strings which contain ab followed by 2 or more c
abc{2,5}     Finds strings which contain ab followed by 2 to 5 c
a(bc)*       Finds strings which contain a followed by 0 or more bc
a(bc){2,5}   Finds strings which contain a followed by 2 to 5 bc
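
The quantifiers above can be tried out with a short sketch, again assuming JavaScript. The helper allMatches is a hypothetical name introduced only for this example, and the g flag is used so that every match is returned.

```javascript
// Hypothetical helper: returns every match of a global pattern as an array.
function allMatches(pattern, text) {
  return text.match(pattern) || [];
}

console.log(allMatches(/abc*/g, "ab abc abcc"));     // ["ab", "abc", "abcc"]
console.log(allMatches(/abc+/g, "ab abc abcc"));     // ["abc", "abcc"]
console.log(allMatches(/abc?/g, "ab abc abcc"));     // ["ab", "abc", "abc"]
console.log(allMatches(/abc{2}/g, "abc abcc"));      // ["abcc"]
console.log(allMatches(/abc{2,5}/g, "abcccccc"));    // ["abccccc"], stops after 5 c
console.log(allMatches(/a(bc)*/g, "a abc abcbc"));   // ["a", "abc", "abcbc"]
```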


OR-Operator

Regex does not have an AND-Operator, only an OR.

Regex    Explanation
a(b|c)   Finds strings which contain a followed by b or c (possible: ab, ac, abc, acb)
a[bc]    Just like before, but this time without capturing b or c
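
The difference between the two rows is easiest to see in the match result. A minimal sketch, assuming JavaScript, where match() without the g flag returns the full match followed by any captured groups:

```javascript
const groupResult = "ac".match(/a(b|c)/);
const classResult = "ac".match(/a[bc]/);

console.log(groupResult[0]);     // "ac", the full match
console.log(groupResult[1]);     // "c", captured by (b|c)
console.log(classResult[0]);     // "ac", same full match
console.log(classResult.length); // 1, there is no captured group
```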


Character classes
Regex   Explanation
\d      Finds digits
\w      Finds word characters (alphanumeric and _)
\s      Finds whitespace characters (including tabs and line breaks)
.       Finds any character except line breaks by default (use it carefully, a specific class is more precise and often faster)
\D      Negation of \d
\W      Negation of \w
\S      Negation of \s

You can also search for non-printable characters like \t, \n and \r. To search for special characters, simply escape them with \ like this: \$, \., \{, \[ ...
An example of a combination: the expression \$\d finds a $ sign directly followed by a digit.
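
A few of these classes and escapes in a short sketch, assuming JavaScript. All sample strings and constant names are invented for illustration.

```javascript
const prices = "Cost: $5, tip: $2".match(/\$\d/g); // ["$5", "$2"]
const wordRuns = "a_1 b-2".match(/\w+/g);          // ["a_1", "b", "2"]
const fields = "a b\tc\nd".split(/\s/);            // ["a", "b", "c", "d"]
const nonDigits = "a1b2".match(/\D/g);             // ["a", "b"]
const literalDots = "x.y".match(/\./g);            // ["."], the escaped dot is literal
const dotOnLineBreak = /./.test("\n");             // false, . skips line breaks by default

console.log(prices, wordRuns, fields, nonDigits, literalDots, dotOnLineBreak);
```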


Flags

You can set flags after the closing / of an expression:

Flag   Explanation
g      (global) does not return after the first match, restarting the subsequent searches from the end of the previous match
m      (multi-line) when enabled, ^ and $ match the start and end of each line, instead of only the start and end of the whole string
i      (insensitive) makes the whole expression case-insensitive (for instance, /aBc/i would match AbC)
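
Here is a small sketch of how the three flags change the result, assuming JavaScript. The three-line sample string is made up.

```javascript
const flagSample = "Cat\ncat\nDog";

// Without g, only the first match is returned.
const firstOnly = flagSample.match(/cat/i)[0];  // "Cat"

// With g (and i), every match is returned.
const everyCat = flagSample.match(/cat/gi);     // ["Cat", "cat"]

// With m, ^ and $ match at every line; without it they only match the whole string.
const perLine = flagSample.match(/^\w+$/gm);    // ["Cat", "cat", "Dog"]
const wholeOnly = flagSample.match(/^\w+$/g);   // null, the line breaks are not word characters

console.log(firstOnly, everyCat, perLine, wholeOnly);
console.log(/aBc/i.test("AbC"));                // true
```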


Grouping and capturing
Regex         Explanation
a(bc)         Parentheses create a capturing group (possible: abc)
a(?:bc)*      Using ?: makes the group non-capturing (possible: a, abc, abcbc)
a(?<foo>bc)   Using ?<foo> gives the group the name foo

Naming a group lets you look up the result like a dictionary (data type), where the keys are the names of the capturing groups.
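
A minimal sketch of the three group types, assuming JavaScript, where named groups are exposed on the groups property of the match result:

```javascript
const capturing = "abc".match(/a(bc)/);
console.log(capturing[1]);        // "bc", the first capturing group

const nonCapturing = "abcbc".match(/a(?:bc)*/);
console.log(nonCapturing[0]);     // "abcbc"
console.log(nonCapturing.length); // 1, nothing was captured

const named = "abc".match(/a(?<foo>bc)/);
console.log(named.groups.foo);    // "bc", looked up by name
console.log(named[1]);            // "bc", the index still works too
```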


Bracket expressions
Regex         Explanation
[abc]         Finds strings that contain an a, b or c (equal to a|b|c)
[a-c]         Same as before (range from a to c)
[a-fA-F0-9]   Finds strings that contain a character from a to f, A to F or 0 to 9 (a hexadecimal digit, btw.)
[0-9]%        Finds strings that contain a digit from 0 to 9 followed by a %
[^a-zA-Z]     Finds strings that contain a character that is not a letter from a to z or A to Z (in this case ^ negates the bracket expression)

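Important to note: Most special characters (such as ., *, + or $) lose their special meaning inside [ ] and don't need to be escaped. The exceptions are \, ], ^ (at the start) and - (between two characters), which still have a meaning there.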
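
A small sketch of the bracket expressions above, assuming JavaScript; the sample strings are invented. The last constant illustrates that characters such as . and * are taken literally inside [ ].

```javascript
const hexDigits = "#1aFz".match(/[a-fA-F0-9]/g);       // ["1", "a", "F"]
const percents = "50% off, 7% tax".match(/[0-9]%/g);   // ["0%", "7%"]
const hasNonLetter = /[^a-zA-Z]/.test("abc1");         // true, 1 is not a letter
const lettersOnly = /[^a-zA-Z]/.test("abc");           // false, every character is a letter
const literalInside = "a.b*c".match(/[.*]/g);          // [".", "*"]

console.log(hexDigits, percents, hasNonLetter, lettersOnly, literalInside);
```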


Greedy and Lazy match

The quantifiers (*, + and { }) are greedy operators, so they expand the match as far as they can through the provided text.
Given this string: This is a <div>simple div</div> test
The regex <.+> will find <div>simple div</div>. To find only the tags <div> and </div>, use ? to make the expression lazy: <.+?>
An even better expression (it avoids the . operator) would be <[^<>]+>. Explanation: it matches any character except < and >, 1 or more times, between a < and a >.
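
The three expressions can be compared directly on the example string. A minimal sketch, assuming JavaScript and the g flag so that all matches are listed:

```javascript
const htmlSample = "This is a <div>simple div</div> test";

const greedy = htmlSample.match(/<.+>/g);       // ["<div>simple div</div>"]
const lazy = htmlSample.match(/<.+?>/g);        // ["<div>", "</div>"]
const negated = htmlSample.match(/<[^<>]+>/g);  // ["<div>", "</div>"]

console.log(greedy, lazy, negated);
```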


Advanced stuff


Boundaries
Regex     Explanation
\babc\b   Only searches for "whole words" (possible: abc, -abc/; not possible: ab, abcc, babc)

The \b operator is similar to the anchors ^ and $: it matches a position where one side is a word character (as matched by \w) and the other side is not (for example the beginning of the string or a space).
It also has a negation \B:

Regex     Explanation
\Babc\B   Matches only if the pattern is fully surrounded by word characters (possible: babcd; not possible: ab, abc, abcc)
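
The possible and not possible cases from both tables can be checked with a short sketch, assuming JavaScript:

```javascript
const wholeWord = /\babc\b/;
console.log(wholeWord.test("abc"));   // true
console.log(wholeWord.test("-abc/")); // true, - and / are non-word characters
console.log(wholeWord.test("abcc"));  // false
console.log(wholeWord.test("babc"));  // false

const insideWord = /\Babc\B/;
console.log(insideWord.test("babcd")); // true, surrounded by word characters
console.log(insideWord.test("abc"));   // false
console.log(insideWord.test("abcc"));  // false, nothing precedes the a
```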


Back-references
Regex                  Explanation
([abc])\1              Using \1 matches the same text that was matched by the first capturing group
([abc])([de])\2\1      Use \2 (\3, \4, etc.) to match the same text that was matched by the second (third, fourth, etc.) capturing group
(?<foo>[abc])\k<foo>   The foo group is referenced later (\k<foo>). The result is the same as with the first regex.
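
A short sketch of the back-references above, assuming JavaScript. The last pattern (a repeated-word finder) is an extra illustration that is not part of the table.

```javascript
const doubledChar = /([abc])\1/;
console.log(doubledChar.test("aa")); // true, \1 repeats the captured a
console.log(doubledChar.test("ab")); // false, b is not the captured text

const mirrored = /([abc])([de])\2\1/;
console.log(mirrored.test("adda"));  // true: a, d, then d again, then a again
console.log(mirrored.test("adad"));  // false

const namedDoubled = /(?<foo>[abc])\k<foo>/;
console.log(namedDoubled.test("cc")); // true

// Illustration only: find a word that is typed twice in a row.
const repeatedWord = "this is is a test".match(/\b(\w+) \1\b/);
console.log(repeatedWord[0]);         // "is is"
```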


Look-ahead and Look-behind
Regex     Explanation
d(?=r)    Matches a d only if it is followed by r, but r will not be part of the overall regex match
(?<=r)d   Matches a d only if it is preceded by an r, but r will not be part of the overall regex match

This can also be negated:

Regex     Explanation
d(?!r)    Matches a d only if it is not followed by r; the following character will not be part of the overall regex match
(?<!r)d   Matches a d only if it is not preceded by an r; the preceding character will not be part of the overall regex match
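
To make visible which d each expression picks, this sketch (assuming JavaScript, with an invented sample string) replaces every matched d with an uppercase D. Only the d itself is replaced, because the look-around part is not included in the match.

```javascript
const lookSample = "dr dx rd";

const followedByR = lookSample.replace(/d(?=r)/g, "D");     // "Dr dx rd"
const precededByR = lookSample.replace(/(?<=r)d/g, "D");    // "dr dx rD"
const notFollowedByR = lookSample.replace(/d(?!r)/g, "D");  // "dr Dx rD"
const notPrecededByR = lookSample.replace(/(?<!r)d/g, "D"); // "Dr Dx rd"

console.log(followedByR, precededByR, notFollowedByR, notPrecededByR);
```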

Useful expressions


Regex                      Explanation
^(.*)(\r?\n\1)+$           Finds duplicates in consecutive lines (unique, unique, duplicate, duplicate, unique ...)
<[^>]*>                    Matches all HTML tags; replacing the matches with nothing removes them: <b>test</b> becomes test and <a href="https://www.google.de/">Google</a> becomes Google
^(?:[\t ]*(?:\r?\n|\r))+   Matches all empty lines; replacing the matches with nothing removes them
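
A sketch of how these expressions might be applied, assuming JavaScript. The g and m flags are an assumption on top of the bare patterns in the table: m lets ^ and $ work per line, and g handles every occurrence. The sample inputs are made up.

```javascript
// Consecutive duplicate lines: the match spans the repeated lines.
const duplicateRuns = "unique\ndup\ndup\nother".match(/^(.*)(\r?\n\1)+$/gm);
console.log(duplicateRuns); // ["dup\ndup"]

// Strip HTML tags by replacing every tag with an empty string.
const tagPattern = /<[^>]*>/g;
const plainBold = "<b>test</b>".replace(tagPattern, "");
const plainLink = '<a href="https://www.google.de/">Google</a>'.replace(tagPattern, "");
console.log(plainBold, plainLink); // "test" "Google"

// Remove lines that are empty or contain only spaces and tabs.
const withoutEmptyLines = "a\n\n  \nb\n".replace(/^(?:[\t ]*(?:\r?\n|\r))+/gm, "");
console.log(JSON.stringify(withoutEmptyLines)); // "a\nb\n"
```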

Hint: This article is based on, and extends, this post I found during my research.