Regex Cheatsheet

Regular expressions (regex for short) are useful for almost any text-related task. From scraping to analysis, nearly everybody encounters regex at some point.

Some areas to use Regex for:

data validation
data scraping
data wrangling
string parsing
string replacement
syntax highlighting
packet sniffing
file renaming

Basics

Surround every expression with /, like: /abc/

Anchors

Regex	Explanation
^The	Finds strings which start with The
end$	Finds strings which end with end
^The end$	Finds this exact string
roar	Finds strings which have roar in them
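
To try the anchors from the table, here is a minimal sketch, assuming a JavaScript engine such as Node.js. The sample strings are made up for illustration.

```javascript
// The four patterns from the table above.
const startsWithThe = /^The/;
const endsWithEnd = /end$/;
const exactlyTheEnd = /^The end$/;
const containsRoar = /roar/;

// Hypothetical sample strings.
console.log(startsWithThe.test("The lion"));     // true
console.log(endsWithEnd.test("a happy end"));    // true
console.log(exactlyTheEnd.test("The end"));      // true
console.log(exactlyTheEnd.test("The very end")); // false
console.log(containsRoar.test("uproarious"));    // true
```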

Quantifiers

Regex	Explanation
abc*	Finds strings which contain ab followed by 0 or more c
abc+	Finds strings which contain ab followed by 1 or more c
abc?	Finds strings which contain ab followed by 0 or 1 c
abc{2}	Finds strings which contain ab followed by exactly 2 c
abc{2,}	Finds strings which contain ab followed by 2 or more c
abc{2,5}	Finds strings which contain ab followed by 2 to 5 c
a(bc)*	Finds strings which contain a followed by 0 or more bc
a(bc){2,5}	Finds strings which contain a followed by 2 to 5 bc
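
The quantifiers are easier to compare side by side. The following is a small sketch in JavaScript; the helper `matching` and the sample list are hypothetical, and the patterns are wrapped in ^ and $ so that the whole string has to fit (without the anchors, a longer string would still contain a shorter match).

```javascript
// Hypothetical sample strings with 0, 1, 2 and 6 trailing c's.
const samples = ["ab", "abc", "abcc", "abcccccc"];

// Helper (not part of any library): keep the samples a pattern accepts.
const matching = (re) => samples.filter((s) => re.test(s));

console.log(matching(/^abc*$/));     // all four samples
console.log(matching(/^abc+$/));     // everything except "ab"
console.log(matching(/^abc?$/));     // "ab" and "abc"
console.log(matching(/^abc{2,5}$/)); // only "abcc"

// A quantifier after a group repeats the whole group.
console.log(/^a(bc){2,5}$/.test("abcbc")); // true
console.log(/^a(bc){2,5}$/.test("abc"));   // false
```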

OR-Operator

Regex does not have an AND-Operator, only an OR.

Regex	Explanation
a(b|c)	Finds strings which contain a followed by b or c (possible: ab, ac, abc, acb)
a[bc]	Just like before, but this time without capturing b or c
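
The difference between the two variants shows up in the match result. A short sketch, assuming JavaScript's `String.prototype.match`; the variable names are only for illustration.

```javascript
// With parentheses, the matched alternative is captured as group 1.
const withGroup = "ac".match(/a(b|c)/);
console.log(withGroup[0]); // "ac"
console.log(withGroup[1]); // "c"

// With a bracket expression, the same text matches but nothing is captured.
const withBrackets = "ac".match(/a[bc]/);
console.log(withBrackets[0]);     // "ac"
console.log(withBrackets.length); // 1 (only the full match)
```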

Character classes

Regex	Explanation
\d	Finds digits
\w	finds word characters (alphanumeric and _)
\s	finds a whitespace (including tabs and line breaks)
.	finds any character except line breaks (should be used carefully, other classes are faster and more precise)
\D	negation of \d
\W	negation of \w
\S	negation of \s

You can also search for non printable chars like \t, \n and \r. To search for special chars, simply escape with \ like this: \:, \$, \., \{, \[ ...
An example of a combination: The expression \$\d matches a $-sign followed by one digit.
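
A few of these classes in action, as a rough JavaScript sketch with made-up input strings:

```javascript
// \$\d: an escaped dollar sign followed by a single digit.
const price = "Total: $42".match(/\$\d/);
console.log(price[0]); // "$4"

// \w with the g flag collects every word character (letters, digits, _).
const wordChars = "a1_ b2".match(/\w/g);
console.log(wordChars); // ["a", "1", "_", "b", "2"]

// \s also matches tabs; \D matches anything that is not a digit.
console.log(/\s/.test("tab\there")); // true
console.log(/\D/.test("12345"));     // false

// By default the dot does not match a line break.
console.log(/./.test("\n")); // false
```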

Flags

You can define flags at the end of an expression:

Regex	Explanation
g	(global) does not return after the first match, restarting the subsequent searches from the end of the previous match
m	(multi-line) when enabled ^ and $ will match the start and end of a line, instead of the whole string
i	(insensitive) makes the whole expression case-insensitive (for instance /aBc/i would match AbC)
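
In JavaScript the flags go directly after the closing slash and can be combined. A small sketch with hypothetical input:

```javascript
const text = "aBc abc ABC";

// Without flags: case-sensitive, and only the first match is returned.
console.log(text.match(/abc/)[0]); // "abc"

// g + i: every match, regardless of case.
const allMatches = text.match(/abc/gi);
console.log(allMatches); // ["aBc", "abc", "ABC"]

// m: ^ and $ apply to each line instead of the whole string.
const lines = "one\ntwo";
console.log(/^\w+$/.test(lines)); // false (the string contains a line break)
const perLine = lines.match(/^\w+$/gm);
console.log(perLine); // ["one", "two"]
```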

Grouping and capturing

Regex	Explanation
a(bc)	parentheses create a capturing group (possible: abc)
a(?:bc)*	using ?: disables the capturing group (possible: a)
a(?<foo>bc)	using ?<foo> will give the group the name foo

Naming a group lets you read the result like a dictionary (data type), where each key is the name of a capturing group.
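
How the three kinds of groups appear in a match result can be sketched like this. It assumes a JavaScript engine with named-group support (ES2018 or later); the variable names are illustrative.

```javascript
// Capturing group: the group's text is available by index.
const plain = "abc".match(/a(bc)/);
console.log(plain[1]); // "bc"

// Non-capturing group: it groups, but stores nothing.
const nonCapturing = "abc".match(/a(?:bc)/);
console.log(nonCapturing.length); // 1 (only the full match)

// Named group: the text is available under its name.
const named = "abc".match(/a(?<foo>bc)/);
console.log(named.groups.foo); // "bc"
```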

Bracket expressions

Regex	Explanation
[abc]	finds strings that either have one a or b or c (equal to a|b|c)
[a-c]	same as before (scope from a to c)
[a-fA-F0-9]	finds strings that either have one a to f or A to F or 0 to 9 (hexadecimal btw.)
[0-9]%	finds strings that have 0 to 9 followed by a %
[^a-zA-Z]	finds strings that have a character which is not a letter from a to z or from A to Z (in this case ^ is used as negation of the expression)

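Important to note: Most special characters (such as ., *, + or $) lose their meaning inside the [ ] and don't need to be escaped. In most engines the exceptions are ], \, ^ (at the start) and - (between two characters).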
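
A quick sketch of bracket expressions in JavaScript, using made-up inputs:

```javascript
// Hexadecimal check: anchors plus + require every character to be a hex digit.
const isHex = (s) => /^[a-fA-F0-9]+$/.test(s);
console.log(isHex("1aF9")); // true
console.log(isHex("xyz"));  // false

// [0-9]% matches a single digit directly followed by %.
const percents = "50% off, 7% tax".match(/[0-9]%/g);
console.log(percents); // ["0%", "7%"]

// A negated set matches each character that is not a letter.
const nonLetters = "abc123".match(/[^a-zA-Z]/g);
console.log(nonLetters); // ["1", "2", "3"]

// Inside brackets the dot is a literal dot, not "any character".
console.log(/[.]/.test("a.b")); // true
console.log(/[.]/.test("ab"));  // false
```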

Greedy and Lazy match

The quantifiers (*, +, { }) are greedy operators, so they expand the match as far as they can through the provided text.
Given this string: This is a <div>simple div</div> test
The regex <.+> will find <div>simple div</div>. To find only the tags <div> and </div>, use ? to make the expression lazy: <.+?>
An even better expression (it avoids the . operator) would be: <[^<>]+> - Explanation: Matches a <, then 1 or more characters that are neither < nor >, then a >.
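
The three expressions can be compared directly on the example string. A minimal JavaScript sketch:

```javascript
const html = "This is a <div>simple div</div> test";

// Greedy: .+ runs to the last > in the string.
const greedy = html.match(/<.+>/)[0];
console.log(greedy); // "<div>simple div</div>"

// Lazy: .+? stops at the first > it can reach.
const lazy = html.match(/<.+?>/g);
console.log(lazy); // ["<div>", "</div>"]

// Negated set: same result without relying on the dot.
const negated = html.match(/<[^<>]+>/g);
console.log(negated); // ["<div>", "</div>"]
```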

Advanced stuff

Boundaries

Regex	Explanation
\babc\b	only searches for "whole words" (possible: abc, -abc/ - not possible: ab, abcc, babc)

The \b operator is similar to the anchors ^ and $: it matches a position, not a character. One side of that position must be a word character (such as \w) and the other side a non-word character (for example the beginning of a string or a space).
It also has a negation \B:

Regex	Explanation
\Babc\B	matches only if the pattern is fully surrounded by word characters (possible: babcd - not possible: ab, abc, abcc)
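
The examples from both boundary tables can be checked with a short JavaScript sketch:

```javascript
const wholeWord = /\babc\b/;
console.log(wholeWord.test("abc"));   // true
console.log(wholeWord.test("-abc/")); // true (- and / are non-word characters)
console.log(wholeWord.test("abcc"));  // false
console.log(wholeWord.test("babc"));  // false

const insideWord = /\Babc\B/;
console.log(insideWord.test("babcd")); // true
console.log(insideWord.test("abc"));   // false
console.log(insideWord.test("abcc"));  // false
```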

Back-references

Regex	Explanation
([abc])\1	using \1 it matches the same text that was matched by the first capturing group
([abc])([de])\2\1	use \2 (\3, \4, etc.) to identify the same text that was matched by the second (third, fourth, etc.) capturing group
(?<foo>[abc])\k<foo>	the foo group is referenced later (\k<foo>). The result is the same as that of the first regex.
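
A back-reference repeats the text that was matched, not the pattern. A small JavaScript sketch with made-up inputs:

```javascript
// \1 must repeat exactly what group 1 matched.
const doubled = /([abc])\1/;
console.log(doubled.test("aa")); // true
console.log(doubled.test("ab")); // false (b is in the set, but it is not the captured a)

// \2\1 mirrors the two captured characters.
const mirrored = /([abc])([de])\2\1/;
console.log(mirrored.test("adda")); // true
console.log(mirrored.test("adad")); // false

// The named variant behaves like the first pattern.
const namedRef = /(?<foo>[abc])\k<foo>/;
console.log(namedRef.test("cc")); // true
```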

Look-ahead and Look-behind

Regex	Explanation
d(?=r)	matches a d only if it is followed by r, but r will not be part of the overall regex match
(?<=r)d	matches a d only if it is preceded by an r, but r will not be part of the overall regex match

This can also be negated:

Regex	Explanation
d(?!r)	matches a d only if it is not followed by r, but r will not be part of the overall regex match
(?<!r)d	matches a d only if it is not preceded by an r, but r will not be part of the overall regex match
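
That the r stays outside the match is easiest to see with a replacement. The sketch below assumes a JavaScript engine that supports look-behind (ES2018 or later, for example a current Node.js); the sample words are made up.

```javascript
// Look-ahead: only the d is replaced, the r is left untouched.
const replaced = "drama".replace(/d(?=r)/, "D");
console.log(replaced); // "Drama"

// Look-behind: the match is just the d, found at index 2.
const behind = "order".match(/(?<=r)d/);
console.log(behind[0], behind.index); // "d" 2

// Negated variants.
console.log(/d(?!r)/.test("dog")); // true
console.log(/d(?!r)/.test("dr"));  // false
console.log(/(?<!r)d/.test("ad")); // true
console.log(/(?<!r)d/.test("rd")); // false
```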

Useful expressions

Regex	Explanation
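^(.*)(\r?\n\1)+$	Finds duplicates in consecutive lines (unique, unique, duplicate, duplicate, unique ...); use it with the m flag so that ^ and $ apply per line
<[^>]*>	Matches all HTML tags; replacing the matches with nothing removes them: <b>test</b> becomes test and <a href="https://www.google.de/">Google</a> becomes Google
^(?:[\t ]*(?:\r?\n|\r))+	Matches all empty lines; replacing the matches with nothing removes them (use it with the m flag)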
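
These expressions only find text; the removing happens when the matches are replaced with an empty string. A rough JavaScript sketch, with hypothetical input strings and the g and m flags added:

```javascript
// Consecutive duplicate lines: the repeated "b" block is found.
const duplicates = "a\nb\nb\nc".match(/^(.*)(\r?\n\1)+$/gm);
console.log(duplicates); // ["b\nb"]

// Strip HTML tags by replacing every tag with nothing.
const link = '<a href="https://www.google.de/">Google</a>';
const stripped = link.replace(/<[^>]*>/g, "");
console.log(stripped); // "Google"

// Remove empty lines (lines that contain only tabs or spaces count too).
const compact = "a\n\n\nb\n".replace(/^(?:[\t ]*(?:\r?\n|\r))+/gm, "");
console.log(JSON.stringify(compact)); // "a\nb\n"
```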

Hint: This article is based on and extends this post I found during my research.