Regex Cheatsheet
Regular Expressions (Regex for short) are useful for almost any text-related task. From scraping to analysis, everybody encounters Regex at some point.
Some areas to use Regex for:
- data validation
- data scraping
- data wrangling
- string parsing
- string replacement
- syntax highlighting
- packet sniffing
- file renaming
Basics
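In JavaScript and many command-line tools, an expression is written between two slashes, like /abc/. Other languages pass the pattern as a plain string. This cheatsheet uses the slash notation.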
Anchors
| Regex | Explanation |
|---|---|
| ^The | Finds strings which start with The |
| end$ | Finds strings which end with end |
| ^The end$ | Finds this exact string |
| roar | Finds strings which have roar in them |
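The anchor patterns above can be tried out directly. The following is a minimal sketch, assuming a JavaScript/TypeScript regex engine; the helper name anchorResults and the sample strings are made up for illustration.

```typescript
// The four patterns from the anchors table, in table order.
const anchorPatterns: RegExp[] = [/^The/, /end$/, /^The end$/, /roar/];

// Returns one boolean per pattern: does the pattern match the sample?
function anchorResults(sample: string): boolean[] {
  return anchorPatterns.map((re) => re.test(sample));
}

console.log(anchorResults("The end"));         // true, true, true, false
console.log(anchorResults("The lion roared")); // true, false, false, true
console.log(anchorResults("This is the end")); // false, true, false, false
```

Note that a pattern without anchors, like roar, matches anywhere in the string, which is why it also finds "roared".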
Quantifiers
| Regex | Explanation |
|---|---|
| abc* | Finds strings which contain ab followed by 0 or more c |
| abc+ | Finds strings which contain ab followed by 1 or more c |
| abc? | Finds strings which contain ab followed by 0 or 1 c |
| abc{2} | Finds strings which contain ab followed by exactly 2 c |
| abc{2,} | Finds strings which contain ab followed by 2 or more c |
| abc{2,5} | Finds strings which contain ab followed by 2 to 5 c |
| a(bc)* | Finds strings which contain a followed by 0 or more repetitions of bc |
| a(bc){2,5} | Finds strings which contain a followed by 2 to 5 repetitions of bc |
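Because an unanchored pattern matches anywhere in a string, the differences between quantifiers are easiest to see when the pattern has to cover the whole input. A small sketch, assuming a JavaScript/TypeScript engine; matchesWhole is a hypothetical helper that wraps the pattern in ^(?:...)$.

```typescript
// Wraps a pattern in ^(?:...)$ so it must match the whole sample.
function matchesWhole(pattern: RegExp, sample: string): boolean {
  return new RegExp("^(?:" + pattern.source + ")$").test(sample);
}

console.log(matchesWhole(/abc*/, "ab"));           // true: zero c is allowed
console.log(matchesWhole(/abc+/, "ab"));           // false: at least one c is required
console.log(matchesWhole(/abc?/, "abcc"));         // false: at most one c
console.log(matchesWhole(/abc{2}/, "abcc"));       // true: exactly two c
console.log(matchesWhole(/abc{2,}/, "abccccccc")); // true: two or more c
console.log(matchesWhole(/abc{2,5}/, "abcccccc")); // false: six c is one too many
console.log(matchesWhole(/a(bc)*/, "abcbc"));      // true: the group bc repeats twice
console.log(matchesWhole(/a(bc){2,5}/, "abc"));    // false: bc appears only once
```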
OR-Operator
Regex does not have an AND-Operator, only an OR.
| Regex | Explanation |
|---|---|
| a(b|c) | Finds strings which contain a followed by b or c (possible: ab, ac, abc, acb) |
| a[bc] | Just like before, but this time without capturing b or c |
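The difference between the two rows shows up in the match result: the parenthesized version records which alternative was chosen, the bracket version does not. A minimal sketch, assuming a JavaScript/TypeScript engine; matchParts is a hypothetical helper.

```typescript
// Returns the full match followed by any captured groups, or [] if there is no match.
function matchParts(pattern: RegExp, sample: string): string[] {
  const m = pattern.exec(sample);
  return m === null ? [] : m.slice();
}

console.log(matchParts(/a(b|c)/, "xacx")); // "ac" plus the captured "c"
console.log(matchParts(/a[bc]/, "xacx"));  // only "ac", nothing captured
console.log(matchParts(/a(b|c)/, "xadx")); // empty: d is neither b nor c
```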
Character classes
| Regex | Explanation |
|---|---|
| \d | Finds digits |
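| \w | Finds word characters (alphanumeric and _) |
| \s | Finds a whitespace character (including tabs and line breaks) |
| . | Finds any character except a line break (should be used carefully, other classes are faster and more precise) |
| \D | Negation of \d: finds any character that is not a digit |
| \W | Negation of \w: finds any character that is not a word character |
| \S | Negation of \s: finds any character that is not whitespace |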
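You can also search for non-printable characters like \t (tab), \n (line feed) and \r (carriage return). To search for a character that has a special meaning in Regex, escape it with \ like this: \$, \., \{, \[ ...
An example of a combination: The expression \$\d finds a $-sign that is directly followed by a digit. Both the $ and the digit are part of the match.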
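The escaping rule and the character classes can be combined as in the following sketch. It assumes a JavaScript/TypeScript engine; the function names and the sample sentence are invented for illustration.

```typescript
// \$ is a literal dollar sign, \d is exactly one digit.
function dollarPlusDigit(sample: string): string[] {
  return sample.match(/\$\d/g) ?? [];
}

// Adding + lets \d repeat, so the whole amount is matched.
function dollarAmounts(sample: string): string[] {
  return sample.match(/\$\d+/g) ?? [];
}

// \s matches spaces, tabs and line breaks alike.
function splitOnWhitespace(sample: string): string[] {
  return sample.split(/\s/);
}

const priceLine = "Coffee costs $4, tea costs $12.";
console.log(dollarPlusDigit(priceLine));            // "$4" and "$1"
console.log(dollarAmounts(priceLine));              // "$4" and "$12"
console.log(splitOnWhitespace("a_b c\td\ne"));      // "a_b", "c", "d", "e"
console.log(/a.b/.test("a-b"), /a.b/.test("a\nb")); // true, then false: . skips line breaks
```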
Flags
You can define flags at the end of an expression, after the closing /:
| Flag | Explanation |
|---|---|
| g | (global) does not stop after the first match; each subsequent search starts from the end of the previous match |
| m | (multi-line) when enabled, ^ and $ match the start and end of each line instead of the start and end of the whole string |
| i | (insensitive) makes the whole expression case-insensitive (for instance /aBc/i would match AbC) |
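The effect of each flag is easiest to see by running the same word against a two-line string with different flag combinations. A short sketch, assuming a JavaScript/TypeScript engine; allMatches and the sample text are made up for this example.

```typescript
// Returns every match the pattern finds (only the first one without the g flag).
function allMatches(pattern: RegExp, sample: string): string[] {
  return sample.match(pattern) ?? [];
}

const flagSample = "Roses are red\nroses are blue";

console.log(allMatches(/roses/, flagSample));     // "roses": case-sensitive, first match only
console.log(allMatches(/roses/i, flagSample));    // "Roses": i ignores case, still first match only
console.log(allMatches(/roses/gi, flagSample));   // "Roses", "roses": g keeps searching
console.log(allMatches(/^roses/gi, flagSample));  // "Roses": ^ is the start of the string only
console.log(allMatches(/^roses/gim, flagSample)); // "Roses", "roses": m makes ^ match at each line
```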
Grouping and capturing
| Regex | Explanation |
|---|---|
| a(bc) | parentheses create a capturing group (possible: abc) |
| a(?:bc)* | using ?: disables the capturing group, so bc is grouped but not captured (possible: a, abc, abcbc) |
| a(?<foo>bc) | using ?<foo> will give the group the name foo |
Naming a group lets you read the result like a dictionary: each key is the name of a capturing group and each value is the text that group matched.
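The dictionary-like access can be sketched as follows, assuming a JavaScript/TypeScript engine with named-group support (ES2018 or later). The helper namedGroups and the date pattern are hypothetical examples, not part of the tables above.

```typescript
// Returns the named groups of the first match as a plain object, or {} if nothing matched.
function namedGroups(pattern: RegExp, sample: string): Record<string, string> {
  const m = pattern.exec(sample);
  return m !== null && m.groups !== undefined ? { ...m.groups } : {};
}

// The pattern from the table: the group named foo captures "bc".
console.log(namedGroups(/a(?<foo>bc)/, "xabcx")["foo"]); // bc

// A made-up date pattern with three named groups.
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const dateParts = namedGroups(isoDate, "released on 2021-03-04");
console.log(dateParts["year"], dateParts["month"], dateParts["day"]); // 2021 03 04
```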
Bracket expressions
| Regex | Explanation |
|---|---|
| [abc] | finds strings that either have one a or b or c (equal to a|b|c) |
| [a-c] | same as before (scope from a to c) |
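| [a-fA-F0-9] | finds strings that have one character from a to f, A to F or 0 to 9 (a hexadecimal digit) |
| [0-9]% | finds strings that have a digit from 0 to 9 followed by a % |
| [^a-zA-Z] | finds strings that have at least one character that is not a letter from a to z or from A to Z (in this case ^ negates the bracket expression) |
Important to note: Inside [ ] most special characters (such as ., *, +, ?, $, ( and )) lose their meaning and do not need to be escaped. The exceptions in most flavors are ], \, ^ (when it comes first) and - (between two characters), which still have to be escaped to be matched literally.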
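A few of the bracket expressions in action, as a minimal sketch assuming a JavaScript/TypeScript engine. All function names and sample strings are invented for illustration.

```typescript
// Anchored and repeated, the hex class checks that a whole string is hexadecimal.
function isHex(value: string): boolean {
  return /^[a-fA-F0-9]+$/.test(value);
}

// One digit directly followed by a percent sign.
function percentDigits(sample: string): string[] {
  return sample.match(/[0-9]%/g) ?? [];
}

// True if the sample contains at least one character that is not a letter.
function hasNonLetter(sample: string): boolean {
  return /[^a-zA-Z]/.test(sample);
}

// Inside brackets, . and $ are literal characters and need no backslash.
function dotsAndDollars(sample: string): string[] {
  return sample.match(/[.$]/g) ?? [];
}

console.log(isHex("c0ffee"), isHex("coffee"));           // true, then false: o is not a hex digit
console.log(percentDigits("5% or 7%"));                  // "5%", "7%"
console.log(hasNonLetter("abc"), hasNonLetter("abc1"));  // false, then true
console.log(dotsAndDollars("a.b$c"));                    // ".", "$"
```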
Greedy and Lazy match
The quantifiers (*, + and { }) are greedy operators, so they expand the match as far as they can through the provided text.
Given this string: This is a <div>simple div</div> test
The regex <.+> will find <div>simple div</div>. To find only the tags <div> and </div>, add ? after the quantifier to make it lazy: <.+?>
An even better expression, which avoids the . operator, is <[^<>]+>. Explanation: it matches one or more characters other than < and > between a < and a >.
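The three expressions can be compared on the example string from above. A minimal sketch, assuming a JavaScript/TypeScript engine; findAll is a hypothetical helper.

```typescript
// Returns every match of a global pattern, or [] if there is none.
function findAll(pattern: RegExp, sample: string): string[] {
  return sample.match(pattern) ?? [];
}

const divSample = "This is a <div>simple div</div> test";

console.log(findAll(/<.+>/g, divSample));     // greedy: "<div>simple div</div>"
console.log(findAll(/<.+?>/g, divSample));    // lazy: "<div>", "</div>"
console.log(findAll(/<[^<>]+>/g, divSample)); // negated class: "<div>", "</div>"
```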
Advanced stuff
Boundaries
| Regex | Explanation |
|---|---|
| \babc\b | only matches "whole words" (possible: abc, -abc/; not possible: ab, abcc, babc) |
The \b operator is an anchor like ^ and $: it matches a position, not a character. That position must have a word character (\w) on one side and a non-word character, or the beginning or end of the string, on the other side.
It also has a negation, \B:
| Regex | Explanation |
|---|---|
| \Babc\B | matches only if the pattern is fully surrounded by word characters (possible: babcd; not possible: ab, abc, abcc) |
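The "possible" and "not possible" cases from both tables can be checked with a short sketch, assuming a JavaScript/TypeScript engine. The function names are made up for this example.

```typescript
// True if the sample contains abc as a whole word.
function hasWholeWordAbc(sample: string): boolean {
  return /\babc\b/.test(sample);
}

// True if the sample contains abc with a word character on both sides.
function hasEmbeddedAbc(sample: string): boolean {
  return /\Babc\B/.test(sample);
}

console.log(hasWholeWordAbc("abc"), hasWholeWordAbc("-abc/"));  // true, true
console.log(hasWholeWordAbc("abcc"), hasWholeWordAbc("babc"));  // false, false
console.log(hasEmbeddedAbc("babcd"));                           // true
console.log(hasEmbeddedAbc("abc"), hasEmbeddedAbc("abcc"));     // false, false
```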
Back-references
| Regex | Explanation |
|---|---|
| ([abc])\1 | using \1 it matches the same text that was matched by the first capturing group |
| ([abc])([de])\2\1 | use \2 (\3, \4, etc.) to identify the same text that was matched by the second (third, fourth, etc.) capturing group |
| (?<foo>[abc])\k<foo> | the foo group is referenced later by name (\k<foo>). The result is the same as with the first regex. |
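The three back-reference patterns behave as in the following sketch, which assumes a JavaScript/TypeScript engine with named-group support (ES2018 or later). The constant names and sample strings are invented for illustration.

```typescript
// \1 must repeat exactly what group 1 captured.
const doubledLetter = /([abc])\1/;

// \2 and then \1 repeat the two captures in reverse order.
const mirroredPair = /([abc])([de])\2\1/;

// Same as doubledLetter, but the group is referenced by name.
const doubledNamed = /(?<foo>[abc])\k<foo>/;

console.log(doubledLetter.test("aa"), doubledLetter.test("ab"));    // true, false
console.log(mirroredPair.test("adda"), mirroredPair.test("adad"));  // true, false
console.log(doubledNamed.test("cc"), doubledNamed.test("cb"));      // true, false
```

A back-reference matches the text that was captured, not the pattern of the group, which is why "ab" fails even though both letters are in [abc].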
Look-ahead and Look-behind
| Regex | Explanation |
|---|---|
| d(?=r) | matches a d only if it is followed by r, but r will not be part of the overall regex match |
| (?<=r)d | matches a d only if it is preceded by an r, but r will not be part of the overall regex match |
This can also be negated:
| Regex | Explanation |
|---|---|
| d(?!r) | matches a d only if it is not followed by r, but r will not be part of the overall regex match |
| (?<!r)d | matches a d only if it is not preceded by an r, but r will not be part of the overall regex match |
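To see which d each lookaround selects, the following sketch replaces every matched d with an uppercase D. It assumes a JavaScript/TypeScript engine with look-behind support (ES2018 or later); markMatches and the sample string are hypothetical.

```typescript
// Replaces every match of a global pattern with "D" so the matched positions stand out.
function markMatches(pattern: RegExp, sample: string): string {
  return sample.replace(pattern, "D");
}

const lookSample = "dr dx rd";

console.log(markMatches(/d(?=r)/g, lookSample));  // "Dr dx rd": only the d followed by r
console.log(markMatches(/(?<=r)d/g, lookSample)); // "dr dx rD": only the d preceded by r
console.log(markMatches(/d(?!r)/g, lookSample));  // "dr Dx rD": every d not followed by r
console.log(markMatches(/(?<!r)d/g, lookSample)); // "Dr Dx rd": every d not preceded by r
```

In every case the r stays untouched, because the lookaround only checks for it and does not consume it.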
Useful expressions
| Regex | Explanation |
|---|---|
| ^(.*)(\r?\n\1)+$ | Finds duplicates in consecutive lines when used with the m flag (unique, unique, duplicate, duplicate, unique ...) |
| <[^>]*> | Finds HTML tags. Replacing every match with an empty string removes them: <b>test</b> becomes test and <a href="https://www.google.de/">Google</a> becomes Google |
| ^(?:[\t ]*(?:\r?\n|\r))+ | Finds runs of empty or whitespace-only lines when used with the m flag. Replacing every match with an empty string removes them |
Hint: This article is based on, and extends, this post I found during my research.
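The three expressions only find text; removing or collapsing it is done by a replace call. A minimal sketch, assuming a JavaScript/TypeScript engine; the function names are made up, and the g and m flags are added so that every line is processed.

```typescript
// Collapses runs of identical consecutive lines into a single line.
function collapseDuplicateLines(text: string): string {
  return text.replace(/^(.*)(\r?\n\1)+$/gm, "$1");
}

// Deletes everything that looks like an HTML tag.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, "");
}

// Deletes lines that are empty or contain only spaces and tabs.
function removeEmptyLines(text: string): string {
  return text.replace(/^(?:[\t ]*(?:\r?\n|\r))+/gm, "");
}

console.log(collapseDuplicateLines("unique\ndup\ndup\nother"));        // unique, dup, other
console.log(stripTags('<a href="https://www.google.de/">Google</a>')); // Google
console.log(removeEmptyLines("a\n\n  \nb\n"));                         // a, b
```

The tag pattern is a quick text clean-up, not an HTML parser; a > inside an attribute value, for example, would end the match too early.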