Regular Expressions

Regular expressions (commonly abbreviated to regex or regexp) are strings that describe text patterns. They are used to search for text that matches a pattern and, optionally, to replace or modify it. Tools and editors such as grep, sed, awk, Notepad++ and Visual Studio accept regular expressions for searching and replacing.

Programming languages support regular expressions as well. Support is most commonly associated with scripting languages such as Perl, Python and Ruby, but compiled languages such as Java and C++ (through Boost or Qt, for example) offer it too.
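As a rough illustration of search and replace from a programming language, here is a minimal sketch using Python's standard re module. The sample sentence and the variable names are made up for this example.

```python
import re

text = "The colour of the sky and the color of the sea"

# Search: collect every occurrence of either spelling.
found = re.findall(r"colou?r", text)

# Replace: rewrite both spellings to a single form.
normalised = re.sub(r"colou?r", "color", text)

print(found)
print(normalised)
```

The same pattern string is used for both operations; only the function called on it changes.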

Regular Expressions Syntax and Rules

A regular expression is a string of characters. A few of those characters have a special meaning and perform operations such as grouping, quantification and negation. All other characters are ordinary and match themselves.

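List of special characters:  "  \  [  ]  ^  -  ?  .  *  +  |  (  )  $  /  {  }  %  <  >

To use one of these characters literally in a regex, escape it with a backslash (\) or, in the Lex scanner generator, enclose it in double quotes. Several entries in this list (", /, %, < and >) are special in Lex but ordinary in most other regex engines, where the backslash is the usual way to escape a character.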
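A small sketch of escaping in Python's re module follows. Python does not support the quoting style, so the example uses backslashes and the standard helper re.escape(), which adds them automatically. The sample strings are invented for illustration.

```python
import re

price = "Total: $5.00 (approx.)"

# Escaped by hand: \$ and \. stand for a literal dollar sign and dot.
manual = re.search(r"\$5\.00", price)

# re.escape() inserts the backslashes for every special character.
literal = "(approx.)"
auto = re.search(re.escape(literal), price)

# Without escaping, the dot matches any character, so "5x00" matches too.
unescaped = re.search(r"5.00", "5x00")

print(manual.group(), auto.group(), unescaped is not None)
```

re.escape() is handy when the literal text comes from user input and may contain special characters.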

Normal Text

  • a matches a
  • hello matches hello
  • . is a special character: it matches any single character except a newline
  • \. matches .
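The difference between an ordinary character, the dot and an escaped dot can be checked with a short Python sketch (test strings chosen for this example):

```python
import re

# Ordinary characters match themselves.
plain = re.search(r"hello", "say hello").group()

# An unescaped dot matches any single character except a newline.
dot_hits = re.findall(r"a.c", "abc a-c a\nc")

# An escaped dot matches only a literal dot.
literal_dot_hits = re.findall(r"a\.c", "abc a.c")

print(plain, dot_hits, literal_dot_hits)
```

Note that "a\nc" is not found by a.c, because the character between a and c is a newline.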

Groups

Parentheses () create a group. Inside a group, the vertical bar | separates alternatives.

  • (a|b|c) matches a or b or c
  • (colour|color) matches colour or color
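The sketch below, again in Python, tries the group patterns above and also shows why the parentheses matter: without them the alternation applies to the whole pattern. The gr(a|e)y pattern and the test strings are additions for this example.

```python
import re

# fullmatch() succeeds only if the entire string matches the pattern.
single = [s for s in ["a", "b", "c", "d"] if re.fullmatch(r"(a|b|c)", s)]

# A group also captures the text that it matched.
m = re.search(r"the (colour|color) red", "I like the colour red")
captured = m.group(1)

# With the group, the alternation covers only the middle letter.
grouped = re.fullmatch(r"gr(a|e)y", "grey") is not None

# Without it, the pattern means "gra" or "ey", so "grey" is not matched.
ungrouped = re.fullmatch(r"gra|ey", "grey") is not None

print(single, captured, grouped, ungrouped)
```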

Ranges

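Square brackets [] define a character class: a set of characters, any one of which may match. A hyphen between two characters specifies a range, and a ^ right after the opening bracket negates the class.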

  • [abc] matches a or b or c
  • [^abc] matches any character except a, b and c
  • [a-c] matches a or b or c, range from a to c
  • [A-C] matches A or B or C, range from A to C
  • [a-cA-C] matches a or b or c or A or B or C
  • [0-5] matches a digit from 0 to 5
  • [a-zA-Z0-9] matches one character of alphanumeric text
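To see which characters each class picks out, one can run the classes over a short test string in Python. The string below is arbitrary; each findall() call returns the individual characters that matched.

```python
import re

text = "aXc-B3z9"

lower_abc = re.findall(r"[a-c]", text)        # only a, b or c
not_abc = re.findall(r"[^abc]", "abcd")       # anything except a, b, c
digits_0_5 = re.findall(r"[0-5]", text)       # digits 0 to 5 only
alnum = re.findall(r"[a-zA-Z0-9]", text)      # letters and digits

print(lower_abc, not_abc, digits_0_5, alnum)
```

The hyphen in the test string is skipped by [a-zA-Z0-9], and the digit 9 is skipped by [0-5].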

Quantifiers

A quantifier specifies how many times the preceding element is repeated.

  • a* zero or more, matches the empty string, a, aa, aaa, aaaa, …
  • a+ one or more, matches a, aa, aaa, aaaa, …
  • a? zero or one, matches the empty string or a
  • a{3} exactly three, matches aaa
  • a{3,} three or more, matches aaa, aaaa, aaaaa, …
  • a{3,6} three to six, matches aaa, aaaa, aaaaa, aaaaaa

For example, [a-zA-Z0-9]{8} matches alphanumeric text of length 8.
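The quantifiers can be compared side by side with a small Python sketch. The names SAMPLES, PATTERNS and results are hypothetical helpers for this example; re.fullmatch() is used so that the whole string has to match.

```python
import re

SAMPLES = ["", "a", "aa", "aaa", "aaaaaa", "aaaaaaa"]
PATTERNS = [r"a*", r"a+", r"a?", r"a{3}", r"a{3,}", r"a{3,6}"]

# For each pattern, list the samples that it matches in full.
results = {
    p: [s for s in SAMPLES if re.fullmatch(p, s)]
    for p in PATTERNS
}

for pattern, matched in results.items():
    print(pattern, matched)

# Exactly eight alphanumeric characters.
eight = re.fullmatch(r"[a-zA-Z0-9]{8}", "abc12345") is not None
seven = re.fullmatch(r"[a-zA-Z0-9]{8}", "abc1234") is not None
print(eight, seven)
```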

Shorthands

Shorthands exist for commonly used character classes. For example, the digit character class [0-9] has the shorthand \d. Each lowercase shorthand has an uppercase counterpart with the opposite meaning. Thus \D matches any character that is not a digit, and is equivalent to [^\d].

  • \d a digit, short for [0-9]
  • \D a non-digit, short for [^0-9]
  • \w a word character, short for [a-zA-Z_0-9]
  • \W a non-word character, short for [^a-zA-Z_0-9]
  • \s a whitespace character (space, tab, line break etc.), short for [ \t\n\x0b\r\f]
  • \S a non-whitespace character, short for [^\s]
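A quick Python sketch of the shorthands on an invented sample string is given below. One caveat: for Python 3 text strings, \d, \w and \s also match non-ASCII digits, letters and spaces unless the re.ASCII flag is passed. The sample here is plain ASCII, so the results agree with the definitions above.

```python
import re

text = "Room 42,\tfloor_3!"

digits = re.findall(r"\d+", text)        # runs of digits
words = re.findall(r"\w+", text)         # runs of word characters
non_space = re.findall(r"\S+", text)     # runs of non-whitespace
collapsed = re.sub(r"\s+", " ", text)    # each whitespace run becomes one space

print(digits, words, non_space, collapsed)
```

The underscore counts as a word character, which is why floor_3 stays in one piece.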

Anchors

  • ^ start of the subject, ^a matches a only at the beginning of the subject
  • $ end of the subject, a$ matches a only at the end of the subject
  • \b word boundary, \ba\b matches a only when it is a whole word in the subject
  • \B non-boundary, \Ba\B matches a only when it has word characters on both sides, i.e. inside a longer word
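Anchors match positions rather than characters, which a short Python sketch can make visible by recording where each match starts. The sentence and the variable names are made up for this example.

```python
import re

# ^ and $ tie the match to the start or end of the subject.
at_start = re.search(r"^a", "abc") is not None
not_at_start = re.search(r"^a", "cba") is not None
at_end = re.search(r"a$", "cba") is not None

sentence = "a cat has a tail"

# Start offsets of "a" as a whole word.
whole_word = [m.start() for m in re.finditer(r"\ba\b", sentence)]

# Start offsets of "a" with a word character on both sides.
inside_word = [m.start() for m in re.finditer(r"\Ba\B", sentence)]

print(at_start, not_at_start, at_end, whole_word, inside_word)
```

The two lists of offsets do not overlap: the standalone words are found by \ba\b, while the letters inside cat, has and tail are found by \Ba\B.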

Examples

  • .at matches any three-character string ending with at e.g. hat, cat and bat
  • [hc]at matches hat or cat
  • [hc]?at matches hat or cat or at
  • 0[xX][A-Fa-f0-9]+ matches hexadecimal number e.g. 0x2AB7
  • [A-Za-z_][A-Za-z_0-9]* identifier in a programming language
  • cal[ae]nd[ae]r matches calendar as well as the misspellings calender, calandar and calander
  • colou?r matches color or colour
  • gr[ae]y matches grey or gray
  • \d+ matches unsigned integers e.g. 0, 42
  • -\d+ matches negative integers
  • -{0,1}\d+ matches integers with an optional minus sign, same as -?\d+
  • \d*\.{0,1}\d+ matches unsigned decimal numbers e.g. 3.14, .5 and 42
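Some of the example patterns can be combined into a tiny token classifier. The following is only a sketch in Python; the PATTERNS table and the classify() function are hypothetical names introduced here, and a real lexer would also need rules for resolving overlaps.

```python
import re

# Pattern strings taken from the examples above.
PATTERNS = {
    "hex": r"0[xX][A-Fa-f0-9]+",
    "identifier": r"[A-Za-z_][A-Za-z_0-9]*",
    "integer": r"-{0,1}\d+",
    "decimal": r"\d*\.{0,1}\d+",
}

def classify(token):
    """Return the names of all patterns that match the whole token."""
    return [name for name, pat in PATTERNS.items() if re.fullmatch(pat, token)]

for token in ["0x2AB7", "_count1", "-17", "42", "3.14", ".5", "5."]:
    print(token, classify(token))
```

A token such as 42 fits both the integer and the decimal pattern, while 5. fits none of them because the decimal pattern requires at least one digit after the optional dot.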