A regular expression, commonly abbreviated as regex or regexp, is a sequence of alphanumeric characters and symbols from the ASCII character set that forms a text-matching template.
At its simplest, a regexp can be used for case-sensitive searching. Most characters from the ASCII character set match themselves; hence, Glucose and glucose match themselves, but not each other.
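As a minimal sketch of this behaviour, assuming Python's standard re module (the text above does not name a particular regexp implementation, and the sample sentence is made up for illustration):

```python
import re

# Hypothetical sample text, used only for illustration.
text = "blood glucose level"

# Ordinary characters match themselves, and matching is case-sensitive by default.
same_case = re.search("glucose", text)
other_case = re.search("Glucose", text)

print(same_case is not None)   # the lowercase pattern is found
print(other_case is not None)  # the capitalised pattern is not
```

Here same_case is a match object and other_case is None, because the capital G does not match the lowercase g.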
The exceptions are the reserved characters, which are
$ . ^ * ? ( ) \ < > { } [ ] + -
To search for these characters, each must be preceded by a backslash \, a process referred to as escaping:
\$ \. \^ \* \? \( \) \\ \< \> \{ \} \[ \] \+ \-
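The effect of escaping can be sketched as follows, again assuming Python's re module; which characters are actually reserved varies between regexp implementations, and the sample strings are invented for illustration. The helper re.escape is part of Python's standard library and adds the backslashes automatically.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Cost: $5.00 (approx.)"

# Escaped reserved characters are matched literally.
dollar = re.search(r"\$", text)
paren = re.search(r"\(approx\.\)", text)

# Unescaped, the full stop matches any character, so 5.00 also matches 5x00.
loose = re.search(r"5.00", "5x00")
strict = re.search(r"5\.00", "5x00")

# re.escape builds an escaped pattern from a literal string.
auto = re.search(re.escape("$5.00"), text)

print(dollar.start(), paren.group(), loose is not None, strict is None, auto.group())
```

Raw strings (the r prefix) are used so that Python itself does not interpret the backslashes before the regexp engine sees them.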
To search for zero or more occurrences of a character in a text string, append the * symbol to that character. The regexp
a*
would match the null string, as well as a, aa, aaa, etc. Similarly, the regexp
a+
would match one or more occurrences of a, thus: a, aa, aaa, etc. Finally, the regexp
ab?
would specify that a is optionally followed by a b, thus matching a or ab.
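The three repetition symbols can be compared side by side. This is a sketch assuming Python's re module; matches_fully is a hypothetical helper introduced here, and it uses re.fullmatch so that the whole candidate string has to match the pattern.

```python
import re

def matches_fully(pattern, candidate):
    """Return True if the entire candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

# a* accepts zero or more a's, including the empty string.
star = [s for s in ["", "a", "aaa", "b"] if matches_fully(r"a*", s)]

# a+ requires at least one a.
plus = [s for s in ["", "a", "aaa", "b"] if matches_fully(r"a+", s)]

# ab? accepts a, optionally followed by a single b.
optional = [s for s in ["a", "ab", "abb"] if matches_fully(r"ab?", s)]

print(star)      # ['', 'a', 'aaa']
print(plus)      # ['a', 'aaa']
print(optional)  # ['a', 'ab']
```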
The full stop, or period, symbol matches any single character except a NEWLINE.
Therefore, the regexp
.*
matches zero or more occurrences of any character. This is a powerful regexp, which should be used with caution since it can match more characters than anticipated.
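One way the regexp .* can match more than anticipated is sketched below, assuming Python's re module, where repetition is greedy by default (it takes as many characters as it can). The sample line and its field names are invented for illustration.

```python
import re

# Hypothetical sample line, used only for illustration.
line = "name=ALDH; id=42; note=test"

# The intent might be to capture up to the first semicolon, but .* is greedy
# and runs on to the last semicolon in the line.
greedy = re.search(r"name=.*;", line).group()

# Restricting the repeated characters to "anything except a semicolon"
# stops the match at the first semicolon.
limited = re.search(r"name=[^;]*;", line).group()

# The full stop does not match a newline, so .* stops at the end of the line.
first_line = re.search(r".*", "first\nsecond").group()

print(greedy)      # name=ALDH; id=42;
print(limited)     # name=ALDH;
print(first_line)  # first
```

The [^;] form uses a negated set in square brackets, which is a feature of Python's re module not covered in the text above.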
[ ] -
Square brackets are used to denote ranges of characters; thus [a-z] matches any single lowercase letter; [A-Z] matches any single capital letter; [A-Zabc] matches any single capital letter or any of the lowercase letters a, b or c (but not d through z); [A-Za-z] matches any single letter; and [0-9] matches any single digit. Ranges can also be combined with the repetition symbols *, + and ?, so that, for example, [0-9]+ matches one or more digits. An EC number consists of four numbers separated by full stops (for example, 1.1.1.1). Since the full stop is a reserved character, it must be escaped, so to match an EC number the following regexp could be used:
[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
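The EC number regexp above can be tried out as follows. This is a sketch assuming Python's re module; the sample sentence and the variable names are illustrative only.

```python
import re

# The EC number regexp from the text, compiled once for reuse.
ec_pattern = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

# Hypothetical sample sentence, used only for illustration.
text = "Alcohol dehydrogenase (EC 1.1.1.1) and trypsin (EC 3.4.21.4)."
found = ec_pattern.findall(text)

# Without the backslashes, each full stop would match any character.
unescaped = re.search(r"[0-9]+.[0-9]+.[0-9]+.[0-9]+", "1a1b1c1")
escaped = ec_pattern.search("1a1b1c1")

# A range in square brackets matches exactly one character from the set.
in_set = re.fullmatch(r"[A-Zabc]", "b") is not None
not_in_set = re.fullmatch(r"[A-Zabc]", "d") is not None

print(found)  # ['1.1.1.1', '3.4.21.4']
print(unescaped is not None, escaped is None, in_set, not_in_set)
```

Note that this regexp only checks the shape of the string; it does not check that the number corresponds to a real enzyme classification.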