
Regular expression matching

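Regular expressions are patterns that combine ordinary text, which matches itself literally, with metacharacters. Metacharacters are special characters or sequences that match text that is not fixed, such as any single character, a repeated sequence, or one of several alternatives.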

MUSHclient uses the Perl-Compatible Regular Expression (PCRE) matching engine, written by Philip Hazel. Its HTML documentation can be found at http://mushclient.com/pcre/ - in particular the syntax for regular expressions, discussed in more detail than in this document, can be found at http://mushclient.com/pcre/pcrepattern.html (Regular expressions supported by PCRE).

For a lengthy discussion about regular expression syntax, see also the forum post: Regular Expression Tips.


Simple example:


You see (.*)

would match 'You see (anything)'.

The brackets in the above example mean the 'anything' sequence, whatever it is, is captured as 'wildcard 1', which can then be used in a trigger, alias or script.
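
As a rough illustration of how the captured text becomes 'wildcard 1', here is a minimal sketch using Python's `re` module rather than MUSHclient itself. Python's engine is not PCRE, but it should treat this simple pattern the same way. The sample line and the variable names are invented for the example.

```python
import re

# Hypothetical line of MUD output, invented for this example.
line = "You see a rusty sword"

# The parentheses capture whatever '.*' matched as group 1, which
# plays the role of 'wildcard 1' (%1) in a MUSHclient trigger.
match = re.search(r"You see (.*)", line)
wildcard_1 = match.group(1) if match else None

print(wildcard_1)  # prints: a rusty sword
```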


You can match more precisely, for example:


You see a (kobold|bear|gorilla)

matches one of those three specific words.
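
The alternation can be tried outside MUSHclient with a short sketch like the one below. It uses Python's `re` module as a stand-in for PCRE, and the helper name `creature_seen` and the test lines are assumptions made for the example.

```python
import re

# Same pattern as above: the group captures whichever alternative matched.
pattern = re.compile(r"You see a (kobold|bear|gorilla)")

def creature_seen(line):
    """Return the captured creature name, or None if the line does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None

print(creature_seen("You see a bear"))    # prints: bear
print(creature_seen("You see a dragon"))  # prints: None
```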


You can also do things like testing for something that does not match, for example:


You see a (kobold|bear|gorilla)(?! in the next room)

This matches 'You see a kobold' but not 'You see a kobold in the next room'.
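
The effect of the negative lookahead can be checked with a small sketch, again using Python's `re` module as a stand-in for PCRE. The function name `sees_creature_here` and the sample lines are hypothetical.

```python
import re

# '(?! in the next room)' rejects the match when that exact text follows
# the creature name; it does not become part of the matched text.
pattern = re.compile(r"You see a (kobold|bear|gorilla)(?! in the next room)")

def sees_creature_here(line):
    """Return True if the line matches the pattern anywhere."""
    return pattern.search(line) is not None

print(sees_creature_here("You see a kobold"))                   # prints: True
print(sees_creature_here("You see a kobold in the next room"))  # prints: False
```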


Table of metacharacters

Character Description
\ Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(".
^ Matches the position at the beginning of the input string.
$ Matches the position at the end of the input string.
* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". Also, '(food)*' matches 'food', 'foodfood', 'foodfoodfood' and so on. '*' is equivalent to '{0,}'.
+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". '+' is equivalent to '{1,}'.
? Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". '?' is equivalent to '{0,1}'.
{n} n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,} n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.
? When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all 'o's.
. (period) Matches any single character.
(pattern) Matches pattern and captures the match. The captured match can be retrieved from the results by using the "wildcards" argument in triggers or aliases, or by using %1 through to %9 in the replacement text. To match parentheses characters ( ), use '\(' or '\)'.
(?:pattern) Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'. It is also useful in situations where you need to match something, but are running out of wildcards.
(?=pattern) Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'foo(?=bar|bah)' matches the "foo" in "foobar" or "foobah" but not in "food". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.
(?!pattern) Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example, 'foo(?!bar)' matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern (?!foo)bar does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always true when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect. Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.
(?<=pattern) Positive lookbehind matches the search string at any point that is immediately preceded by a string matching pattern. For example, '(?<=foo)bar' finds an occurrence of "bar" that is preceded by "foo". The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus '(?<=bullock|donkey)' is permitted, but '(?<=dogs?|cats?)' causes an error at compile time.
(?<!pattern) Negative lookbehind matches the search string at any point that is not immediately preceded by a string matching pattern. For example, '(?<!foo)bar' finds an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus '(?<!bullock|donkey)' is permitted, but '(?<!dogs?|cats?)' causes an error at compile time.
x|y Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food". '(fish|chips|sauce)' matches 'fish', 'chips', or 'sauce'.
[xyz] A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz] A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z] A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'.
[^a-z] A negative range of characters. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'.
\b Matches a word boundary, that is, the position between a word character and a nonword character (such as a space), or between a word character and the start or end of the string. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character. The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
\d Matches a digit character. Equivalent to [0-9].
\D Matches a nondigit character. Equivalent to [^0-9].
\f Matches a form-feed character. Equivalent to \x0c and \cL.
\n Matches a newline character. Equivalent to \x0a and \cJ. Note that in MUSHclient newline characters will not occur in triggers or aliases as they have already been extracted prior to trigger/alias matching.
\r Matches a carriage return character. Equivalent to \x0d and \cM.
\s Matches any whitespace character including space, tab, form-feed, etc. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab character. Equivalent to \x09 and \cI.
\v Matches a vertical tab character. Equivalent to \x0b and \cK.
\w Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'.
\W Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'.
\xn Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". Allows ASCII codes to be used in regular expressions.
\num Matches num, where num is a positive integer. A reference back to captured matches. For example, '(.)\1' matches two consecutive identical characters.
\n Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7).
\nm Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by literal m. If neither of the preceding conditions exists, \nm matches octal escape value nm when n and m are octal digits (0-7).
\nml Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
[[:alnum:]] A character set matching letters and digits. Use [[:^alnum:]] to match anything other than letters and digits.
[[:alpha:]] A character set matching letters. Use [[:^alpha:]] to match anything other than letters.
[[:ascii:]] A character set matching character codes 0 - 127. Use [[:^ascii:]] to match anything other than character codes 0 - 127.
[[:cntrl:]] A character set matching control characters. Use [[:^cntrl:]] to match anything other than control characters.
[[:digit:]] A character set matching digits (same as \d). Use [[:^digit:]] to match anything other than digits.
[[:graph:]] A character set matching printing characters, excluding space. Use [[:^graph:]] to match anything other than these.
[[:lower:]] A character set matching lower case letters. Use [[:^lower:]] to match anything other than lower case letters.
[[:print:]] A character set matching printing characters, including space. Use [[:^print:]] to match anything other than these.
[[:punct:]] A character set matching printing characters, excluding letters and digits. Use [[:^punct:]] to match anything other than these.
[[:space:]] A character set matching white space (same as \s). Use [[:^space:]] to match anything other than white space.
[[:upper:]] A character set matching upper case letters. Use [[:^upper:]] to match anything other than upper case letters.
[[:word:]] A character set matching "word" characters (same as \w). Use [[:^word:]] to match anything other than "word" characters.
[[:xdigit:]] A character set matching hexadecimal digits. Use [[:^xdigit:]] to match anything other than hexadecimal digits.
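
A few of the table entries above can be checked with a short script. This is only a sketch using Python's `re` module, which is not PCRE. It is limited to constructs that both engines are expected to handle the same way, so POSIX classes such as [[:alnum:]] and the \cx escape, which Python's `re` does not support, are left out. All sample strings and variable names are invented for the example.

```python
import re

# Greedy versus non-greedy: 'o+' takes all four o's, 'o+?' stops after one.
greedy = re.match(r"o+", "oooo").group(0)
non_greedy = re.match(r"o+?", "oooo").group(0)

# Non-capturing group: '(?:y|ies)' groups the alternatives without
# using up a wildcard, so findall returns the whole matches.
industry_words = re.findall(r"industr(?:y|ies)", "industry and industries")

# Positive lookahead: "foo" matches only where "bar" or "bah" follows.
sample = "foobar food foobah"
lookahead_starts = [m.start() for m in re.finditer(r"foo(?=bar|bah)", sample)]

# Negative lookbehind finds "bar" not preceded by "foo", whereas the
# similar-looking '(?!foo)bar' matches every "bar".
text = "foobar rebar"
lookbehind_starts = [m.start() for m in re.finditer(r"(?<!foo)bar", text)]
misplaced_starts = [m.start() for m in re.finditer(r"(?!foo)bar", text)]

# Backreference: '(.)\1' matches the first pair of identical characters.
doubled = re.search(r"(.)\1", "bookkeeper").group(0)

print(greedy, non_greedy)   # prints: oooo o
print(industry_words)       # prints: ['industry', 'industries']
print(lookahead_starts)     # prints: [0, 12]
print(lookbehind_starts)    # prints: [9]
print(misplaced_starts)     # prints: [3, 9]
print(doubled)              # prints: oo
```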
Written by Nick Gammon

Comments to Gammon Software support

Page updated on Tuesday, 6 December 2005