I am pleased to announce the release of a Regular Expression library for the Arduino and other applications where a compact library is required.
The examples given are for the Arduino IDE, but the library itself does not use any specific Arduino function calls. It could be used in PIC processors, or anywhere (like a C++ program) where you simply want a compact, fast, powerful regular expression matcher.
What are regular expressions?
If you haven't used them before, they are incredibly powerful ways of parsing text in strings. For example, you can look for a word starting with an upper-case letter, a hex string, a number with an optional leading minus sign, and so on.
They provide a somewhat easier and more powerful way of breaking up strings than using library functions like strtok, strcmp, strstr etc.
Download
The library can be downloaded from:
http://gammon.com.au/Arduino/Regexp.zip (9Kb).
[EDIT] Also at: https://github.com/nickgammon/Regexp
Add the file Regexp.cpp to your project. The "include" file Regexp.h has the relevant declarations in it. If you are using this on the Arduino the file keywords.txt has the library keywords in it.
To install on the Arduino, simply unzip the folder into your "libraries" folder.
[EDIT]
Uploaded version 1.2 on 19th May 2011. This now has extra functions for doing replacements, and global (multiple) finds, and counting how many matching strings there are.
Design
My design criteria for this library were:
- Be powerful enough to be useful (ie. not a "toy")
- Be fast enough that it could be used to do things like process input from GPS devices, RFID readers, web pages etc.
- Be compact enough that it doesn't use most of the available memory on a microprocessor
I believe these have been met, as follows:
Power
The library processes regular expressions which are identical in syntax to those used by the Lua string.match, string.find and similar functions. This is because the code is adapted from the Lua code written by Roberto Ierusalimschy. It has been adapated enough to make it work outside the Lua structure basically.
Lua regular expressions are well-known and well-documented. Their power is such that (for example) very extensive add-ons for the World of Warcraft game are written in Lua, and use the regular expression matching to break up incoming data from the server.
My own documentation for Lua regular expressions is here:
http://www.gammon.com.au/scripts/doc.php?lua=string.find
You can not only match strings like "%d+" (for one or more digits) but you can specify "captures" which means each captured substring has its position returned, so you can easily extract it out from the original string.
There are some examples here too:
http://gammon.com.au/forum/?id=6034
That page mentions the string.find and string.match in Lua, but the part about the patterns themselves would apply.
Speed
The Lua regular expression matcher has been well-regarded for its speed, and this library performs well too. For example:
String to parse: "Testing: answer=42"
Regular expression: "(%a+)=(%d+)"
Time taken to match: around 2 milliseconds on an Arduino microprocessor running at 16 MHz.
This test returned the matching text ("answer=42"), its length, plus the two captures ("answer" and "42").
match start was 9
match length was 9
Match text: 'answer=42'
Captures: 2
Capture number: 1
Text: 'answer'
Capture number: 2
Text: '42'
This shows how easily you can use regular expressions to parse incoming text (eg. GPS data in the form keyword=value).
Size
The library takes about 2392 bytes on the Arduino. For example a minimal test would be:
#include <Regexp.h>
void setup ()
{
MatchState ms;
ms.Target ("cat"); // what to search
char result = ms.Match ("a"); // look for "a"
} // end of setup
void loop () {}
This compiles to be 2842 bytes. However an "empty" sketch is 450 bytes, so the regular expression library has added 2392 bytes (2.33 Kb).
I believe this is an acceptable length. This is around 7% of the memory on a 32 Kb device. You can reduce the memory slightly by reducing the number of captures it supports (currently 32). Alternatively, if you need to do dozens of captures you can do that by changing one define, at the cost of 4 bytes per capture.
Error handling
The library "throws" exceptions by doing a non-local goto (longjmp), in exactly the same way Lua does. This keeps the code compact and simple. If there is a parsing problem then the library returns a negative number as the result of the regexp call. You can interpret those to tidy up your regular expressions to make them work properly.
It does not use C++ exceptions because they are not supported by the Arduino IDE.
Usage
You need to include the library:
Since, unlike Lua, functions cannot return multiple results (eg. all the captures) the MatchState structure is used to communicate with the library. You start by setting up the string to be searched, and its length:
MatchState ms;
ms.Target ("Testing: answer=42");
You can supply either a zero-terminated string (like the above) or a char buffer and a length.
Then you call the Match method of the MatchState structure, supplying the regular expression string itself, and an zero-relative offset to commence searching from. The function returns 1 on a match, 0 on no match, and a negative number for a parsing error.
char result = ms.Match ("(%a+)=(%d+)");
if (result == REGEXP_MATCHED)
{
// matching offsets in ms.capture
}
else if (result == REGEXP_NOMATCH)
{
// no match
}
else
{
// some sort of error
}
The meanings of the various error codes are defined in Regexp.h.
If and only if you get a REGEXP_MATCHED result (that is, 1) then the captures array in the MatchState structure is set up to indicate what the address and length of each capture substring is. You can use that information to index into your supplied string and extract out the substrings, if required.
For example:
char buf [100]; // large enough to hold expected string
Serial.print ("Captures: ");
Serial.println (ms.level);
for (int j = 0; j < ms.level; j++)
{
Serial.print ("Capture number: ");
Serial.println (j, DEC);
Serial.print ("Text: '");
Serial.print (ms.GetCapture (buf, j));
Serial.println ("'");
} // end of for each capture
Also the matching text itself (the whole match) is stored as an offset and length. You could index into your original string to extract out the matching text, if required. It may not be required. You may only need to know if a match occurred, or not.
char buf [100]; // large enough to hold expected string
Serial.print ("Matched on: ");
Serial.println (ms.GetMatch (buf));
The library does not attempt to "pre-extract" those strings for you on the grounds that to do so would take extra time and memory, which you, the user of the library, may not care to expend.
MatchState methods
void MatchState::Target (char * s);
void MatchState::Target (char * s, const unsigned int len);
These let you supply the string to be searched (the target string). It can be null-terminated, in which case the library finds the end by doing a strlen on it, or you supply the length. If you have built up a buffer from incoming text you may prefer to just supply the length. You can also provide the target string on the MatchState constructor.
char MatchState::Match (const char * pattern, unsigned int index = 0);
This performs the match based on the supplied null-terminated pattern, and starting at the supplied index into the target string. By modifying the index parameter you can re-match further and further through the same target string, perhaps to keep finding the same sort of string (eg. a word).
The result of the match will be > 0 if successful, 0 if no match, and < 0 if an error occurred (invalid regular expression).
char * MatchState::GetMatch (char * s) const;
After a successful match, this copies the matching string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on MatchLength which is calculated during the match. If no successful match was previously done, then an empty string is copied.
The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.
char * MatchState::GetCapture (char * s, const int n) const;
After a successful match, this copies the specified capture string from the target buffer to another memory location, with a null-terminator. Thus you must allocate enough memory to hold the matching string, plus one for the 0x00 byte at the end. You could either statically allocate a buffer (as in the examples above) or do a malloc based on capture [n].len which is calculated during the match. If no successful match was previously done, or this capture does not exist, then an empty string is copied.
The supplied buffer is also returned from the function so you can directly use it in a Serial.println function or similar.
The variable "levels" in the MatchState object tells you how many captures were found. This could be zero if you did not use round brackets in the regular expression.
The match number is zero-relative, so the first match is match number 0.
unsigned int MatchCount (const char * pattern);
This applies the pattern to the target buffer repeatedly, and returns the count of matches (which could be zero of course). This could be used, for example, to count how many spaces are in a string.
unsigned int GlobalMatch (const char * pattern, GlobalMatchCallback f);
This applies the pattern to the target buffer repeatedly, and for each match calls the function f which must be declared like this:
void fname (const char * match, // matching string (not null-terminated)
const unsigned int length, // length of matching string
const MatchState & ms); // MatchState in use (to get captures)
This function gets a pointer to the matching string in the actual target buffer, thus it will not be null-terminated. You need to use the length as well to work out how long it is. You could, for example, memcpy or memcmp the string. If you want to get the captures for each match then use the MatchState variable in the way you would for ordinary matches.
This function could be used to break a string up into words (for example) and for each word do something with it.
unsigned int GlobalReplace (const char * pattern, GlobalReplaceCallback f, const unsigned int max_count = 0);
This applies the pattern to the target buffer repeatedly, and for each match calls the function f which must be declared like this:
void fname (const char * match, // matching string (not null-terminated)
const unsigned int length, // length of matching string
char * & replacement, // what to replace with
unsigned int & replacement_length, // how long the replacement is
const MatchState & ms); // MatchState in use (to get captures)
This function gets a pointer to the matching string in the actual target buffer, thus it will not be null-terminated. You need to use the length as well to work out how long it is. If the function does nothing, then the original string will be retained (thus nothing will change).
However the function can set the replacement variable to a new string, and the replacement_length variable to a new length. The new string must still be in scope when the function returns. Thus, you could use a static variable, or a literal string.
If max_count is zero then as many replacements as possible are done, otherwise it stops after doing max_count replacements.
unsigned int GlobalReplace (const char * pattern, char * replacement, const unsigned int max_count = 0);
This applies the pattern to the target buffer repeatedly, and for each match replaces the matched string with the replacement string. This can be used for simple replacements, like replacing multiple spaces with a single space.
If max_count is zero then as many replacements as possible are done, otherwise it stops after doing max_count replacements.
For both replacement functions the replacements are done "in situ". Thus the buffer holding the original string must be large enough for the replacement, especially if you replace with something longer than the original.
Also a null-terminator is appended to the string for convenience, even if there wasn't one there to begin with. The buffer would need room for that. |