A .NET wrapper for Google's RE2 regular expression library
Re2.Net is a .NET wrapper for Google’s RE2 regular expression library. It has the benefit of also being a mostly drop-in replacement for .NET’s System.Text.RegularExpressions
namespace.
What does ‘drop-in replacement’ mean? Simple. If you have a project that uses the .NET Regex
, Match
, Group
, Capture
, or associated classes, you can add a reference to Re2.Net.dll
, replace using System.Text.RegularExpressions
with using Re2.Net
, and more often than not, It Just Works.
For a list of things that Re2.Net doesn’t support (yet), see below.
Re2.Net now targets .NET Framework 4.5.2 and is compiled using the Visual C++ 2015 runtime (VC140). It still compiles under .NET 3.5 and Visual C++ 2008 as well.
x86 and x64 platforms are both supported, but because the underlying RE2 library is unmanaged code, each platform requires its own DLL.
Re2.Net is now distributed under the same permissive 3-clause BSD license as RE2 itself. If for some reason those terms still don’t work for you, contact me at [email protected].
Building C++/CLI from source can be a finicky process, especially with older .NET versions (like 3.5) and VC++ compilers (like 2008). If you just want binaries, use the Download link above. The LICENSES file and XML documentation are included.
If you are interested in building from source, see the C++/CLI Notes and Lib Requirements sections at the top of Regex.h
. It all seems to work much better in recent versions of Visual Studio.
Russ Cox, one of the RE2 authors, has written a stellar series on the challenges of implementing regular expressions and the virtues of the RE2 approach. For a brief overview, see the description at the public RE2 repository.
In short, RE2 is automata-driven and uses NFAs and DFAs to find matches. This method guarantees linear running time and bounded memory consumption, but doesn’t support backreferences or generalized assertions (lookahead and lookbehind).
.NET, on the other hand, uses a typical backtracking implementation. This method guarantees neither running time (whence the Regex.MatchTimeout
property) nor memory consumption, and can even result in catastrophic failure, but it allows for backreferences and generalized assertions.
All IsMatch()
, Match()
, and Matches()
methods accept byte[]
in addition to string
arguments. Note that when matching a byte[]
, Capture.Index
and Capture.Length
refer to the byte[]
, and not the string
returned by Capture.Value
. In practice, this means that
Debug.Assert(Regex.Match("水DŽ", "水DŽ").Length == 2);
Debug.Assert(Regex.Match(Encoding.UTF8.GetBytes("水DŽ"), "水DŽ").Length == 5);
will both pass, and in both cases Capture.Value
is "水DŽ"
.
Re2.Net adds RegexOptions
for explicit support of Latin-1 and ASCII, as well as a POSIX mode. See Different in Re2.Net for more about encodings.
By combining byte[]
arguments with RegexOptions.Latin1
(or RegexOptions.ASCII
, for byte values 0-127), Re2.Net provides true byte-level searching.
Because RE2 is automata-driven, in Re2.Net Regex
memory consumption is configurable using the maxMemory
constructor parameter.
Re2.Net constructors and static matching methods have no matchTimeout
parameter, because RE2 is immune to the problem (catastrophic failure) that the matchTimeout
parameter solves. The Regex.MatchTimeout
property has accordingly been removed.
RE2 accepts most, but not all, .NET Regex syntax. For the full list, see the RE2 Syntax page. One notable difference between RE2 and .NET Regex is the format for named capturing groups. RE2 uses (?P<name>regex)
, while .NET Regex allows both (?<name>regex)
and (?'name'regex)
. Neither accepts the other’s syntax.
RE2 doesn’t allow Unicode in named capturing groups. (?<水>regex)
is legal for .NET Regex, but RE2 won’t accept it (or rather, won’t accept (?P<水>regex)
).
RE2 only retains the final Capture
of each Group
. In other words, Regex.Match("abcd", "(ab|cd)+")
captures "abcd"
for the Match
, but only "cd"
for the Group
. In .NET Regex, the Group
would hold two captures: "ab"
and "cd"
. (Of course, 99% of the time you only care about capturing the Match
as a whole. Re2.Net provides RegexOptions.SingleCapture
for just such occasions.)
Although an implementation detail, it should be noted that RE2 only accepts UTF-8 and Latin-1 encodings, while .NET encodes all strings in UTF-16. Re2.Net resolves this discrepancy by converting the UTF-16 .NET strings to UTF-8, Latin-1, or ASCII (a subset of Latin-1), depending on the chosen RegexOptions
. In general there’s no need to give any thought to this, but in the specific case of searching a byte[]
for a UTF-16 pattern, Re2.Net will fail to find a match because the UTF-16 pattern has been converted to UTF-8 while the byte[]
has not. In the future I may add RegexOptions.UTF16
and RegexOptions.UTF8
to allow users to make their intent clear, and, if RegexOptions.UTF16
is combined with a byte[]
argument, throw an exception. In the meantime, avoid combining byte[]
arguments with patterns that you require to be UTF-16-encoded.
Some RegexOptions
from .NET Regex are not present in Re2.Net (and vice-versa).
The Regex.Replace()
and Match.Result()
methods are not yet supported, but will be.
Serialization is not currently supported. RE2 constructs its automata more efficiently than a .NET Regex can be deserialized, so there isn’t much reason to add this other than as a compatibility layer.
Re2.Net depends on native code and won’t run anywhere that the native code can’t run (Silverlight, Mono, etc.).
Tests. A small C# benchmarking project is included in the source, but thorough tests haven’t been written yet.
Everything not listed above should line right up. Here are a few examples of .NET Regex quirks that Re2.Net emulates:
// Five matches, each with a value of "".
Debug.Assert(Regex.Matches("xxxx", "").Count == 5);
// Accessing an array (technically, the default Item property) with a negative index? Yup!
Debug.Assert(Regex.Matches("xxxx", "")[0].Groups[-1].Value == "");
// The returned Match containes a Group collection, and the first item in the Group collection is the Match.
var match = Regex.Match("abcd", "abcd");
Debug.Assert((Group)match == match.Groups[0]);
The Re2.Net.Test project pits Re2.Net against .NET Regex where each is strongest: searching strings, for .NET Regex, and for Re2.Net, searching bytes. The testing method is as follows:
mtent12.txt
) is loaded as a byte array for Re2.Net and then encoded as a .NET string for .NET Regex.Regex.Match()
) and once for finding all matches (Regex.Matches()
).A direct string-to-string comparison is also now included.
Twain
(First Match) demonstrates the cost of RE2’s setup and switching between managed and unmanaged code. Because the first instance of Twain
ends a mere 64 positions into the input, .NET Regex manages to find it before Re2.Net can even call into the underlying RE2 library. Search for Twain
(All Matches), on the other hand, and Re2.Net recovers nicely, running twice as fast as .NET Regex.IsMatch(string)
and Match(string)
it may only make sense to use Re2.Net with more complex expressions. Certainly this is the case with mtent12.txt
, which is 20 MB. For Matches(string)
, on the other hand, Re2.Net still wins more often than not, and by large margins.Conclusion: Re2.Net excels at searching raw data, like files or scraped web pages. For simple expressions and very long inputs that already exist as strings, .NET Regex may still be the better option (assuming linear running time, bounded memory consumption, and immunity to pathological expressions aren’t considerations). When searching strings rather than raw data, the characteristics of both the expression and the likely inputs should be taken into account before deciding which implementation to use.
===
Regular Expression | Re2.Net | .NET Regex | Winner |
---|---|---|---|
Twain |
0.05 ms | 0.003 ms | .NET Regex by 16.7x |
^Twain |
22 ms | 4.7 ms | .NET Regex by 4.6x |
Twain$ |
11 ms | 26 ms | Re2.Net by 2.4x |
Huck[a-zA-Z]+|Finn[a-zA-Z]+ |
20 ms | 122 ms | Re2.Net by 6.1x |
a[^x]{20}b |
0.5 ms | 0.007 ms | .NET Regex by 69.9x |
Tom|Sawyer|Huckleberry|Finn |
0.2 ms | 1.3 ms | Re2.Net by 5.7x |
.{0,3}(Tom|Sawyer|Huckleberry|Finn) |
0.7 ms | 20 ms | Re2.Net by 26.2x |
[a-zA-Z]+ing |
0.05 ms | 0.05 ms | Re2.Net by 1.1x |
[a-zA-Z]{0,4}ing[a-zA-Z] |
0.2 ms | 2.3 ms | Re2.Net by 9.6x |
[a-zA-Z]+ing$ |
107 ms | 7300 ms | Re2.Net by 68.2x |
[1]{5,}$ |
110 ms | 4001 ms | Re2.Net by 36.4x |
^.{16,20}$ |
0.4 ms | 0.3 ms | .NET Regex by 1.3x |
([a-f](.[d-m].){0,2}[h-n]){2} |
0.2 ms | 0.02 ms | .NET Regex by 9.1x |
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z] |
2.9 ms | 135 ms | Re2.Net by 46.2x |
“[^”]{0,30}[?!.]" |
0.8 ms | 0.1 ms | .NET Regex by 5.6x |
Tom.{10,25}river|river.{10,25}Tom |
46 ms | 332 ms | Re2.Net by 7.2x |
===
Regular Expression | Re2.Net | .NET Regex | Winner |
---|---|---|---|
Twain |
14 ms | 28 ms | Re2.Net by 1.9x |
^Twain |
121 ms | 26 ms | .NET Regex by 4.7x |
Twain$ |
11 ms | 27 ms | Re2.Net by 2.5x |
Huck[a-zA-Z]+|Finn[a-zA-Z]+ |
115 ms | 651 ms | Re2.Net by 5.7x |
a[^x]{20}b |
578 ms | 436 ms | .NET Regex by 1.3x |
Tom|Sawyer|Huckleberry|Finn |
127 ms | 689 ms | Re2.Net by 5.4x |
.{0,3}(Tom|Sawyer|Huckleberry|Finn) |
133 ms | 21152 ms | Re2.Net by 158.9x |
[a-zA-Z]+ing |
215 ms | 7195 ms | Re2.Net by 33.4x |
[a-zA-Z]{0,4}ing[a-zA-Z] |
122 ms | 3104 ms | Re2.Net by 25.4x |
[a-zA-Z]+ing$ |
110 ms | 7545 ms | Re2.Net by 68.3x |
[2]{5,}$ |
114 ms | 4129 ms | Re2.Net by 36.3x |
^.{16,20}$ |
134 ms | 3775 ms | Re2.Net by 28.2x |
([a-f](.[d-m].){0,2}[h-n]){2} |
378 ms | 3967 ms | Re2.Net by 10.5x |
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z] |
123 ms | 5691 ms | Re2.Net by 46.3x |
“[^”]{0,30}[?!.]" |
29 ms | 188 ms | Re2.Net by 6.4x |
Tom.{10,25}river|river.{10,25}Tom |
123 ms | 921 ms | Re2.Net by 7.5x |
===
Regular Expression | Re2.Net | .NET Regex | Winner |
---|---|---|---|
Twain |
108 ms | 0.010 ms | .NET Regex by 11290.9x |
^Twain |
136 ms | 5 ms | .NET Regex by 25.5x |
Twain$ |
141 ms | 28 ms | .NET Regex by 5.0x |
Huck[a-zA-Z]+|Finn[a-zA-Z]+ |
137 ms | 122 ms | .NET Regex by 1.1x |
a[^x]{20}b |
110 ms | 0.01 ms | .NET Regex by 8826.6x |
Tom|Sawyer|Huckleberry|Finn |
111 ms | 1.2 ms | .NET Regex by 92.3x |
.{0,3}(Tom|Sawyer|Huckleberry|Finn) |
106 ms | 23 ms | .NET Regex by 4.7x |
[a-zA-Z]+ing |
112 ms | 0.06 ms | .NET Regex by 1861.7x |
[a-zA-Z]{0,4}ing[a-zA-Z] |
108 ms | 1.8 ms | .NET Regex by 58.6x |
[a-zA-Z]+ing$ |
237 ms | 7101 ms | Re2.Net by 30.0x |
[3]{5,}$ |
230 ms | 3431 ms | Re2.Net by 14.9x |
^.{16,20}$ |
109 ms | 0.3 ms | .NET Regex by 417.2x |
([a-f](.[d-m].){0,2}[h-n]){2} |
108 ms | 0.04 ms | .NET Regex by 2859.1x |
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z] |
114 ms | 145 ms | Re2.Net by 1.3x |
“[^”]{0,30}[?!.]" |
107 ms | 0.2 ms | .NET Regex by 608.6x |
Tom.{10,25}river|river.{10,25}Tom |
181 ms | 323 ms | Re2.Net by 1.8x |
===
Regular Expression | Re2.Net | .NET Regex | Winner |
---|---|---|---|
Twain |
188 ms | 30 ms | .NET Regex by 6.3x |
^Twain |
276 ms | 28 ms | .NET Regex by 9.8x |
Twain$ |
130 ms | 29 ms | .NET Regex by 4.5x |
Huck[a-zA-Z]+|Finn[a-zA-Z]+ |
265 ms | 648 ms | Re2.Net by 2.4x |
a[^x]{20}b |
811 ms | 385 ms | .NET Regex by 2.1x |
Tom|Sawyer|Huckleberry|Finn |
269 ms | 711 ms | Re2.Net by 2.6x |
.{0,3}(Tom|Sawyer|Huckleberry|Finn) |
320 ms | 24299 ms | Re2.Net by 75.9x |
[a-zA-Z]+ing |
378 ms | 6870 ms | Re2.Net by 18.2x |
[a-zA-Z]{0,4}ing[a-zA-Z] |
285 ms | 2495 ms | Re2.Net by 8.8x |
[a-zA-Z]+ing$ |
226 ms | 7070 ms | Re2.Net by 31.3x |
[4]{5,}$ |
221 ms | 3397 ms | Re2.Net by 15.4x |
^.{16,20}$ |
284 ms | 2982 ms | Re2.Net by 10.5x |
([a-f](.[d-m].){0,2}[h-n]){2} |
634 ms | 4350 ms | Re2.Net by 6.9x |
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z] |
313 ms | 5613 ms | Re2.Net by 17.9x |
“[^”]{0,30}[?!.]" |
196 ms | 209 ms | Re2.Net by 1.1x |
Tom.{10,25}river|river.{10,25}Tom |
251 ms | 882 ms | Re2.Net by 3.5x |