PHP extension for RE2, "an efficient, principled regular expression library"
re2 is a PHP extension which provides an interface to Google’s RE2 regular-expression library.
Backtracking engines are typically full of features and convenient syntactic sugar but can be forced into taking exponential amounts of time on even small inputs. RE2 uses automata theory to guarantee that regular expression searches run in time linear in the size of the input. RE2 implements memory limits, so that searches can be constrained to a fixed amount of memory. RE2 is engineered to use a small fixed C++ stack footprint no matter what inputs or regular expressions it must process; thus RE2 is useful in multithreaded environments where thread stacks cannot grow arbitrarily large.
On large inputs, RE2 is often much faster than backtracking engines; its use of automata theory lets it apply optimizations that the others cannot.
Unlike most automata-based engines, RE2 implements almost all the common Perl and PCRE features and syntactic sugars. It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently. The syntax page gives full details.
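Because backreferences are unavailable, a pattern such as PCRE's `\b(\w+)\s+\1\b` has to be split into a plain match plus a comparison in PHP. The following is a minimal sketch of that idea, not part of the extension: `find_doubled_words()` is a hypothetical helper, and it falls back to `preg_match_all()` when the re2 extension is not loaded so that it still runs.

```php
<?php
// Hypothetical helper (not part of the extension): find words that are
// immediately repeated, without using a backreference such as \1.
function find_doubled_words(string $text): array
{
    // Use re2_match_all() when the extension is loaded; otherwise fall back
    // to preg_match_all() so the sketch still runs. The re2 pattern is
    // written without delimiters, as in the examples in this README.
    if (function_exists('re2_match_all')) {
        re2_match_all('\w+', $text, $matches);
    } else {
        preg_match_all('/\w+/', $text, $matches);
    }
    $words = $matches[0] ?? [];

    $doubled = [];
    for ($i = 1, $n = count($words); $i < $n; $i++) {
        // The comparison a backreference would do inside the engine
        // happens here in plain PHP instead.
        if (strcasecmp($words[$i], $words[$i - 1]) === 0) {
            $doubled[] = $words[$i];
        }
    }
    return $doubled;
}

print_r(find_doubled_words('Paris in the the spring is is lovely'));
```

This sketch treats any two consecutive words as adjacent, even when punctuation separates them; a stricter check would also inspect the text between the matches.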
<?php
$subject = 'Hello regex world';
re2_match_all('\w+', $subject, $matches);
print_r($matches);
/*
Array
(
[0] => Array
(
[0] => Hello
[1] => regex
[2] => world
)
)
*/
re2_match_all('\w(\w+)', $subject, $matches, RE2_SET_ORDER);
print_r($matches);
/*
Array
(
[0] => Array
(
[0] => Hello
[1] => ello
)
[1] => Array
(
[0] => regex
[1] => egex
)
[2] => Array
(
[0] => world
[1] => orld
)
)
*/
echo re2_replace('\w+', 'foo', $subject), "\n";
/*
foo foo foo
*/
echo re2_replace('\w+', 'foo', $subject, 1), "\n";
/*
foo regex world
*/
echo re2_replace_callback('\w+', function($m) { return strtoupper($m[0]); }, $subject, 2), "\n";
/*
HELLO REGEX world
*/
?>
The interface is intended to follow ext/pcre (preg_match() et al.) as closely as possible.
The main differences are:
Returns whether the pattern matches the subject.
Returns how many times the pattern matched the subject.
Replaces all matches of the pattern with the replacement.
Replaces all matches of the pattern with the value returned by the replacement callback.
Replaces all matches of the pattern with the replacement. Returns only the subjects where there was a match.
Returns the array entries that match the pattern (or those that don’t, with RE2_GREP_INVERT).
Escapes all potentially meaningful regexp characters in the subject.
Represents a compiled regex pattern.
Construct a new Re2 object.
If $force_cache is true, the cache will be used regardless of the re2.cache_enabled ini setting.
Returns the pattern.
Returns the options used for this pattern.
Options to be used for a particular pattern.
Construct a new Re2Options object.
Default “utf8”.
The encoding to use for the pattern and subject strings, “utf8” or “latin1”.
Default 8388608 (8 MB).
The max_mem option controls how much memory can be used
to hold the compiled form of the regexp (the Prog) and
its cached DFA graphs. Code Search placed limits on the number
of Prog instructions and DFA states: 10,000 for both.
In RE2, those limits would translate to about 240 KB per Prog
and perhaps 2.5 MB per DFA (DFA state sizes vary by regexp; RE2 does a
better job of keeping them small than Code Search did).
Each RE2 has two Progs (one forward, one reverse), and each Prog
can have two DFAs (one first match, one longest match).

The RE2 memory budget is statically divided between the two
Progs and then the DFAs: two thirds to the forward Prog
and one third to the reverse Prog. The forward Prog gives half
of what it has left over to each of its DFAs. The reverse Prog
gives it all to its longest-match DFA.

Once a DFA fills its budget, it flushes its cache and starts over.
If this happens too often, RE2 falls back on the NFA implementation.
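The static split described above can be worked through with a little arithmetic. This is a simplified model for illustration only, not the library's actual accounting: the function name `sketch_re2_budget()` and the 240 KB Prog sizes passed to it are assumptions.

```php
<?php
// Simplified model of the static max_mem split described above.
// The function name and the Prog sizes passed in are hypothetical.
function sketch_re2_budget(int $maxMem, int $forwardProgSize, int $reverseProgSize): array
{
    $forward = intdiv($maxMem * 2, 3);   // two thirds to the forward Prog
    $reverse = $maxMem - $forward;       // the remaining third to the reverse Prog

    // Whatever a Prog does not need for itself is left for its DFA caches.
    $forwardLeft = max(0, $forward - $forwardProgSize);
    $reverseLeft = max(0, $reverse - $reverseProgSize);

    return [
        'forward_prog_budget'       => $forward,
        'reverse_prog_budget'       => $reverse,
        'forward_first_match_dfa'   => intdiv($forwardLeft, 2),
        'forward_longest_match_dfa' => intdiv($forwardLeft, 2),
        'reverse_longest_match_dfa' => $reverseLeft,
    ];
}

// Default max_mem of 8388608 bytes, assuming each Prog takes about 240 KB.
print_r(sketch_re2_budget(8388608, 240 * 1024, 240 * 1024));
```

Under these assumptions each forward DFA ends up with roughly 2.5 MB and the reverse longest-match DFA with a similar amount, which is in line with the per-DFA figure quoted above.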
Default false.
Restrict patterns to POSIX egrep syntax.
Default false.
Search for the longest match instead of the first match.
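The difference between first-match and longest-match semantics shows up with alternations. The sketch below uses ext/pcre, which is leftmost-first like the default here, and imitates longest-match selection with a hypothetical helper `longest_alternative()` that only considers matches starting at the beginning of the subject. It illustrates the semantics only and does not show how the option is set in this extension.

```php
<?php
// Leftmost-first (Perl-style) matching: alternatives are tried in order,
// so 'a' wins over 'ab' even though 'ab' would also match here.
preg_match('/a|ab/', 'abc', $first);

// Hypothetical illustration of longest-match semantics: try each
// alternative at the start of the subject and keep the longest match.
function longest_alternative(array $alternatives, string $subject): ?string
{
    $best = null;
    foreach ($alternatives as $alt) {
        // \A anchors the attempt at the start of the subject.
        if (preg_match('/\A(?:' . $alt . ')/', $subject, $m)
            && ($best === null || strlen($m[0]) > strlen($best))) {
            $best = $m[0];
        }
    }
    return $best;
}

echo $first[0], "\n";                                // first match: a
echo longest_alternative(['a', 'ab'], 'abc'), "\n";  // longest match: ab
```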
Default true.
Write syntax and execution errors to stderr.
Default false.
Interpret pattern as literal, not regex.
Default false.
Never match \n, even in regex.
Default true.
Match is case-sensitive (the regexp can override this with (?i) unless in posix_syntax mode).
Default false.
Allow Perl’s \d \s \w \D \S \W when in posix_syntax mode.
Default false.
Allow \b \B (word boundary and not) when in posix_syntax mode.
Default false.
^ and $ only match beginning and end of text when in posix_syntax mode.
When set to true, uses a cache (per process) to store all compiled patterns. The cache can be used even when re2.cache_enabled is set to false by passing the $force_cache parameter to the Re2 constructor.