Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • QRegExp API is broken (see T7 in Low Level)
  • QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 [ecma-international.org] . Missing features include
    • Non-greedy quantifiers (see page 141 titled “- 129 -”)
    • But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
  • Patternist/XPath also needs Regex features not found in QRegExp, including
  • Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem [bugzilla.gnome.org]
  • Customer complained about QRegExp (though I don’t see what’s their exact problem):
    • In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn(“abcd”) for pattern “.*” returns 3, not 0
  • T6: only linear input is supported, for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don’t want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries

  • PCRE [pcre.org]
  • “V8”
  • ICU [userguide.icu-project.org]
  • Boost.Regex [boost.org]
  • std::regex (new in C++11)
  • RE2 [code.google.com]

Feature matrix

QRegExp PCREV8ICUBoost.Regexstd::regexRE2
General commentsSee above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. very fast, use a DFA
Cons Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.Does not run on every platform supported by QtCore / QtBase2. Boost does not give guarantees about ABI compatibility. uses UTF-8, doesn’t have the lookbehind neither lookahead
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ (“by hand”, with partial matching) Maybe yes, see UText [userguide.icu-project.org] . ✔ see StringPiece
Fixes T7

✘✔

Supported syntax: Characters

QRegExp PCREV8ICUBoost.Regexstd::regexRE2
\aBELL
\Abeginning of input
\b inside a [set]BACKSPACE?
\b outside a [set]on a word boundary
\Bnot on a word boundary
\cXASCII control character X
\ddigit
\Dnon digit
\eESCAPE
\Eend of \Q … \E quoting
\fFORM FEED
\Gend of previous match
\nLINE FEED
\N{x}UNICODE CHARACTER NAME x
\p{x}UNICODE PROPERTY NAME x
\P{x}UNICODE PROPERTY NAME not x
\Qstart of \Q … \E quoting
\rCARRIAGE RETURN
\swhite space
\Snon white space
\tHORIZONTAL TAB
\uhhhhU+hhhh (between U+0000 and U+FFFF)
\UhhhhhhhhU+hhhhhhhh (between U+00000000 and U+0010FFFF)
\vVERTICAL TAB
\wword character
\Wnon word character
\x{hhhh}U+hhhh✔ (0-10FFFF) ✔ (0-10FFFF)
\xhhhhU+hhhh✔ (0000-FFFF)✔ (00-FF) ✔ (00-FF)
\Xgrapheme cluster
\Zend of input (or before the final \n)
\zend of input
\nn-th backreference
\0oooASCII/Latin-1 character 0ooo
.any character but newlines
^line beginning
$line end
\quote the following symbol
[pattern]set

Supported syntax: Operators

Operator QRegExp PCREV8ICUBoost.Regexstd::regexRE2
* match 0 or more times ??
+ match 1 or more times
? match 0 or 1 times
{n} match n times
{n,} match n or more times
{n,m} match between n and m times
*? match 0 or more times, not greedy
? match 1 or more times, not greedy
?? match 0 or 1 times, not greedy
{n}? match n times
{n,}? match n or more times, not greedy
{n,m}? match between n and m times, not greedy
* match 0 or more times, possessive
++ match 1 or more times, possessive
?+ match 0 or 1 times, possessive
{n}+ match n times
{n,}+ match n or more times, possessive
{n,m}+ match between n and m times, possessive
( … ) capturing group
(?: … ) group
(?> … ) atomic grouping
(?# … ) comment
(?= … ) look-ahead assertion
(?! … ) negative look-ahead assertion
(?<= … ) look-behind assertion
(?<! … ) negative look-behind assertion
(?flags: … ) flags change
(?flags) flags change
(?P<name> …) named capturing group
(?<name> …) named capturing group
(?‘name’ …) named capturing group
(?| …) branch reset

Supported syntax: flags

Flag QRegExp PCREV8ICUBoost.Regexstd::regexRE2
/i case insensitive
/m multi-line
/s dot matches anything~1
/x ignore whitespace and comments
/U minimal match

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

1 It’s actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.

2 Cf. http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-August/000962.html

Categories: