Sonntag, 6. September 2015

You broke my regular expressions

A quick one to get you to re-think some of the regular expressions you wrote:

What part of

    AAAA

does

    (AA|AAAA)

match?

If you thought AAAA, welcome to POSIX (and C, C++, boost) country, where regular expressions return the longest leftmost match over the whole expression.

If you thought AA, welcome to Perl, Python, Ruby, PCRE, Java, and the like, where alternates are matched left-to-right and only the first one matches.

The latter is called eager alternative matching, but I would rather like to call it short-circuiting. It seems to simplify matching of alternatives and reduce backtracking (and thus memory requirements), or in short: it's faster. It can also be annoying if the engine does not allow you to switch between the different interpretations, because sometimes you really want the real leftmost longest match. And none of the current engines allow that switch.

It's not only alternates, however: Matching A(A)?(AB)? against AAB matches AA instead of AAB with modern modern engines. Left longest match? Nope. Instead, the quantifiers are greedy even if the longest match rules is violated. You need to re-write the above as A(A)??(AB)? (note the double quotation marks) to get POSIX-like behavior.

That greedy matching modifiers even exists is an artifact of dropping leftmost longest semantics. Modern regular expression engines do not perform longest leftmost overall matching. If they claim otherwise, the documentation author is clueless.

So, some time in last 20 years your regular expressions slowly but surely has their semantics changed for the sake of more features and more speed. And it still leads to confusion.

What do we learn from this? Well ... this what happens when you --more or less-- silently change something that was accepted as the norm for quite a long time. It is really quite important to document such subtle but fundamental changes loudly, prominently, and clearly.

And yes, I want my leftmost longest match back at least as an option. Python's regex now has that option since version 2015.09.15. Nice to have and thank you Matthew.

Keine Kommentare:

Kommentar veröffentlichen