How to enable Unicode support for regular expressions?

The8th
User
Posts: 29
Joined: Fri Sep 04, 2015 10:23 am

Post by The8th

In the following example (PB 5.70 LTS X86)

Code:

EnableExplicit
Define s1$ = "Keins glich dem anderen."
Define s2$ = "Bis man es anglich."
Define s3$ = "Man sieht das täglich."
Define s4$ = "Nichts ist unmöglich."

; #PB_Any returns a dynamically generated number, so an integer (.i) variable is the safer choice
Define regex.i = CreateRegularExpression(#PB_Any, "\bglich\b")
Debug MatchRegularExpression(regex, s1$)
Debug MatchRegularExpression(regex, s2$)
Debug MatchRegularExpression(regex, s3$)
Debug MatchRegularExpression(regex, s4$)

In the output
1
0
1
1
the third and fourth results are wrong, because ä and ö are letters, not word boundaries.

In other languages (like PHP) you can enable full Unicode support by setting a modifier like /u. Then the output is correct.
How can I enable full Unicode support in PureBasic?
Or what is the correct way to handle the example properly? The output should be
1
0
0
0

Henry

applePi
Addict
Posts: 1404
Joined: Sun Jun 25, 2006 7:28 pm

Post by applePi

Replace \b with \s and it will work: CreateRegularExpression(#PB_Any, "\sglich\s")
I have tested with \b, and it does not work with ä.
To enable Unicode in the IDE:
1- File -> file format -> Encoding utf8
2- Preferences -> Compiler -> Defaults -> Source file Text encoding -> utf8
Tested with 5.70 LTS x64 on Windows 7.

Post by The8th

applePi wrote: replace \b with \s
Sorry, but \s only matches whitespace, so it only finds words that stand between spaces; it is not a word boundary. It does not cover brackets, punctuation marks, quotation marks and the like, nor the many non-letter characters that act as word boundaries in languages worldwide.
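
The objection to \s can be checked with a minimal sketch like the following. The second and third test strings are made up for illustration and are not part of the original example; only the first one should match, because \s requires an actual whitespace character on both sides of the word.

```purebasic
EnableExplicit

; The pattern suggested above: whitespace on both sides of the word.
Define regex.i = CreateRegularExpression(#PB_Any, "\sglich\s")

If regex
  ; Word surrounded by spaces
  Debug MatchRegularExpression(regex, "Keins glich dem anderen.")
  ; Word surrounded by brackets (illustrative string)
  Debug MatchRegularExpression(regex, "Das Bild (glich) dem anderen.")
  ; Word at the very start and end of the string (illustrative string)
  Debug MatchRegularExpression(regex, "glich")
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```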

The file format of the PB source code editor has no effect on how regular expressions are treated. The strings inside the application are all UTF16.

When I create the regular expression
"/\bglich\b/"
in PHP, for example, the results are just as wrong as in my PB example.
But as soon as I create the regular expression
"/\bglich\b/u" (u = full Unicode support)
the expression is treated correctly for Unicode characters as well.

Now: How do I apply the modifier /u to a PureBasic regular expression?

Henry

Little John
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John

The8th wrote: Now: How do I apply the modifier /u to a PureBasic regular expression?
I think something like that is currently not possible in PureBasic (see my related feature request).

However, you can emulate Unicode aware word boundaries by using lookaround.

Adapted from
Goyvaerts, Jan; Levithan, Steven:
Regular Expressions Cookbook.
O'Reilly, 2nd ed. (2012), p. 332

Code:

EnableExplicit

#Rex_LeftBoundary$  = "(?<=[^\p{L}\p{M}\p{Nd}\p{Pc}]|^)"
#Rex_RightBoundary$ =  "(?=[^\p{L}\p{M}\p{Nd}\p{Pc}]|$)"

Define s1$ = "Keins glich dem anderen."
Define s2$ = "Bis man es anglich."
Define s3$ = "Man sieht das täglich."
Define s4$ = "Nichts ist unmöglich."
Define search$ = "glich"

Define regex.i = CreateRegularExpression(#PB_Any, #Rex_LeftBoundary$ + search$ + #Rex_RightBoundary$)

If regex
   Debug MatchRegularExpression(regex, s1$)
   Debug MatchRegularExpression(regex, s2$)
   Debug MatchRegularExpression(regex, s3$)
   Debug MatchRegularExpression(regex, s4$)
Else
   Debug RegularExpressionError()
EndIf
Output, as expected (using PB 5.71 beta 2 on Windows):
1
0
0
0

Post by The8th

Thanks, John, for adding the request.
The same limitation affects expressions such as [^A-Za-z], \W and similar, which also fail for Unicode characters.
And thanks for your snippet. I will have a look at it tomorrow when I have more time.
Henry

Post by Little John

Instead of \w (lower case), you can use [\p{L}\p{M}\p{Nd}\p{Pc}],
instead of \W (upper case), you can use [^\p{L}\p{M}\p{Nd}\p{Pc}].
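
As a rough sketch of how the \w replacement could be used, the following lists the words of one of the example sentences. The constant name #Rex_WordChar$ and the array word$() are made up for this illustration; the intent is that "täglich" comes out as a single word instead of being split at the ä.

```purebasic
EnableExplicit

; Unicode-aware stand-in for \w, as given above (constant name is illustrative).
#Rex_WordChar$ = "[\p{L}\p{M}\p{Nd}\p{Pc}]"

Define text$ = "Man sieht das täglich."
Dim word$(0)

; One or more word characters in a row form a word.
Define regex.i = CreateRegularExpression(#PB_Any, #Rex_WordChar$ + "+")

If regex
  ; ExtractRegularExpression fills the array and returns the number of matches.
  Define count.i = ExtractRegularExpression(regex, text$, word$())
  Define i.i
  For i = 0 To count - 1
    Debug word$(i)
  Next
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```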

Post by The8th

I tested your code, and it works.
As I only need to check whether there is a match, I removed the positive lookbehind and lookahead, hoping to save some microseconds when searching several thousand files for matches :lol:
If one needs the exact matches (e.g. for ExtractRegularExpression or ReplaceRegularExpression), one should of course leave them in place.

Code:

EnableExplicit
Define s1$ = "Keins glich dem anderen."
Define s2$ = "Bis man es anglich."
Define s3$ = "Man sieht das täglich."
Define s4$ = "Nichts ist unmöglich."
Define search$ = "glich"
; Plain groups instead of lookaround: the boundary characters become part of the match
Define regex.i = CreateRegularExpression(#PB_Any, "([^\p{L}\p{M}\p{Nd}\p{Pc}]|^)" + search$ + "([^\p{L}\p{M}\p{Nd}\p{Pc}]|$)")
Debug MatchRegularExpression(regex, s1$)
Debug MatchRegularExpression(regex, s2$)
Debug MatchRegularExpression(regex, s3$)
Debug MatchRegularExpression(regex, s4$)
Thanks again for your input.
Henry
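
To illustrate why the lookaround matters for the exact match, here is a small sketch comparing both variants with ExtractRegularExpression. The names #Rex_NonWord$, lookaround, consuming and hit$() are chosen for this illustration only. The square brackets in the output make it visible whether neighbouring characters were included: with the plain groups, the surrounding spaces should end up inside the match.

```purebasic
EnableExplicit

; Anything that is not a Unicode word character (constant name is illustrative).
#Rex_NonWord$ = "[^\p{L}\p{M}\p{Nd}\p{Pc}]"

Define text$ = "Keins glich dem anderen."
Dim hit$(0)

; Lookaround: the boundaries are checked but not consumed.
Define lookaround.i = CreateRegularExpression(#PB_Any, "(?<=" + #Rex_NonWord$ + "|^)glich(?=" + #Rex_NonWord$ + "|$)")
; Plain groups: the neighbouring characters become part of the match.
Define consuming.i = CreateRegularExpression(#PB_Any, "(" + #Rex_NonWord$ + "|^)glich(" + #Rex_NonWord$ + "|$)")

If lookaround And consuming
  If ExtractRegularExpression(lookaround, text$, hit$())
    Debug "[" + hit$(0) + "]"
  EndIf
  If ExtractRegularExpression(consuming, text$, hit$())
    Debug "[" + hit$(0) + "]"
  EndIf
Else
  Debug RegularExpressionError()
EndIf
```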

Post by The8th

Forget my last post!
In my test, positive lookbehind and lookahead were about 5 times faster than the plain groups!
I had assumed the opposite before I tested.
That is:
(?<=[^\p{L}\p{M}\p{Nd}\p{Pc}]|^) is faster than ([^\p{L}\p{M}\p{Nd}\p{Pc}]|^)
Henry
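
A timing comparison along these lines could be sketched as follows. This is not the test that was actually run: the helper TimeRegex(), the loop count and the sample sentence are assumptions made for illustration, and the result will depend on the input and the PB version. It is best compiled with the debugger disabled, since the debugger distorts timings.

```purebasic
EnableExplicit

#Rex_NonWord$ = "[^\p{L}\p{M}\p{Nd}\p{Pc}]"
#Loops = 200000   ; arbitrary iteration count, chosen for illustration

; Hypothetical helper: returns the time in ms for #Loops match attempts,
; or -1 if the pattern could not be compiled.
Procedure.q TimeRegex(pattern$, text$)
  Protected regex.i = CreateRegularExpression(#PB_Any, pattern$)
  Protected i.i, start.q, elapsed.q = -1
  If regex
    start = ElapsedMilliseconds()
    For i = 1 To #Loops
      MatchRegularExpression(regex, text$)
    Next
    elapsed = ElapsedMilliseconds() - start
    FreeRegularExpression(regex)
  EndIf
  ProcedureReturn elapsed
EndProcedure

Define text$ = "Nichts ist unmöglich, aber keins glich dem anderen."
Define tLook.q  = TimeRegex("(?<=" + #Rex_NonWord$ + "|^)glich(?=" + #Rex_NonWord$ + "|$)", text$)
Define tGroup.q = TimeRegex("(" + #Rex_NonWord$ + "|^)glich(" + #Rex_NonWord$ + "|$)", text$)

MessageRequester("Timing (ms)", "lookaround: " + Str(tLook) + #LF$ + "groups: " + Str(tGroup))
```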

Sooraa
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Post by Sooraa

Hi The8th,

Real Unicode support for \b, \w, \d and \s still does not work properly out of the box, even after the update of the PCRE library to 8.44 in PB 5.72.
The update alone did not help.

We have to turn on the UCP support of the PCRE compiler in the CreateRegularExpression call by prefixing the pattern with (*UCP). For the example "\bglich\b", that gives "(*UCP)\bglich\b".

With it, \b, \w, \d and \s work fine.
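
Applied to the original example, the (*UCP) prefix could be used roughly as sketched below. This assumes, as described in this post, that the PCRE build bundled with your PB version accepts the (*UCP) start-of-pattern option; if it does not, CreateRegularExpression fails and the error text is shown instead. The variable name pattern$ is chosen for illustration.

```purebasic
EnableExplicit

; Assumption (taken from the post above): the bundled PCRE accepts (*UCP),
; which makes \b, \w, \d and \s use Unicode properties.
Define pattern$ = "(*UCP)\bglich\b"
Define regex.i = CreateRegularExpression(#PB_Any, pattern$)

If regex
  Debug MatchRegularExpression(regex, "Keins glich dem anderen.")
  Debug MatchRegularExpression(regex, "Bis man es anglich.")
  Debug MatchRegularExpression(regex, "Man sieht das täglich.")
  Debug MatchRegularExpression(regex, "Nichts ist unmöglich.")
  FreeRegularExpression(regex)
Else
  ; An older PCRE build may reject the (*UCP) prefix.
  Debug RegularExpressionError()
EndIf
```

Compared with the lookaround emulation, this keeps the pattern itself unchanged, so existing expressions only need the prefix.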