All times are UTC + 1 hour




 Post subject: How to enable Unicode support for regular expressions?
Posted: Sat Jul 27, 2019 7:43 am
In the following example (PB 5.70 LTS X86)
Code:
EnableExplicit
Define s1$ = "Keins glich dem anderen."
Define s2$ = "Bis man es anglich."
Define s3$ = "Man sieht das täglich."
Define s4$ = "Nichts ist unmöglich."

Define regex.i = CreateRegularExpression(#PB_Any, "\bglich\b") ; .i instead of .l: a #PB_Any handle may not fit into a long on x64
Debug MatchRegularExpression(regex, s1$)
Debug MatchRegularExpression(regex, s2$)
Debug MatchRegularExpression(regex, s3$)
Debug MatchRegularExpression(regex, s4$)
the third and fourth lines of the output
1
0
1
1
are wrong, because ä and ö are letters and therefore not word boundaries.

In other languages (like PHP) you can enable full Unicode support by setting a modifier like /u. Then the output is correct.
How can I enable full Unicode support in PureBasic?
Or what is the correct way to handle the example properly? The output should be
1
0
0
0

Henry


 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sat Jul 27, 2019 11:24 am
Replace \b with \s and it will work: CreateRegularExpression(#PB_Any, "\sglich\s")
I tested with \b, and it does not work with ä.
To enable Unicode:
1- File -> file format -> Encoding UTF-8
2- Preferences -> Compiler -> Defaults -> Source file Text encoding -> UTF-8
Tested with 5.70 LTS x64 on Windows 7.


 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sat Jul 27, 2019 12:17 pm
applePi wrote:
replace \b with \s
Sorry, but \s only matches whitespace, so it only finds words between spaces; it does not mark word boundaries. That means it does not cover brackets, punctuation marks, quotation marks and other stuff. Nor does it cover the numerous non-alphabetic characters in languages worldwide that are meant to be word boundaries.
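
To make the limitation of \s concrete, here is a minimal sketch. The second and third test strings are made up for illustration and are not from the original example. Because \s must consume a whitespace character on each side, a word next to a bracket or at the very start or end of the string is not found.

```purebasic
EnableExplicit

; \s needs a real whitespace character on BOTH sides of the word.
Define regex.i = CreateRegularExpression(#PB_Any, "\sglich\s")

If regex
  Debug MatchRegularExpression(regex, "Keins glich dem anderen.")           ; 1: whitespace on both sides
  Debug MatchRegularExpression(regex, "Das Wort (glich) steht in Klammern.") ; 0: brackets instead of whitespace
  Debug MatchRegularExpression(regex, "glich")                               ; 0: start and end of string are not whitespace
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```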

The file format of the PB source code editor has no effect on how regular expressions are treated. The strings inside the application are all UTF-16.

When I create a regular expression in PHP, e.g.
"/\bglich\b/"
the results are wrong, just as in my PB example.
But as soon as I create the regular expression
"/\bglich\b/u" (u = full Unicode support)
the expression is treated correctly for Unicode characters as well.

Now: How do I apply the modifier /u to a PureBasic regular expression?

Henry


 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sat Jul 27, 2019 1:11 pm
The8th wrote:
Now: How do I apply the modifier /u to a PureBasic regular expression?
I think something like that is currently not possible in PureBasic (see my related feature request).

However, you can emulate Unicode aware word boundaries by using lookaround.

Adapted from:
Goyvaerts, Jan; Levithan, Stephen:
Regular Expressions Cookbook.
O'Reilly, 2nd ed. (2012), p. 332


Code:
EnableExplicit

#Rex_LeftBoundary$  = "(?<=[^\p{L}\p{M}\p{Nd}\p{Pc}]|^)"
#Rex_RightBoundary$ =  "(?=[^\p{L}\p{M}\p{Nd}\p{Pc}]|$)"

Define s1$ = "Keins glich dem anderen."
Define s2$ = "Bis man es anglich."
Define s3$ = "Man sieht das täglich."
Define s4$ = "Nichts ist unmöglich."
Define search$ = "glich"

Define regex.i = CreateRegularExpression(#PB_Any, #Rex_LeftBoundary$ + search$ + #Rex_RightBoundary$)

If regex
   Debug MatchRegularExpression(regex, s1$)
   Debug MatchRegularExpression(regex, s2$)
   Debug MatchRegularExpression(regex, s3$)
   Debug MatchRegularExpression(regex, s4$)
Else
   Debug RegularExpressionError()
EndIf

Output, as expected (using PB 5.71 beta 2 on Windows):
1
0
0
0
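
If the search term comes from user input, it may contain regex metacharacters. The following is a minimal sketch of one way to wrap the boundary constants in a reusable procedure. The procedure name CreateWholeWordRegex and the "z.B." test strings are hypothetical, not part of the code above. It relies on the PCRE \Q...\E quoting syntax and assumes that the search term itself does not contain the sequence \E.

```purebasic
EnableExplicit

#Rex_LeftBoundary$  = "(?<=[^\p{L}\p{M}\p{Nd}\p{Pc}]|^)"
#Rex_RightBoundary$ = "(?=[^\p{L}\p{M}\p{Nd}\p{Pc}]|$)"

; Hypothetical helper: returns a regex handle (0 on failure) that matches
; word$ literally, surrounded by the Unicode-aware boundaries.
; Assumption: word$ does not contain "\E", which would end the quoting early.
Procedure.i CreateWholeWordRegex(word$)
  ProcedureReturn CreateRegularExpression(#PB_Any, #Rex_LeftBoundary$ + "\Q" + word$ + "\E" + #Rex_RightBoundary$)
EndProcedure

Define regex.i = CreateWholeWordRegex("z.B.")

If regex
  Debug MatchRegularExpression(regex, "Das gilt z.B. hier.") ; 1: the literal text "z.B." is present
  Debug MatchRegularExpression(regex, "Das gilt zxBy hier.") ; 0: the dots are not treated as wildcards
  FreeRegularExpression(regex)
Else
  Debug RegularExpressionError()
EndIf
```

Without the \Q...\E quoting, the dots in "z.B." would match any character, and the second test would report a match.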

 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sat Jul 27, 2019 6:21 pm
Thanks, John, for adding the request.
This also affects expressions like [^A-Za-z], \W and similar, which fail for Unicode characters as well.
And thanks for your snippet. I will have a look at it tomorrow when I have more time.
Henry


 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sat Jul 27, 2019 7:43 pm
Instead of \w (lower case), you can use [\p{L}\p{M}\p{Nd}\p{Pc}],
instead of \W (upper case), you can use [^\p{L}\p{M}\p{Nd}\p{Pc}].
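
As a rough illustration of the replacement for \w, the sketch below extracts "words" from one of the test sentences with both patterns. The helper ListMatches is hypothetical. Given the behaviour reported earlier in this thread, \w+ can be expected to break "unmöglich" apart at the ö, while the property-based class should return the three complete words.

```purebasic
EnableExplicit

; Hypothetical helper: prints every match of pattern$ in text$ to the debug window.
Procedure ListMatches(pattern$, text$)
  Protected regex.i = CreateRegularExpression(#PB_Any, pattern$)
  Protected count.i, i.i
  Protected Dim match$(0)

  If regex
    ; ExtractRegularExpression fills the array and returns the number of matches
    count = ExtractRegularExpression(regex, text$, match$())
    For i = 0 To count - 1
      Debug match$(i)
    Next
    FreeRegularExpression(regex)
  Else
    Debug RegularExpressionError()
  EndIf
EndProcedure

ListMatches("\w+", "Nichts ist unmöglich.")                       ; ASCII-only word characters
Debug "---"
ListMatches("[\p{L}\p{M}\p{Nd}\p{Pc}]+", "Nichts ist unmöglich.") ; Unicode-aware replacement
```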

 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sun Jul 28, 2019 7:21 am
I tested your code, and it is working.
As I only need to check for a match, I removed the positive lookbehind and lookahead. I hope this saves some microseconds when searching several thousand files for matches :lol: .
If one needs the exact matches (e.g. for ExtractRegularExpression or ReplaceRegularExpression), one should of course leave them where they are.
Code:
EnableExplicit
Define s1$ = "Keins glich dem anderen."
Define s2$ = "Bis man es anglich."
Define s3$ = "Man sieht das täglich."
Define s4$ = "Nichts ist unmöglich."
Define search$ = "glich"
Define regex.i = CreateRegularExpression(#PB_Any, "([^\p{L}\p{M}\p{Nd}\p{Pc}]|^)" + search$ + "([^\p{L}\p{M}\p{Nd}\p{Pc}]|$)") ; .i instead of .l: a #PB_Any handle may not fit into a long on x64
Debug MatchRegularExpression(regex, s1$)
Debug MatchRegularExpression(regex, s2$)
Debug MatchRegularExpression(regex, s3$)
Debug MatchRegularExpression(regex, s4$)
Thanks again for your input.
Henry
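
The difference between the two variants shows up as soon as the match itself is used. The sketch below is only an illustration (the constant #WordChar$ and the replacement text "GLICH" are made up): with plain groups, the neighbouring characters become part of the match and are replaced along with the word.

```purebasic
EnableExplicit

#WordChar$ = "\p{L}\p{M}\p{Nd}\p{Pc}"

Define text$ = "Keins glich dem anderen."

; Variant 1: lookbehind/lookahead, the boundary characters are only tested
Define lookaround.i = CreateRegularExpression(#PB_Any, "(?<=[^" + #WordChar$ + "]|^)glich(?=[^" + #WordChar$ + "]|$)")
; Variant 2: plain groups, the boundary characters are consumed by the match
Define consuming.i  = CreateRegularExpression(#PB_Any, "([^" + #WordChar$ + "]|^)glich([^" + #WordChar$ + "]|$)")

If lookaround And consuming
  Debug ReplaceRegularExpression(lookaround, text$, "GLICH") ; Keins GLICH dem anderen.
  Debug ReplaceRegularExpression(consuming, text$, "GLICH")  ; KeinsGLICHdem anderen.
  FreeRegularExpression(lookaround)
  FreeRegularExpression(consuming)
Else
  Debug RegularExpressionError()
EndIf
```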


 Post subject: Re: How to enable Unicode support for regular expressions?
Posted: Sun Jul 28, 2019 8:19 am
Forget my last posting!
Positive lookbehind and lookahead are about 5 times faster!
I assumed the opposite would be true before I tested.
That is:
(?<=[^\p{L}\p{M}\p{Nd}\p{Pc}]|^) is faster than ([^\p{L}\p{M}\p{Nd}\p{Pc}]|^)
Henry
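
For anyone who wants to repeat the comparison, here is a rough timing sketch. It is not the test that was actually run for this posting: the helper TimeMatches, the loop count and the test string are arbitrary choices. It should be compiled with the debugger disabled, which is why the result is shown with MessageRequester instead of Debug. The numbers will vary with the machine, the PB version and the input.

```purebasic
EnableExplicit

#WordChar$ = "\p{L}\p{M}\p{Nd}\p{Pc}"
#Loops     = 100000  ; arbitrary repetition count, chosen only for illustration

; Hypothetical helper: runs the same match 'loops' times and returns the elapsed milliseconds.
Procedure.q TimeMatches(regex.i, text$, loops.i)
  Protected i.i
  Protected start.q = ElapsedMilliseconds()

  For i = 1 To loops
    MatchRegularExpression(regex, text$)
  Next

  ProcedureReturn ElapsedMilliseconds() - start
EndProcedure

Define text$ = "Man sieht das täglich."
Define lookaround.i = CreateRegularExpression(#PB_Any, "(?<=[^" + #WordChar$ + "]|^)glich(?=[^" + #WordChar$ + "]|$)")
Define consuming.i  = CreateRegularExpression(#PB_Any, "([^" + #WordChar$ + "]|^)glich([^" + #WordChar$ + "]|$)")

If lookaround And consuming
  MessageRequester("Timing", "lookaround: " + Str(TimeMatches(lookaround, text$, #Loops)) + " ms" + #LF$ +
                             "consuming: " + Str(TimeMatches(consuming, text$, #Loops)) + " ms")
Else
  MessageRequester("Error", RegularExpressionError())
EndIf
```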

