Regular Expressions modifiers and delimiters

Posted: Fri Jun 15, 2012 10:27 am
by Kukulkan
Hello,

the documentation gives me no information about the difference between PCRE syntax and PureBasic syntax.

To search case-insensitively, I normally use (PCRE style):

"/test/i"

In PureBasic, this does not work. It seems I have to use

"(?i)test"

Questions:
1) Is this correct? And are there no delimiters in PureBasic?
2) Shouldn't this be in the documentation?

Kukulkan

Re: Regular Expressions modifiers and delimiters

Posted: Sat Jun 16, 2012 1:27 am
by IdeasVacuum
PB4.61 Help:
All the regular expressions supported in PCRE will be supported in PureBasic
I assume that means syntax too.

Re: Regular Expressions modifiers and delimiters

Posted: Sat Jun 16, 2012 12:48 pm
by Kukulkan
Kukulkan wrote: "/test/i"

In Purebasic, this does not work. It seems I have to use

"(?i)test"
The official syntax with delimiters does not work in PureBasic:

If CreateRegularExpression(0, "/some/i")

  Dim Result$(0)
  
  a = ExtractRegularExpression(0, "This is for SOME test.", result$())
  
  MessageRequester("Info", "Nb strings found: "+Str(a))
  
  For k=0 To a-1
    MessageRequester("Info", Result$(k))
  Next

Else
  MessageRequester("Error", RegularExpressionError())
EndIf
It does not work until you change the RegExp to

If CreateRegularExpression(0, "(?i)some")
Kukulkan

Re: Regular Expressions modifiers and delimiters

Posted: Sat Jun 16, 2012 2:17 pm
by Little John
Kukulkan wrote:The official syntax with delimiters does not work in Purebasic:

If CreateRegularExpression(0, "/some/i")
Are you sure that this is the official PCRE syntax?
http://www.pcre.org/pcre.txt wrote:PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting.
As I understand this quote, (?i) is the official PCRE syntax for this.

Regards, Little John
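
For patterns that already exist in Perl style, the point above (no delimiters, options set inline) can be sketched as a small conversion routine. This is only a rough sketch: DelimitedToInline() is a hypothetical helper, not part of PureBasic or PCRE. It assumes "/" is the only delimiter in use and that every trailing modifier letter is also valid as a PCRE inline option (i, m, s and x are; Perl's g is not).

```purebasic
; Hypothetical helper, not a PureBasic built-in:
; turns "/body/flags" into "(?flags)body" and leaves other input unchanged.
Procedure.s DelimitedToInline(pattern$)
  Protected lastSlash.i, body$, flags$
  
  ; Only patterns that start with the "/" delimiter are converted
  If Left(pattern$, 1) <> "/"
    ProcedureReturn pattern$
  EndIf
  
  ; Search backwards for the closing delimiter
  lastSlash = Len(pattern$)
  While lastSlash > 1 And Mid(pattern$, lastSlash, 1) <> "/"
    lastSlash - 1
  Wend
  
  ; No closing delimiter found: return the pattern as it is
  If lastSlash <= 1
    ProcedureReturn pattern$
  EndIf
  
  body$  = Mid(pattern$, 2, lastSlash - 2)  ; text between the delimiters
  flags$ = Mid(pattern$, lastSlash + 1)     ; modifier letters after the closing "/"
  
  If flags$ <> ""
    ProcedureReturn "(?" + flags$ + ")" + body$
  EndIf
  ProcedureReturn body$
EndProcedure

; "/some/i" is converted to the inline form used earlier in this thread
If CreateRegularExpression(0, DelimitedToInline("/some/i"))
  Debug MatchRegularExpression(0, "This is for SOME test.")
Else
  Debug RegularExpressionError()
EndIf
```

The routine does no validation of the modifier letters; an unsupported one would simply surface as an error from CreateRegularExpression().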

Re: Regular Expressions modifiers and delimiters

Posted: Sun Jun 17, 2012 11:02 am
by Little John
Kukulkan wrote:It does not work until you change the RegExp to

If CreateRegularExpression(0, "(?i)some")
In the meantime I found out that even this does not work correctly with non-ASCII characters such as the German umlauts. The following code does not find a match (PB 4.61 on Windows XP x86, tested in ASCII mode and in Unicode mode):

If CreateRegularExpression(0, "(?i)someäöü")
   Dim Result$(0)
   a = ExtractRegularExpression(0, "This is for SOMEÄÖÜ test.", result$())
   MessageRequester("Info", "Nb strings found: " + Str(a))
   For k = 0 To a-1
      MessageRequester("Info", Result$(k))
   Next
   
Else
   MessageRequester("Error", RegularExpressionError())
EndIf
So if there can be non-ASCII characters in your strings, on Windows it's better to use LCase() or UCase() instead:

pattern$ = "someäöü"
search$ = "This is for SOMEÄÖÜ test."

If CreateRegularExpression(0, LCase(pattern$))
   Dim Result$(0)
   a = ExtractRegularExpression(0, LCase(search$), result$())
   MessageRequester("Info", "Nb strings found: " + Str(a))
   For k = 0 To a-1
      MessageRequester("Info", Result$(k))
   Next
   
Else
   MessageRequester("Error", RegularExpressionError())
EndIf
Using LCase() or UCase() on Linux does not help in this regard, because they can't handle special characters either. :-(

Regards, Little John

Re: Regular Expressions modifiers and delimiters

Posted: Sat Sep 01, 2012 7:30 pm
by luis
@Little John

I think PCRE for PB has been compiled with UTF-8 support but without Unicode property support:
PCRE Help wrote: In UTF-8 mode, PCRE always understands the concept of case for characters whose values are less than 128, so caseless matching is always possible. For characters with higher values, the concept of case is supported if PCRE is compiled with Unicode property support, but not otherwise. If you want to use caseless matching for characters 128 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF-8 support.
This seems to be confirmed by this code:

If CreateRegularExpression(0, "\p") = 0
    Debug RegularExpressionError() 
EndIf
prints:
support for \P, \p, and \X has not been compiled
I read that to use \p, \P or \X in regular expressions, PCRE must be compiled with the SUPPORT_UTF8 and SUPPORT_UCP (Unicode properties) conditional defines.

So the message above seems to suggest that Unicode property support is disabled.

In the end, to fix all this and make caseless matching work for characters with a code point of 128 and above, PCRE would have to be built by running ./configure --enable-unicode-properties before running make, or something like that.

So maybe you could make a request for that if you like.

Edit: In the meantime I did it -> http://www.purebasic.fr/english/viewtop ... =3&t=51463 ;)

Re: Regular Expressions modifiers and delimiters

Posted: Mon Dec 24, 2012 4:22 pm
by Little John
luis wrote:So maybe you could make a request for that if you like.

Edit: In the meantime I did it -> http://www.purebasic.fr/english/viewtop ... =3&t=51463 ;)
Thank you, Luis! ( I was too lazy. :-) )

BTW: The problem still exists in PB 5.10 Beta 1, when using the new #PB_RegularExpression_NoCase option.

Regards, Little John
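
For reference, a minimal sketch of the flag-based form mentioned above. It assumes PureBasic 5.10 or later, where CreateRegularExpression() accepts an optional flags parameter, and it uses ASCII-only text on purpose, since the thread reports that characters above 127 are still not matched caselessly.

```purebasic
; Assumes PB 5.10+ (optional flags parameter of CreateRegularExpression)
If CreateRegularExpression(0, "some", #PB_RegularExpression_NoCase)
  
  ; MatchRegularExpression() returns non-zero if the string matches
  If MatchRegularExpression(0, "This is for SOME test.")
    Debug "Match"
  Else
    Debug "No match"
  EndIf
  
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf
```

The flag takes the place of the inline (?i) prefix, so the pattern string itself stays unchanged.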