ExtractRegularExpression splits strings incorrect.

Posted: Mon Mar 25, 2013 2:33 pm
by Joris
http://www.purebasic.fr/english/viewtop ... 18#p409018

Following up on the topic linked above, I tried some regular expressions like the one shown below, based on the PCRE documentation referenced in the help file: http://www.pcre.org/pcre.txt
Shouldn't ExtractRegularExpression() split the string into words instead of single characters when the \w expression is used?
Another use of backslash is for specifying generic character types:
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
\V any character that is not a vertical white space character
\w any "word" character
\W any "non-word" character
Try the source below with \w or \W and \s or \S; imo the lower-case and upper-case forms of each expression should do the opposite of each other.
It looks like text can only be split into single characters, not into words, while \w and \W really stand for words imo.

Code: Select all

If CreateRegularExpression(0, "\w", #PB_RegularExpression_DotAll)
  Dim Result$(0)
  NbFound = ExtractRegularExpression(0, "abC ABc zbA abc", Result$())
  Debug NbFound
  For k = 0 To NbFound - 1
    Debug Result$(k)
  Next
Else
  Debug RegularExpressionError()
EndIf
Thanks

Re: ExtractRegularExpression splits strings incorrect.

Posted: Mon Mar 25, 2013 2:57 pm
by STARGÅTE
\w matches only a single word character. If you want to extract whole words, use \w+
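
A minimal sketch of the difference, reusing the input string from the first post. The regular-expression IDs 0 and 1 and the variable names are arbitrary choices for illustration:

```purebasic
; \w  matches exactly one word character per match.
; \w+ matches a run of one or more word characters.
Text$ = "abC ABc zbA abc"

If CreateRegularExpression(0, "\w") And CreateRegularExpression(1, "\w+")
  Dim Chars$(0)
  Dim Words$(0)

  NbChars = ExtractRegularExpression(0, Text$, Chars$())
  NbWords = ExtractRegularExpression(1, Text$, Words$())

  Debug NbChars   ; 12: one match per letter
  Debug NbWords   ; 4: one match per word

  ; Print the whole words found by \w+
  For k = 0 To NbWords - 1
    Debug Words$(k)
  Next

  FreeRegularExpression(0)
  FreeRegularExpression(1)
Else
  Debug RegularExpressionError()
EndIf
```

The spaces in the input are never matched by either pattern, so they only act as separators between the \w+ matches.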

Re: ExtractRegularExpression splits strings incorrect.

Posted: Mon Mar 25, 2013 3:23 pm
by Joris
STARGÅTE wrote: \w matches only a single word character. If you want to extract whole words, use \w+
Ok, nice, that works. It differs from the specification and from how it works in other software, but yeah, I got it.
Still, "\S any character that is not a white space character" seems to behave the same as "\w any "word" character". If not, where do they differ?

Thanks

Re: ExtractRegularExpression splits strings incorrect.

Posted: Mon Mar 25, 2013 4:40 pm
by STARGÅTE
Joris wrote: Still, "\S any character that is not a white space character" seems to behave the same as "\w any "word" character". If not, where do they differ?
\w = characters: a-z, A-Z, 0-9 and _
\S = any character that is not white space (space, tab, ...)

So \w+ doesn't match all of "Äpfel": the "Ä" is not a word character here.
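
A rough sketch of how the two classes differ in practice. The test string, the IDs 0 and 1 and the variable names are made up for illustration. How the "Ä" is treated depends on the character tables in use, so no result is claimed for it here beyond what is reported in this thread:

```purebasic
; \w+ = runs of word characters (letters, digits, underscore)
; \S+ = runs of anything that is not white space
Text$ = "Äpfel foo-bar baz_1"

If CreateRegularExpression(0, "\w+") And CreateRegularExpression(1, "\S+")
  Dim Words$(0)
  Dim Chunks$(0)

  NbWords  = ExtractRegularExpression(0, Text$, Words$())
  NbChunks = ExtractRegularExpression(1, Text$, Chunks$())

  ; "-" is not a word character, so "foo" and "bar" are separate matches
  Debug "\w+ matches:"
  For k = 0 To NbWords - 1
    Debug Words$(k)
  Next

  ; "-" is not white space, so "foo-bar" stays in one piece
  Debug "\S+ matches:"
  For k = 0 To NbChunks - 1
    Debug Chunks$(k)
  Next

  FreeRegularExpression(0)
  FreeRegularExpression(1)
Else
  Debug RegularExpressionError()
EndIf
```

In short, \S is the broader class: punctuation such as "-" is matched by \S but not by \w.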

Re: ExtractRegularExpression splits strings incorrect.

Posted: Tue Mar 26, 2013 10:06 am
by Joris
STARGÅTE wrote: \w = characters: a-z, A-Z, 0-9 and _
\S = any character that is not white space (space, tab, ...)

So \w+ doesn't match all of "Äpfel": the "Ä" is not a word character here.
Indeed, it doesn't match that "Ä", but imo it should: "\w any word character"?
(The PB RegularExpression interpretation also differs from other software in this.)