ExtractRegularExpression splits strings incorrect.

Joris · Post by **Joris** » Mon Mar 25, 2013 2:33 pm

http://www.purebasic.fr/english/viewtop ... 18#p409018

Following up on the linked topic, I tried some regular expressions like the one shown below, using the PCRE documentation referenced in the help file: http://www.pcre.org/pcre.txt
Shouldn't ExtractRegularExpression() split the string into words instead of single characters when given the \w expression?

Another use of backslash is for specifying generic character types:
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
\V any character that is not a vertical white space character
\w any "word" character
\W any "non-word" character

Try the source below with \w or \W, and with \s or \S. Imo the lower-case and upper-case forms of each expression should do the opposite of each other.
It looks like the text can only be split into single characters, not into words, while \w and \W really stand for words imo.

Code:

If CreateRegularExpression(0, "\w", #PB_RegularExpression_DotAll)
  Dim Result$(0)
  NbFound = ExtractRegularExpression(0, "abC ABc zbA abc", Result$())
  Debug NbFound
  For k = 0 To NbFound - 1
    Debug Result$(k)
  Next
Else
  Debug RegularExpressionError()
EndIf

Thanks

STARGÅTE · Post by **STARGÅTE** » Mon Mar 25, 2013 2:57 pm

\w matches only a single word character. If you want to extract whole words, use \w+
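
To make that concrete, here is a minimal sketch that reuses the test string from the first post and only changes the pattern to \w+. The FreeRegularExpression() call at the end is an addition for tidiness and is not part of the original code.

```purebasic
; Sketch: same input as in the first post, but with the "+" quantifier,
; so each match is a run of one or more word characters.
If CreateRegularExpression(0, "\w+")
  Dim Result$(0)
  NbFound = ExtractRegularExpression(0, "abC ABc zbA abc", Result$())
  Debug NbFound                ; number of matches found
  For k = 0 To NbFound - 1
    Debug Result$(k)           ; one whole word per line
  Next
  FreeRegularExpression(0)     ; release the compiled expression
Else
  Debug RegularExpressionError()
EndIf
```

With this pattern the debug output should show 4 matches (abC, ABc, zbA, abc) instead of the 12 single characters that plain \w returns.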

Joris · Post by **Joris** » Mon Mar 25, 2013 3:23 pm

STARGÅTE wrote: \w matches only a single word character. If you want to extract whole words, use \w+

Ok, nice, that works. It differs from the specification and from how it works in other software, but yeah, I got it.
Still, "\S any character that is not a white space character" then becomes the same as "\w any "word" character". Or else, where do they differ?

Thanks

STARGÅTE · Post by **STARGÅTE** » Mon Mar 25, 2013 4:40 pm

Joris wrote: Still, "\S any character that is not a white space character" then becomes the same as "\w any "word" character". Or else, where do they differ?

\w = the characters a-z, A-Z, 0-9 and _
\S = any character that is not white space (space, tab, ...)

So \w+ doesn't match the whole of "Äpfel", because "Ä" is not in the \w set, while \S+ does.
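
A small sketch to see the difference side by side. The sample text and the variable names (Text$, Words$, Chunks$) are made up for illustration, and Chr(196) is used for "Ä" only to avoid depending on the source file encoding.

```purebasic
; Sketch: run \w+ and \S+ over the same input and list what each one extracts.
; Text$ is an illustrative sample, not taken from the thread.
Text$ = Chr(196) + "pfel foo_1 x-y"   ; Chr(196) is "Ä"

If CreateRegularExpression(0, "\w+") And CreateRegularExpression(1, "\S+")
  Dim Words$(0)
  Dim Chunks$(0)

  Debug "\w+ matches:"
  n = ExtractRegularExpression(0, Text$, Words$())
  For k = 0 To n - 1
    Debug Words$(k)
  Next

  Debug "\S+ matches:"
  n = ExtractRegularExpression(1, Text$, Chunks$())
  For k = 0 To n - 1
    Debug Chunks$(k)
  Next
Else
  Debug RegularExpressionError()
EndIf
```

\S+ only breaks at white space, so it should return the three space-separated chunks unchanged. \w+ also breaks at "-" and, going by the behaviour described in this thread, at "Ä", so it returns shorter pieces.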

Joris · Post by **Joris** » Tue Mar 26, 2013 10:06 am

STARGÅTE wrote: \w = the characters a-z, A-Z, 0-9 and _
\S = any character that is not white space (space, tab, ...)

So \w+ doesn't match the whole of "Äpfel".

Indeed, it doesn't match that "Ä", but imo it should: "\w any word character" ???
(The PB RegularExpression interpretation also differs from other software in this respect.)

PureBasic Forums - English
