ExtractRegularExpression splits strings incorrect.

Joris
Addict
Posts: 890
Joined: Fri Oct 16, 2009 10:12 am
Location: BE

Post by Joris »

http://www.purebasic.fr/english/viewtop ... 18#p409018

Following up on the linked topic, I tried some regular expressions like the one shown below, using the reference linked from the help file: http://www.pcre.org/pcre.txt
Shouldn't ExtractRegularExpression() split the string into words, instead of single characters, when given the \w expression? The reference says:
Another use of backslash is for specifying generic character types:
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
\V any character that is not a vertical white space character
\w any "word" character
\W any "non-word" character
Try the source below with \w or \W, and with \s or \S. Imo the lowercase and uppercase forms of each expression should give opposite results.
It looks like the text can only be split into single characters, not into words, while imo \w and \W really stand for words.

Code: Select all

If CreateRegularExpression(0, "\w", #PB_RegularExpression_DotAll)
  Dim Result$(0)
  NbFound = ExtractRegularExpression(0, "abC ABc zbA abc", Result$())
  Debug NbFound
  For k = 0 To NbFound - 1
    Debug Result$(k)
  Next
Else
  Debug RegularExpressionError()
EndIf
Thanks
Yeah I know, but keep in mind ... Leonardo da Vinci was also an autodidact.
STARGÅTE
Addict
Posts: 2232
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Post by STARGÅTE »

\w matches only a single word character. If you want to extract whole words, use \w+
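
As a minimal sketch of that suggestion, here is the code from the first post with the pattern changed to \w+ (the DotAll flag is dropped because the pattern contains no dot, and the expression is freed at the end). The values in the comments assume PCRE's default ASCII-only word characters:

Code: Select all

If CreateRegularExpression(0, "\w+")
  Dim Result$(0)
  ; "+" means "one or more", so each run of word characters is one match
  NbFound = ExtractRegularExpression(0, "abC ABc zbA abc", Result$())
  Debug NbFound            ; 4
  For k = 0 To NbFound - 1
    Debug Result$(k)       ; abC, ABc, zbA, abc
  Next
  FreeRegularExpression(0)
Else
  Debug RegularExpressionError()
EndIf

With the plain \w pattern the same input gives 12 one-character matches, which is the behaviour described in the first post.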
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more
Typeface - Sprite-based font include/module

Post by Joris »

STARGÅTE wrote: \w matches only a single word character. If you want to extract whole words, use \w+
OK, nice, that works. It differs from the specification and from how it works in other software, but yeah, I got it.
Still, "\S any character that is not a white space character" then seems to be the same as "\w any "word" character". If not, where do they differ?

Thanks

Post by STARGÅTE »

Joris wrote: Still, "\S any character that is not a white space character" then seems to be the same as "\w any "word" character". If not, where do they differ?
\w = only these characters: a-z, A-Z, 0-9 and _
\S = any character that is not white space (space, tab, ...)

So \w+ doesn't match "Äpfel" as a whole word, because "Ä" is not in the \w set.
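
To see where the two classes differ, a small sketch can run both patterns over the same text. ShowMatches is a hypothetical helper written for this illustration, not part of PureBasic, and the sample text is an assumption chosen to contain "-" and ".", which are non-white-space but not word characters:

Code: Select all

; Hypothetical helper: prints every match of Pattern$ in Text$
Procedure ShowMatches(Pattern$, Text$)
  Protected Regex, Count, i
  Protected Dim Result$(0)
  Regex = CreateRegularExpression(#PB_Any, Pattern$)
  If Regex
    Count = ExtractRegularExpression(Regex, Text$, Result$())
    Debug Pattern$ + " -> " + Str(Count) + " matches"
    For i = 0 To Count - 1
      Debug "  " + Result$(i)
    Next
    FreeRegularExpression(Regex)
  Else
    Debug RegularExpressionError()
  EndIf
EndProcedure

ShowMatches("\w+", "foo-bar x_1 a.b")  ; 5 matches: foo, bar, x_1, a, b
ShowMatches("\S+", "foo-bar x_1 a.b")  ; 3 matches: foo-bar, x_1, a.b

\S+ only stops at white space, so punctuation stays inside a match, while \w+ also stops at anything outside a-z, A-Z, 0-9 and _.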

Post by Joris »

STARGÅTE wrote: \w = only these characters: a-z, A-Z, 0-9 and _
\S = any character that is not white space (space, tab, ...)

So \w+ doesn't match "Äpfel" as a whole word, because "Ä" is not in the \w set.
Indeed, it doesn't match that "Ä", but imo it should: "\w any word character"?
(The PB RegularExpression interpretation also differs from other software in this respect.)