RegularExpressionMatchString() have problems with emojis

Posted: Fri Apr 12, 2019 10:42 am
by dige
Using emojis inside strings with RegularExpressionMatchString() produces wrong (truncated) results.

You can test the code with an emoji here: https://pastebin.com/QhNNR5St
or put your own in place of " :-) Replace with Emoji " (the forum crashes when a post contains emojis).

Code: Select all

regex_SC = CreateRegularExpression(#PB_Any, "^[\t]*[\ ]*EnablePbCgi([\s\S]*?)\(([\s\S]*?)^[\s]*DisablePbCgi", #PB_RegularExpression_MultiLine | #PB_RegularExpression_NoCase)

content.s = "EnablePbCgi" + #CRLF$ +
~"()\" :-) Replace with Emoji \"" + #CRLF$ +
"DisablePbCgi"

; Debug StringByteLength(content)
; ShowMemoryViewer(@content, StringByteLength(content))
; CallDebugger

ExamineRegularExpression(regex_SC, content)

If NextRegularExpressionMatch(regex_SC)
  Debug RegularExpressionMatchString(regex_SC) ; Last character is missing
EndIf

Re: RegularExpressionMatchString() have problems with emojis

Posted: Fri Apr 12, 2019 12:36 pm
by NicTheQuick
I guess the issue here is that PureBasic uses UTF-16 (Unicode), which means that every character is stored using 16 bits. Most emojis have code points higher than 16 bits can store.
Or in other words: not every character that UTF-8 can encode fits into a single 16-bit unit. You can write emojis in your PureBasic code because the source file is UTF-8, but after compiling, the emoji will be mapped to Unicode or there will be a compiler error.

Re: RegularExpressionMatchString() have problems with emojis

Posted: Fri Apr 12, 2019 12:47 pm
by STARGÅTE
But the special character is correctly stored as a surrogate pair:
3D D8 12 DD
This probably means that the PCRE library reads it as one character and returns a shorter (but correct) length, whereas for PureBasic it is two characters, so the last character at the end is lost.
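
To make the surrogate explanation concrete, here is a small Python sketch (Python only because it is easy to run anywhere; it is not PureBasic and does not call PCRE). It decodes the four bytes quoted above and then models the suspected mismatch: a length counted in code points applied to a buffer of 16-bit units. The helper name truncated_match is hypothetical, and the model is only a guess at the cause, not a confirmed description of what PureBasic does internally.

```python
# Bytes quoted in the post above, as they sit in memory (little-endian UTF-16).
raw = bytes.fromhex("3D D8 12 DD")

# Split into 16-bit code units: this is the count a UCS-2 string length reports.
units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# Decoding as UTF-16 merges the surrogate pair into a single code point.
text = raw.decode("utf-16-le")

print([hex(u) for u in units])  # two code units (high and low surrogate)
print(hex(ord(text)))           # one code point above U+FFFF


def truncated_match(s):
    """Hypothetical model of the bug: the length is counted in code points,
    then applied to a buffer that holds 16-bit code units."""
    length_in_code_points = len(s)
    buf = s.encode("utf-16-le")
    cut = buf[: length_in_code_points * 2]  # 2 bytes per 16-bit unit
    return cut.decode("utf-16-le", errors="replace")


# With one emoji in the string, exactly one trailing character goes missing.
print(truncated_match('()" ' + text + ' "'))
```

Under this model every character outside the 16-bit range costs one character at the end of the match, which would fit the "last character is missing" observation in the first post.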

Re: RegularExpressionMatchString() have problems with emojis

Posted: Wed Nov 25, 2020 10:47 am
by dige
After 3 hours of searching for a bug, I just found out that I had run into this RegEx bug again :evil:
But good news! Until RegularExpressionMatchString() is fixed, you can use ExtractRegularExpression() - it works!

Code: Select all

 txt.s = "dummy" + #LF$ +
        "EnablePbCgi" + #LF$ +
        "Procedure.s Hallo()" + #CRLF$ +
        "; -> Insert some Emojis here <- " + #LF$ +
        "EndProcedure" + #LF$ +
        "DisablePbCgi" + #LF$ +
        "dummy"


regex = CreateRegularExpression(#PB_Any, "^[\t]*[\ ]*EnablePbCgi" + "([\s\S]*?)\(([\s\S]*?)^[\s]*DisablePbCgi", #PB_RegularExpression_MultiLine | #PB_RegularExpression_NoCase)

If regex
    
  
  If ExamineRegularExpression(regex, txt)
    While NextRegularExpressionMatch(regex)
      Debug RegularExpressionMatchString(regex)
    Wend
  EndIf
  
  Dim Result$(0)
  
  NbResults = ExtractRegularExpression(regex, txt, Result$())
  
  For i = 0 To NbResults - 1
    Debug Result$(i)
  Next

Else
  MessageRequester("Error", RegularExpressionError())
EndIf



Re: RegularExpressionMatchString() have problems with emojis

Posted: Wed Nov 25, 2020 7:44 pm
by #NULL
There was a coding question (regex and emoji) with the same problem (I think).

Re: RegularExpressionMatchString() have problems with emojis

Posted: Thu Feb 16, 2023 6:14 pm
by Fred
PB is UCS-2 only, so it doesn't support this case.
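
For anyone who wants to check up front whether a text can hit this limitation: UCS-2 covers only code points up to U+FFFF, so any character above that needs a surrogate pair. The following Python sketch (again Python for easy testing, not PureBasic; the function names are made up for illustration) lists such characters and compares the two ways of counting length.

```python
def chars_outside_bmp(s):
    """Return the characters of s whose code point does not fit in 16 bits."""
    return [ch for ch in s if ord(ch) > 0xFFFF]


def utf16_unit_count(s):
    """Number of 16-bit code units s occupies when stored as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


sample = "; -> \U0001F512 <-"

print(chars_outside_bmp(sample))   # characters that need a surrogate pair
print(len(sample))                 # length in code points
print(utf16_unit_count(sample))    # length in 16-bit units
```

If the first list is empty, both counts agree and the truncation described in this thread should not occur for that text.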