RegularExpressionMatchString() have problems with emojis
Posted: Fri Apr 12, 2019 10:42 am
by dige
Using emojis inside strings with RegularExpressionMatchString() returns wrong (truncated) results.
You can test the code with an emoji here:
https://pastebin.com/QhNNR5St
or put your own in place of "Replace with Emoji" (the forum crashes on emojis)
regex_SC = CreateRegularExpression(#PB_Any, "^[\t]*[\ ]*EnablePbCgi([\s\S]*?)\(([\s\S]*?)^[\s]*DisablePbCgi", #PB_RegularExpression_MultiLine | #PB_RegularExpression_NoCase)
content.s = "EnablePbCgi" + #CRLF$ +
            ~"()\" Replace with Emoji \"" + #CRLF$ +
            "DisablePbCgi"
; Debug StringByteLength(content)
; ShowMemoryViewer(@content, StringByteLength(content))
; CallDebugger
ExamineRegularExpression(regex_SC, content)
If NextRegularExpressionMatch(regex_SC)
  Debug RegularExpressionMatchString(regex_SC) ; Last character is missing
EndIf
Re: RegularExpressionMatchString() have problems with emojis
Posted: Fri Apr 12, 2019 12:36 pm
by NicTheQuick
I guess the issue here is that PureBasic uses UTF-16 (Unicode), which means that every character is stored using 16 bits. Most emojis have code points higher than 16 bits can store.
Or in other words: not every character you can write in UTF-8 fits into a single 16-bit UTF-16 unit. You can write emojis in your PureBasic code because the file uses UTF-8, but after compiling, the emoji will be mapped to Unicode or there will be a compiler error.
Re: RegularExpressionMatchString() have problems with emojis
Posted: Fri Apr 12, 2019 12:47 pm
by STARGÅTE
But the special character is correctly stored as a surrogate pair:
3D D8 12 DD
This probably means that the PCRE library reads it as one character and returns a shorter (but correct) length, whereas for PureBasic it is two characters, so the last character at the end is lost.
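The bytes quoted above can be reproduced by hand. This is only a minimal sketch, assuming the emoji in question is U+1F512 (the code point that encodes to 3D D8 12 DD in little-endian UTF-16); the variable names are illustrative.

```purebasic
; Split a code point above U+FFFF into its UTF-16 surrogate pair.
codePoint = $1F512                  ; assumption: the emoji behind 3D D8 12 DD
offset    = codePoint - $10000      ; 20-bit value to distribute over two units
high      = $D800 + (offset >> 10)  ; upper 10 bits -> high surrogate
low       = $DC00 + (offset & $3FF) ; lower 10 bits -> low surrogate

emoji.s = Chr(high) + Chr(low)      ; two 16-bit units in the string

Debug Hex(high) + " " + Hex(low)    ; D83D DD12
Debug Len(emoji)                    ; counts 16-bit units, not code points
Debug StringByteLength(emoji)       ; size in bytes, without the terminating null
```

If Len() reports two units for the single emoji, a match length counted in code points would come out one short per emoji, which would fit the missing last character.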
Re: RegularExpressionMatchString() have problems with emojis
Posted: Wed Nov 25, 2020 10:47 am
by dige
After three hours of hunting for a bug, I just found out that I had run into this RegEx bug again.
But good news! Until RegularExpressionMatchString() is fixed, you can use ExtractRegularExpression() instead. It works!
txt.s = "dummy" + #LF$ +
"EnablePbCgi" + #LF$ +
"Procedure.s Hallo()" + #CRLF$ +
"; -> Insert some Emojis here <- " + #LF$ +
"EndProcedure" + #LF$ +
"DisablePbCgi" + #LF$ +
"dummy"
regex = CreateRegularExpression(#PB_Any, "^[\t]*[\ ]*EnablePbCgi" + "([\s\S]*?)\(([\s\S]*?)^[\s]*DisablePbCgi", #PB_RegularExpression_MultiLine | #PB_RegularExpression_NoCase)
If regex
  ; Variant 1: iterate over the matches (truncated when emojis are present)
  If ExamineRegularExpression(regex, txt)
    While NextRegularExpressionMatch(regex)
      Debug RegularExpressionMatchString(regex)
    Wend
  EndIf
  ; Variant 2: extract all matches into an array (works with emojis)
  Dim Result$(0)
  NbResults = ExtractRegularExpression(regex, txt, Result$())
  For i = 0 To NbResults - 1
    Debug Result$(i)
  Next
Else
  MessageRequester("Error", RegularExpressionError())
EndIf
Re: RegularExpressionMatchString() have problems with emojis
Posted: Wed Nov 25, 2020 7:44 pm
by #NULL
There was a coding question (regex and emoji) with the same problem (I think).
Re: RegularExpressionMatchString() have problems with emojis
Posted: Thu Feb 16, 2023 6:14 pm
by Fred
PB is UCS-2 only, so it doesn't support such cases.
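Since characters outside the 16-bit range are not supported here, it can help to detect them up front and use the ExtractRegularExpression() workaround from earlier in the thread. The following is a rough sketch, not an official API: ContainsSurrogates() is a hypothetical helper name, and it assumes Asc() and Mid() operate on single 16-bit units.

```purebasic
; Hypothetical helper: returns #True if the string holds any UTF-16
; surrogate unit ($D800..$DFFF), i.e. a character outside UCS-2.
Procedure.i ContainsSurrogates(text.s)
  Protected i, unit
  For i = 1 To Len(text)
    unit = Asc(Mid(text, i, 1))         ; value of the 16-bit unit at position i
    If unit >= $D800 And unit <= $DFFF
      ProcedureReturn #True
    EndIf
  Next
  ProcedureReturn #False
EndProcedure

Debug ContainsSurrogates("plain ASCII")
Debug ContainsSurrogates("lock: " + Chr($D83D) + Chr($DD12))
```

A check like this only flags the input; it does not change how the regex functions count characters.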