RegularExpressionMatchString() have problems with emojis

dige
Addict
Posts: 1247
Joined: Wed Apr 30, 2003 8:15 am
Location: Germany

Post by dige »

Using emojis inside strings with RegularExpressionMatchString() returns wrong (truncated) results.

You can test the code with an emoji here: https://pastebin.com/QhNNR5St
or put your own in place of " :-) Replace with Emoji " (the forum crashes on emojis, so I cannot post one directly).

regex_SC = CreateRegularExpression(#PB_Any, "^[\t]*[\ ]*EnablePbCgi([\s\S]*?)\(([\s\S]*?)^[\s]*DisablePbCgi", #PB_RegularExpression_MultiLine | #PB_RegularExpression_NoCase)

content.s = "EnablePbCgi" + #CRLF$ +
~"()\" :-) Replace with Emoji \"" + #CRLF$ +
"DisablePbCgi"

; Debug StringByteLength(content)
; ShowMemoryViewer(@content, StringByteLength(content))
; CallDebugger

ExamineRegularExpression(regex_SC, content)

If NextRegularExpressionMatch(regex_SC)
  Debug RegularExpressionMatchString(regex_SC) ; Last character is missing
EndIf
"Daddy, I'll run faster, then it is not so far..."
NicTheQuick
Addict
Posts: 1224
Joined: Sun Jun 22, 2003 7:43 pm
Location: Germany, Saarbrücken

Re: RegularExpressionMatchString() have problems with emojis

Post by NicTheQuick »

I guess the issue here is that PureBasic uses UTF-16 (Unicode) and treats every character as a single 16-bit value. Most emojis have code points too large to fit in 16 bits.
Or in other words: you cannot map every Unicode character to a single 16-bit unit. You can write emojis in your PureBasic code because the source file is UTF-8, but after compiling the emoji will either be converted to PureBasic's 16-bit Unicode representation or cause a compiler error.
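
To make the 16-bit point concrete, here is a minimal sketch (not from the original posts) that builds a string from two 16-bit units and asks PureBasic how long it is. The unit values $D83D and $DD12 are only an example of a character outside the 16-bit range, and the label name EmojiUnits is made up for illustration. Len() should count two characters for what is visually one emoji.

```purebasic
; Assumption: $D83D $DD12 is used as an example of a character that
; does not fit into a single 16-bit unit.
DataSection
  EmojiUnits:
  Data.u $D83D, $DD12, 0 ; two 16-bit units plus the null terminator
EndDataSection

emoji.s = PeekS(?EmojiUnits, -1, #PB_Unicode)

Debug Len(emoji)                 ; number of 16-bit units PureBasic counts
Debug StringByteLength(emoji)    ; bytes used, without the terminator
Debug Hex(Asc(Mid(emoji, 1, 1))) ; value of the first unit
Debug Hex(Asc(Mid(emoji, 2, 1))) ; value of the second unit
```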
The english grammar is freeware, you can use it freely - But it's not Open Source, i.e. you can not change it or publish it in altered way.
STARGÅTE
Addict
Posts: 2067
Joined: Thu Jan 10, 2008 1:30 pm
Location: Germany, Glienicke

Re: RegularExpressionMatchString() have problems with emojis

Post by STARGÅTE »

But the special character is stored correctly as a surrogate pair:
3D D8 12 DD
So the PCRE library probably reads it as one character and returns a shorter (but correct) length, while for PureBasic it is two characters, so the last character of the match gets lost.
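
As a worked example of the surrogate arithmetic, the four bytes above can be read as two little-endian 16-bit units and combined into one code point. This is just a sketch using the standard UTF-16 decoding formula; the variable names are made up for illustration.

```purebasic
; Bytes 3D D8 12 DD read as little-endian 16-bit units
high = $D83D ; high surrogate (range $D800-$DBFF)
low  = $DD12 ; low surrogate  (range $DC00-$DFFF)

; Standard UTF-16 decoding: 10 bits from each unit, offset by $10000
codePoint = $10000 + ((high - $D800) << 10) + (low - $DC00)

Debug Hex(codePoint) ; 1F512
```

One code point for PCRE, two 16-bit units for PureBasic: that is the off-by-one that would explain the missing last character.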
PB 6.01 ― Win 10, 21H2 ― Ryzen 9 3900X, 32 GB ― NVIDIA GeForce RTX 3080 ― Vivaldi 6.0 ― www.unionbytes.de
Lizard - Script language for symbolic calculations and more | Typeface - Sprite-based font include/module
dige
Addict
Posts: 1247
Joined: Wed Apr 30, 2003 8:15 am
Location: Germany

Re: RegularExpressionMatchString() have problems with emojis

Post by dige »

After 3 hours of searching for a bug, I just found out that I had run into this RegEx bug again :evil:
But good news! Until RegularExpressionMatchString() is fixed, you can use ExtractRegularExpression() - it works!

Code: Select all

txt.s = "dummy" + #LF$ +
        "EnablePbCgi" + #LF$ +
        "Procedure.s Hallo()" + #CRLF$ +
        "; -> Insert some Emojis here <- " + #LF$ +
        "EndProcedure" + #LF$ +
        "DisablePbCgi" + #LF$ +
        "dummy"


regex = CreateRegularExpression(#PB_Any, "^[\t]*[\ ]*EnablePbCgi" + "([\s\S]*?)\(([\s\S]*?)^[\s]*DisablePbCgi", #PB_RegularExpression_MultiLine | #PB_RegularExpression_NoCase)

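If regex
  
  ; RegularExpressionMatchString() drops the last character when the match contains emojis
  If ExamineRegularExpression(regex, txt)
    While NextRegularExpressionMatch(regex)
      Debug RegularExpressionMatchString(regex)
    Wend
  EndIf
  
  ; Workaround: ExtractRegularExpression() returns the complete match
  Dim Result$(0)
  
  NbResults = ExtractRegularExpression(regex, txt, Result$())
  
  For i = 0 To NbResults - 1
    Debug Result$(i)
  Next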

Else
  MessageRequester("Error", RegularExpressionError())
EndIf


"Daddy, I'll run faster, then it is not so far..."
#NULL
Addict
Posts: 1440
Joined: Thu Aug 30, 2007 11:54 pm
Location: right here

Re: RegularExpressionMatchString() have problems with emojis

Post by #NULL »

There was a coding question (regex and emoji) with the same problem (I think).
Fred
Administrator
Posts: 16619
Joined: Fri May 17, 2002 4:39 pm
Location: France

Re: RegularExpressionMatchString() have problems with emojis

Post by Fred »

PB is UCS-2 only, so it doesn't support this case.
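
Given that limitation, one option is to check whether a string contains surrogate units before relying on RegularExpressionMatchString(), and fall back to ExtractRegularExpression() if it does. The following is only a sketch: ContainsSurrogates() is a hypothetical helper, not a PureBasic function, and the label LockEmoji and its unit values are made up for the example.

```purebasic
; Hypothetical helper (not part of PureBasic): returns #True if the
; string contains any UTF-16 surrogate unit ($D800-$DFFF).
Procedure.i ContainsSurrogates(text.s)
  Protected *c.Character = @text
  If *c = 0
    ProcedureReturn #False ; empty string, nothing to scan
  EndIf
  While *c\c
    If *c\c >= $D800 And *c\c <= $DFFF
      ProcedureReturn #True
    EndIf
    *c + SizeOf(Character) ; advance one 16-bit unit
  Wend
  ProcedureReturn #False
EndProcedure

DataSection
  LockEmoji:
  Data.u $D83D, $DD12, 0 ; example surrogate pair plus terminator
EndDataSection

Debug ContainsSurrogates("EnablePbCgi")                          ; 0
Debug ContainsSurrogates(PeekS(?LockEmoji, -1, #PB_Unicode))     ; 1
```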