[SOLVED] CompareMemory() with non-ASCII

Post by Oso »

I've been reading quite a lot on the forums about Unicode and UTF-8, as I'm in the process of writing new code. Initially I was a bit unsure whether to use #PB_Unicode or #PB_UTF8 when sending/receiving network data, the reason being that some documentation pages mention that #PB_Unicode is the default, while other pages mention that #PB_UTF8 is the default. I accept that there must be good reasons for that, depending on the typical application of the function.

(1) Would I be correct in saying that #PB_Unicode is necessary if I want to use the below search by using CompareMemory? I got some odd results if the memory contained #PB_UTF8 data, as I think characters under UTF8 are of variable length. The key question perhaps — (2) unless one needs to write data to a file for reading by another application that needs UTF8, is #PB_Unicode generally the best choice internally in PB? Thanks

Code: Select all

; **
; ** Search *Buffer for *SearchBuf, beginning at StartPos.i, ending at EndPos.i
; ** Positions are 1-based byte positions; SearchLen.i is a length in bytes
; **
Procedure.i Search_Receive(StartPos.i, EndPos.i, *Buffer, *SearchBuf, SearchLen.i)

  Protected ChrPos.i, FoundPos.i

  EndPos = EndPos - SearchLen + 1                                   ; Last position at which a full match can start
  For ChrPos = StartPos To EndPos Step 2                            ; Step 2 bytes: one UTF-16 (#PB_Unicode) character
    If CompareMemory(*Buffer + ChrPos - 1, *SearchBuf, SearchLen)   ; Compare the next SearchLen bytes with the search data
      FoundPos = ChrPos                                             ; Remember the byte position of the match
      Break                                                         ; Stop at the first match
    EndIf
  Next

  ProcedureReturn FoundPos                                          ; Byte position of the match, or 0 if not found

EndProcedure
Last edited by Oso on Fri Feb 03, 2023 10:28 am, edited 1 time in total.

Post by STARGÅTE »

Regarding the first point, whether #PB_Unicode or #PB_UTF8 is the default option:
For all functions that handle strings internally in application memory, the default is #PB_Unicode, because its fixed character length makes it fast.
For all functions that handle strings externally, in a file or over the network, the default is #PB_UTF8, because it usually uses less space and is independent of the processor's endianness.

Regarding your questions:
(1) Yes, in this case #PB_Unicode should be preferred. However, why not use CompareMemoryString()?
(2) Yes. Many functions, such as those that pick out parts of a string, are faster because of the fixed character length. In UTF-8 you have to read the string from its beginning to work out the memory position that corresponds to a character position.
However, UTF-8 could be an option when larger Unicode code points (>65535, e.g. for emojis) are needed, which do not fit into PB's Unicode type without breaking the fixed character length.
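
As a rough illustration of the size difference described above, this sketch compares byte counts for the same text in both formats using StringByteLength(). The sample strings are assumptions chosen for the example, not data from this thread.

```purebasic
; Minimal sketch: byte sizes of the same text in both encodings.
; Text$ and Accent$ are illustrative sample values (assumptions).
Define Text$ = "HELLO"
Debug StringByteLength(Text$, #PB_Unicode)   ; two bytes per character (UTF-16)
Debug StringByteLength(Text$, #PB_UTF8)      ; one byte per character for plain ASCII

Define Accent$ = Chr($C5)                    ; "Å", a character outside the ASCII range
Debug StringByteLength(Accent$, #PB_Unicode) ; still one fixed-size two-byte unit
Debug StringByteLength(Accent$, #PB_UTF8)    ; more than one byte in UTF-8
```

StringByteLength() does not count the terminating null, so a buffer meant to hold the string needs extra room for it.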

Post by Oso »

STARGÅTE wrote: Mon Jan 30, 2023 9:10 pm (1) Yes, in this case #PB_Unicode should be preferred. However, why not using CompareMemoryString()? (2) Yes. Many function like picking parts of the string are faster because of the fixed character length. In UTF8 you need to read the whole beginning of a string to extract the correct memory position out from the character position. However, UTF8 could be an option when larger unicode code points (>65535, e.g. for Emojis) are needed which are not fit into PB's unicode type without cracking the fixed character length.
Many thanks STARGÅTE, it's good to have these points wrapped up so clearly. A lot of the time, with a concept that's new, one can easily stumble when it comes to choosing the best practice. :D In fact, the points that you make would be welcome additions to the documentation. I understand what you mean about the limitation of 65535, but I think we're going to be okay on that. To answer your question, I'm receiving network data into a buffer and there isn't a receive network string function — only ReceiveNetworkData(). For performance reasons, I process quite a lot in memory and avoid moving that into a string, at least as far as I can anyway.

Post by idle »

You need to know the data format. UTF-8 is the standard encoding for strings on the web these days, so if you're fetching a web page, either convert your needle to UTF-8 or convert your buffer to PB Unicode. If it's your own server and client, do what suits you.

Unicode is complicated; just take a look at that strcmp module. It would have been easier to do with a code reference.
It's a replacement for CompareMemoryString() and handles the full UTF-16 range. And with the C backend it's just as fast.

Post by STARGÅTE »

Oso wrote: Mon Jan 30, 2023 9:32 pm To answer your question, I'm receiving network data into a buffer and there isn't a receive network string function — only ReceiveNetworkData(). For performance reasons, I process quite a lot in memory and avoid moving that into a string, at least as far as I can anyway.
CompareMemoryString(*String1, *String2 [, Mode [, Length [, Flags]]]) is not a function that takes string arguments. It works on memory buffers but interprets them as strings, which is what you want. You can use it in your Search_Receive() procedure.
But of course, when using Unicode only and testing for a case-sensitive match, CompareMemory() is effectively the same.
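
A minimal sketch of how such a buffer search might look with CompareMemoryString(). The procedure name FindInBuffer, its parameters and the sample strings are hypothetical; it assumes #PB_Unicode data, 0-based byte offsets, and a needle length given in characters.

```purebasic
; Hypothetical helper (not a PureBasic built-in): search a UTF-16 buffer for a needle.
; Returns the 0-based byte offset of the first match, or -1 if there is none.
Procedure.i FindInBuffer(*Buffer, BufferBytes.i, *Needle, NeedleChars.i)
  Protected Offset.i
  Protected LastOffset.i = BufferBytes - NeedleChars * 2   ; last offset where a full match fits

  For Offset = 0 To LastOffset Step 2                       ; advance one UTF-16 character at a time
    If CompareMemoryString(*Buffer + Offset, *Needle, #PB_String_CaseSensitive, NeedleChars, #PB_Unicode) = #PB_String_Equal
      ProcedureReturn Offset
    EndIf
  Next

  ProcedureReturn -1
EndProcedure

; Usage with in-memory strings standing in for a receive buffer (assumption)
Define Haystack$ = "HELLO|ab|DSP|Line01"
Define Needle$ = "DSP"
Debug FindInBuffer(@Haystack$, StringByteLength(Haystack$, #PB_Unicode), @Needle$, Len(Needle$))
```

Swapping #PB_String_CaseSensitive for #PB_String_NoCase would make the search case-insensitive, which plain CompareMemory() cannot do.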

Post by Oso »

idle wrote: Mon Jan 30, 2023 9:40 pm Unicode is complicated just take a look at that strcmp module, it would have been easier to do with a code reference. It's a replacement for comparememorystring and handles the full utf16 range. And with the c backend it's just as fast.
You're right there Idle -- I've spent all of today trying to adapt and fix my code and design a data transfer protocol for receiving my own data from the server end :(

Post by Oso »

STARGÅTE wrote: Mon Jan 30, 2023 10:10 pm CompareMemoryString(*String1, *String2 [, Mode [, Length [, Flags]]]) is not function with string argument, it is a function using memory buffers, but interpreting them as string, like you want to have. You can use it in your Search_Receive() procedure.
Ah, I see what you mean. I didn't realise it took pointers as arguments. That's one for tomorrow to look at... thanks for mentioning this STARGÅTE. :)

Post by idle »

Oso wrote: Mon Jan 30, 2023 10:24 pm
idle wrote: Mon Jan 30, 2023 9:40 pm Unicode is complicated just take a look at that strcmp module, it would have been easier to do with a code reference. It's a replacement for comparememorystring and handles the full utf16 range. And with the c backend it's just as fast.
You're right there Idle -- I've spent all of today trying to adapt and fix my code and design a data transfer protocol for receiving my own data from the server end :(
Have you looked at the JSON library for serializing structures?

Post by Oso »

idle wrote: Tue Jan 31, 2023 7:06 am Have you looked at the json lib to serialize structures.
I've had a look before, but in relation to this project, no I didn't think of that to be honest. I can see what you mean though. I'll have to give that some serious thought.

Post by Oso »

Oso wrote: Mon Jan 30, 2023 10:27 pm
STARGÅTE wrote: Mon Jan 30, 2023 10:10 pm CompareMemoryString(*String1, *String2 [, Mode [, Length [, Flags]]]) is not function with string argument, it is a function using memory buffers, but interpreting them as string, like you want to have. You can use it in your Search_Receive() procedure.
Ah, I see what you mean. I didn't realise it took pointers as arguments. That's one for tomorrow to look at... thanks for mentioning this STARGÅTE. :)
I've been looking at CompareMemoryString() and how it differs from CompareMemory(), following what you said yesterday. I hadn't really noticed the former. It's quite a curious function — in some ways it seems to serve the same purpose as CompareMemory(), because if you know that the two blocks of memory contents contain the same format of string data, it might not matter which is used. However, it does appear to offer something more — there's a greater-than/less-than result and there's also the ability to terminate on null.

Post by idle »

Oso wrote: Tue Jan 31, 2023 10:08 am
idle wrote: Tue Jan 31, 2023 7:06 am Have you looked at the json lib to serialize structures.
I've had a look before, but in relation to this project, no I didn't think of that to be honest. I can see what you mean though. I'll have to give that some serious thought.
I'm looking into it myself and currently experimenting with adding runtime reflection to automate the serialization.

Post by Oso »

idle wrote: Mon Jan 30, 2023 9:40 pm Unicode is complicated just take a look at that strcmp module...
This has caught me out today. The length parameter for PeekS() in #PB_Unicode is not the number of bytes in memory, but the number of two-byte pairs. It must therefore be divided by 2 — at least that's how it looks to me. In other words, the return variable recvlen.i from ReceiveNetworkData() is giving me a value of 72, but I only need to specify 36 in the PeekS() :?

Code: Select all

recvlen.i = ReceiveNetworkData(connectid.i, *Buffer, #TCPMAX)
If recvlen > 0
  PrintN("Received data length  : " + Str(recvlen))
  PrintN("Received string       : " + PeekS(*Buffer, recvlen / 2, #PB_Unicode))   ; Length is in characters, not bytes
  PrintN("Character length      : " + Str(Len(PeekS(*Buffer, recvlen / 2, #PB_Unicode))))
EndIf

Output:
Received string : HELLO|ab|DSP|Line01|ab|DSP|LineTwo|x
Received length : 72
Character length : 36
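
The byte-versus-character distinction can be reproduced without a network connection. In this sketch a locally allocated buffer stands in for the data returned by ReceiveNetworkData(); the sample text and variable names are assumptions.

```purebasic
; Sketch: PeekS() takes a length in characters, so a byte count must be halved for #PB_Unicode.
Define Text$ = "HELLO|ab"                              ; illustrative sample data
Define Bytes.i = StringByteLength(Text$, #PB_Unicode)  ; plays the role of recvlen
Define *Buffer = AllocateMemory(Bytes + 2)             ; extra two bytes for the UTF-16 null

If *Buffer
  PokeS(*Buffer, Text$, -1, #PB_Unicode)               ; write the text as UTF-16
  Debug Bytes                                          ; number of bytes in the buffer
  Debug Len(PeekS(*Buffer, Bytes / 2, #PB_Unicode))    ; number of characters read back
  FreeMemory(*Buffer)
EndIf
```

For UTF-8 data, PeekS() also accepts #PB_UTF8 | #PB_ByteLength, in which case the length is interpreted as bytes.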

Post by idle »

If you include the terminating null character in the data you send, you can call PeekS() with a length of -1.

Post by mk-soft »

Send and receive in UTF8. For further processing you have to convert to Unicode anyway.
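
A minimal sketch of that round trip, with no real network calls: the UTF-8 buffer stands in for what would be passed to SendNetworkData() and filled by ReceiveNetworkData(). The message text and variable names are assumptions.

```purebasic
; Sketch: encode to UTF-8 for transmission, decode back to a PB string afterwards.
Define Message$ = "HELLO|" + Chr($C5) + "|DSP"         ; sample text with one non-ASCII character
Define *Utf8 = UTF8(Message$)                          ; allocated, null-terminated UTF-8 copy

If *Utf8
  ; Bytes to transmit if the terminating null is sent along with the text
  Define SendBytes.i = StringByteLength(Message$, #PB_UTF8) + 1
  Debug SendBytes

  ; On the receiving side, -1 reads up to the null and converts to PB's internal format
  Define Received$ = PeekS(*Utf8, -1, #PB_UTF8)
  Debug Received$

  FreeMemory(*Utf8)                                    ; the buffer returned by UTF8() must be freed
EndIf
```

Sending the null along with the text costs one byte per message but lets the receiver use a length of -1 instead of tracking character counts.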

Post by Oso »

idle wrote: Tue Jan 31, 2023 12:00 pm If you include the null char you can peek with length -1
Ah, good point, that could simplify things.
mk-soft wrote: Tue Jan 31, 2023 12:47 pm Send and receive in UTF8. For further processing you have to convert to Unicode anyway.
Yes, maybe that would be better. For characters in the traditional ASCII range below 128, UTF8 only requires 1 byte, so it would reduce transmission size. A lot of the data is lower-end ASCII but not everything.