
Convert all special chars in a text into regular letters...

Posted: Tue Jan 08, 2013 11:01 pm
by Andre
I'm sure I've seen something before, but I couldn't find any useful code example... :oops:

I want a small function that converts all special chars occurring in longer text strings into their regular letters.

For example take the text and look for all "special chars" listed in the first line, and convert them into their respective counterparts listed on the second line:

Code:

   ÄÂäãåæçÐÉÈéèïíîñÖöøÞŠšÜüÚúý
   AAaaaacDEEeeiiinOoopSsUuUuy
Of course I can do this using a loop with several ReplaceString() and similar calls, but I hope there is a faster way (usable for long texts / thousands of text strings)...

Any small example would be very welcome! Thanks! :D

Re: Convert all special chars in a text into regular letters

Posted: Tue Jan 08, 2013 11:19 pm
by Little John
Hi André,

maybe the code in this message is useful for you.

The second procedure, written in ASM, is considerably faster than the first one, but will probably not work as desired when compiled in Unicode mode. The first procedure should also work in Unicode mode if you replace .b and \b with .c and \c.

Regards, Little John

Re: Convert all special chars in a text into regular letters

Posted: Wed Jan 09, 2013 12:57 am
by idle
Just build and use a look up table

partial example, I'm not even sure what some characters map to

Code:

Global Dim aLookUp(256)

Procedure init_lookUP()
   Protected sout.s 
   For a = 0 To 255 
      If a < 192 
        aLookUp(a) = a 
     ElseIf a >=192 And a <= 198     
        aLookUp(a)=65 
     ElseIf a = 199 
        alookup(a) = 67 
     ElseIf a >= 200 And a <=203 
        aLookUp(a) = 69 
     ElseIf a >=204 And a <=207 
        aLookUp(a) =  73 
     ElseIf a = 208 
        aLookUp(a) = 68    
     ElseIf a = 209 
        aLookUp(a) = 78
     ElseIf a >=210 And a <= 214 
        aLookUp(a) = 79 
     ElseIf a >=217 And a <= 220 
        aLookUp(a) = 85 
     ElseIf a = 221 
        aLookUp(a) = 89 
     Else 
        aLookUp(a) = a 
     EndIf    
  Next
 
 EndProcedure 
 
Procedure convert(*Input,len) 
  Protected *pa.Ascii
   *pa = *input 
   For a = 1 To len 
      *pa\a = aLookUp(*pa\a) 
     *pa+1 
   Next 
EndProcedure 

Init_lookUP() 
Define strA.s = "ÄÂÐÉÈÖÜÚ"

Convert(@strA,Len(strA)) 
Debug strA 
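
The procedure above steps through the string one byte at a time, which matches programs compiled in ASCII mode. As a minimal sketch (not idle's code), the same table lookup can walk the string with a .Character pointer instead, so the step size follows the compiler's character width. The names LookUpSketch(), InitLookUpSketch() and ConvertInPlaceSketch() and the two filled ranges are assumptions chosen for illustration:

```purebasic
EnableExplicit

Global Dim LookUpSketch.c(255)        ; hypothetical table, one entry per code 0..255

Procedure InitLookUpSketch()
  Protected a
  For a = 0 To 255
    LookUpSketch(a) = a               ; default: keep the character unchanged
  Next
  For a = 192 To 198                  ; same 'A' range as in the code above
    LookUpSketch(a) = 'A'
  Next
  For a = 200 To 203                  ; same 'E' range as in the code above
    LookUpSketch(a) = 'E'
  Next
EndProcedure

Procedure ConvertInPlaceSketch(*Text.Character)
  ; Walks the string until the terminating null character.
  While *Text\c
    If *Text\c <= 255                 ; codes above 255 are not in this small table
      *Text\c = LookUpSketch(*Text\c)
    EndIf
    *Text + SizeOf(Character)         ; 1 byte in ASCII mode, 2 bytes in Unicode mode
  Wend
EndProcedure

InitLookUpSketch()
Define strB.s = "ÄÂÉÈ"
ConvertInPlaceSketch(@strB)
Debug strB
```

Because the loop stops at the null terminator, no length argument is needed in this variant.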


Re: Convert all special chars in a text into regular letters

Posted: Tue Jan 15, 2013 10:44 pm
by Andre
Thank you friends, for your suggestions! I just haven't had the time to test / implement the example codes...

This can still take some time, as I'm currently finishing another part of my project, and after this I have to create the German PB5.10 docs... :wink:

Re: Convert all special chars in a text into regular letters

Posted: Sat Mar 02, 2013 12:05 am
by Andre
Thank you guys for all your help. :D

I've chosen the example idle posted. Here is my adapted (and commented) version, which I now use in my project and want to share with you:

Code:

Global Dim aLookUp(256)

Procedure init_lookUP()
  Protected a
  For a = 0 To 255 
    If a < 192 
      aLookUp(a) = a 
    ElseIf a >=192 And a <= 195      ; 'A' like chars, but we don't convert german umlaut 'Ä' (=196)
      aLookUp(a) = 65 
    ElseIf a >=197 And a <= 198      ; 'A' like chars
      aLookUp(a) = 65 
    ElseIf a = 199                   ; 'C' like char
      alookup(a) = 67 
    ElseIf a >= 200 And a <=203      ; 'E' like chars
      aLookUp(a) = 69 
    ElseIf a >=204 And a <=207       ; 'I' like chars
      aLookUp(a) = 73 
    ElseIf a = 208                   ; 'D' like char
      aLookUp(a) = 68    
    ElseIf a = 209                   ; 'N' like char
      aLookUp(a) = 78
    ElseIf a >=210 And a <= 213      ; 'O' like chars, but we don't convert german umlaut 'Ö' (=214)
      aLookUp(a) = 79 
    ElseIf a = 216                   ; 'O' like char
      aLookUp(a) = 79 
    ElseIf a >=217 And a <= 219      ; 'U' like chars, but we don't convert german umlaut 'Ü' (=220)
      aLookUp(a) = 85 
    ElseIf a = 221                   ; 'Y' like char
      aLookUp(a) = 89 
    ElseIf a >=224 And a <= 227      ; 'a' like chars, but we don't convert german umlaut 'ä' (=228)
      aLookUp(a) = 97 
    ElseIf a >=229 And a <= 230      ; 'a' like chars
      aLookUp(a) = 97 
    ElseIf a = 231                   ; 'c' like char
      alookup(a) = 99 
    ElseIf a >= 232 And a <=235      ; 'e' like chars
      aLookUp(a) = 101 
    ElseIf a >=236 And a <=239       ; 'i' like chars
      aLookUp(a) = 105 
    ElseIf a = 241                   ; 'n' like char
      aLookUp(a) = 110
    ElseIf a >=242 And a <= 245      ; 'o' like chars, but we don't convert german umlaut 'ö' (=246)
      aLookUp(a) = 111 
    ElseIf a = 248                   ; 'o' like char
      aLookUp(a) = 111
    ElseIf a >=249 And a <= 251      ; 'u' like chars, but we don't convert german umlaut 'ü' (=252)
      aLookUp(a) = 117 
    ElseIf a = 253                   ; 'y' like char
      aLookUp(a) = 121 
    ElseIf a = 255                   ; 'y' like char
      aLookUp(a) = 121 
    Else 
      aLookUp(a) = a 
    EndIf    
  Next 
EndProcedure 

Procedure ConvertSpecialChars(*Input, len)
  Protected *pa.Ascii
  Protected a
  *pa = *input 
  For a = 1 To len 
    *pa\a = aLookUp(*pa\a) 
    *pa+1 
  Next 
EndProcedure 

Init_lookUP() 

; Example:
Define strA.s = "ÄÂÐÉÈÖÜÚàâæçê"
Debug "Original: " + strA
ConvertSpecialChars(@strA,Len(strA)) 
Debug "Converted: " + strA 

Re: Convert all special chars in a text into regular letters

Posted: Sat Mar 02, 2013 4:08 am
by MachineCode

Re: Convert all special chars in a text into regular letters

Posted: Sat Mar 02, 2013 5:29 am
by idle
converted your lookup table into data section to save the need to init

Code:

Procedure ConvertSpecialChars(*Input, len)
    Protected *pa.Ascii,*pb.Ascii,*mem 
    Protected a
    *mem = ?lookupTable:
    *pa = *input 
    For a = 1 To len 
      *pb = *mem + *pa\a  
      *pa\a = *pb\a   
      *pa+1 
   Next 
  
EndProcedure 

DataSection : lookupTable:  
    Data.a $0,$1,$2,$3,$4,$5,$6,$7,$8,$9,$A,$B,$C,$D,$E,$F,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$1A,$1B,$1C,$1D,$1E,$1F,$20
    Data.a $21,$22,$23,$24,$25,$26,$27,$28,$29,$2A,$2B,$2C,$2D,$2E,$2F,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$3A,$3B,$3C,$3D,$3E
    Data.a $3F,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$4A,$4B,$4C,$4D,$4E,$4F,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$5A,$5B,$5C
    Data.a $5D,$5E,$5F,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$6A,$6B,$6C,$6D,$6E,$6F,$70,$71,$72,$73,$74,$75,$76,$77,$78,$79,$7A
    Data.a $7B,$7C,$7D,$7E,$7F,$80,$81,$82,$83,$84,$85,$86,$87,$88,$89,$8A,$8B,$8C,$8D,$8E,$8F,$90,$91,$92,$93,$94,$95,$96,$97,$98
    Data.a $99,$9A,$9B,$9C,$9D,$9E,$9F,$A0,$A1,$A2,$A3,$A4,$A5,$A6,$A7,$A8,$A9,$AA,$AB,$AC,$AD,$AE,$AF,$B0,$B1,$B2,$B3,$B4,$B5
    Data.a $B6,$B7,$B8,$B9,$BA,$BB,$BC,$BD,$BE,$BF,$41,$41,$41,$41,$C4,$41,$41,$43,$45,$45,$45,$45,$49,$49,$49,$49,$44,$4E,$4F,$4F
    Data.a $4F,$4F,$D6,$D7,$4F,$55,$55,$55,$DC,$59,$DE,$DF,$61,$61,$61,$61,$E4,$61,$61,$63,$65,$65,$65,$65,$69,$69,$69,$69,$F0,$6E
    Data.a $6F,$6F,$6F,$6F,$F6,$F7,$6F,$75,$75,$75,$FC,$79,$FE,$79
EndDataSection 

; Example:
Define strA.s = "ÄÂÐÉÈÖÜÚàâæçê"
Debug "Original: " + strA
ConvertSpecialChars(@strA,Len(strA)) 
Debug "Converted: " + strA 
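
Because the procedure adds the character code (0 to 255) to the table address, the table has to contain exactly 256 entries. A small sketch of how the size could be checked at program start, assuming a second label is placed directly after the data (the label names SketchTable and SketchTableEnd and the three-entry table are made up for the example):

```purebasic
DataSection
  SketchTable:
  Data.a $41, $42, $43            ; three byte entries, standing in for the real table
  SketchTableEnd:
EndDataSection

; The difference of the two label addresses is the table size in bytes.
Define TableSize = ?SketchTableEnd - ?SketchTable
Debug TableSize                   ; 3 for this sketch; a full table should give 256

If TableSize <> 3
  Debug "Table has an unexpected number of entries!"
EndIf
```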

Re: Convert all special chars in a text into regular letters

Posted: Tue Feb 20, 2018 12:24 am
by Andre
I just came across this older thread, as I now need the conversion function (lookup table above) again, but in an extended form that also converts chars above 255.

Just like this example, where the chars are the same as Chr(250) to Chr(382):
úûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž
They should be converted into plain Latin chars like "UaAaOoZzCc" etc.

This has something to do with Unicode, so the .Ascii type can't be used anymore.
But I just don't get it right... :?

Using the Chr() function would be possible, as it correctly supports all Unicode characters. But changing every char of many thousands of data strings would slow down a search function etc. a lot.
So I hope the Convertxxx() function above can be converted/extended to support special chars like the ones above too.
I just need a fast solution (using a lookup table or similar)....

Thank you for any help! :D

Re: Convert all special chars in a text into regular letters

Posted: Tue Feb 20, 2018 7:37 am
by davido
@Andre,

You could make yourself a Map, like this:

Code:

Global Ugh$ = "úûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
Global T$ = "uuuybyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiJjJjKkkLlLlLlLlLlNnNnNnnNnOoOoOoOoRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZz"


Procedure Ugh2T()
  ; Prints ready-to-paste source code to the debug window (each line once)
  Debug ~"Global Ugh$ = \"úûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž\""
  Debug ~"Global T$ = \"uuuybyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiJjJjKkkLlLlLlLlLlNnNnNnnNnOoOoOoOoRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZz\""
  Debug "NewMap Ugh2T$()"
  Debug "For m = 1 To Len(T$)"
  Debug "  Ugh2T$(Mid(Ugh$,m,1)) = Mid(T$,m,1)"
  Debug "Next m"
 
  
EndProcedure


; NewMap Ugh2T$()
; For m = 1 To Len(T$)
;   Ugh2T$(Mid(Ugh$,m,1)) = Mid(T$,m,1)
; Next m


Ugh2T()
Just copy the debug output into your code.
You would be advised to check the two strings. :)

Re: Convert all special chars in a text into regular letters

Posted: Wed Feb 21, 2018 11:49 pm
by Andre
Thank you, davido :D

"Use a map" was the important point. So I came up with the following rewritten example code, complete for converting special chars and also German umlauts:

Code:


; -----------------------------------------------------------------------------------------------------------
; PB forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=52782
; by André
Procedure Init_SpecialCharsMap()
  ; This function needs to be called once (after program start / before first use of the converter function)
  ; to build the map with all special chars (as key) and the corresponding plain Latin chars (as value).
  ;
  ; Chars of the range Chr(192) to Chr(382):
  ; (chars which can't be converted into a correct 1-char are marked with 'x')
  Protected Org$  = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
  Protected Conv$ = "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYxsaaaaaaxceeeeiiiixnoooooxouuuuyxyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIixxJjKkkLlLlLlLlLlNnNnNnnNnOoOoOoXxRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZz"
  Protected a, len = Len(Org$)
  Global NewMap SpecialChars$()
  For a = 1 To len
    SpecialChars$(Mid(Org$, a, 1)) = Mid(Conv$, a, 1)
  Next  
EndProcedure

Procedure.s ConvertSpecialChars(String$)
  ; This function uses the previously built map SpecialChars$() for converting special chars
  ; into their corresponding plain latin chars...
  Protected ResultString$, char$, a, Len = Len(String$)
  For a = 1 To Len
    char$ = Mid(String$, a, 1)
    If Asc(char$) > 191    ; we will only try to convert chars which are Chr(192) and higher
      If FindMapElement(SpecialChars$(), char$)
        char$ = SpecialChars$()   ; we got a converted char
      EndIf
    EndIf
    ResultString$ + char$    ; we add the next char (converted if needed) to the result string
  Next
  ProcedureReturn ResultString$
EndProcedure

Procedure.s ConvertUmlautsnSpecialChars(String.s, ConvertUmlauts2TwoChars = #True)
  If ConvertUmlauts2TwoChars = #True
    String = ReplaceString(String, "ä", "ae")
    String = ReplaceString(String, "ö", "oe")
    String = ReplaceString(String, "ü", "ue")
    String = ReplaceString(String, "Ä", "Ae")
    String = ReplaceString(String, "Ö", "Oe")
    String = ReplaceString(String, "Ü", "Ue")
    String = ReplaceString(String, "ß", "ss")
    ; String = ReplaceString(String, "....", "....")
  EndIf
  
  ; ... and now convert all other special chars too:
  String = ConvertSpecialChars(String)
  ProcedureReturn String
EndProcedure

;- Example:
EnableExplicit

Init_SpecialCharsMap()
Define a$

; At first 4 "real world" examples (city names, for which also converting umlauts into 2-chars is needed):
a$ = "Neumünster"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

a$ = "Štramberk"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

a$ = "Žďár"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

a$ = "Osová Bítýška"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

; Now the complete string with special chars (no umlauts converting needed):
a$ = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialChars(a$)
Debug output:
Org : Neumünster
conv: Neumuenster
Org : Štramberk
conv: Stramberk
Org : Žďár
conv: Zdar
Org : Osová Bítýška
conv: Osova Bityska
Org : ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž
conv: AAAAAAACEEEEIIIIDNOOOOOxOUUUUYxsaaaaaaxceeeeiiiixnoooooxouuuuyxyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIixxJjKkkLlLlLlLlLlNnNnNnnNnOoOoOoXxRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZz
So this seems to work perfectly... :mrgreen:

The only drawback may be that it uses pure string handling and map stuff, while other implementations (pointer, array, ASM, etc.) are probably faster. So any further ideas for (speed) improvements are welcome! (especially if you want to convert thousands of such names / text strings..., like in my case)
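
One possible direction, sketched here under assumptions: instead of building a new string with Mid() and string concatenation, a .Character pointer can overwrite each special char in place, which only works for 1:1 replacements. The map PlainChar() with its three entries and the procedure name ConvertWithMapInPlace() are hypothetical and not part of the code above:

```purebasic
EnableExplicit

; Hypothetical map: special char (key) -> character code of the plain char (value)
Global NewMap PlainChar.c()
PlainChar("Š") = 'S'
PlainChar("á") = 'a'
PlainChar("ý") = 'y'

Procedure ConvertWithMapInPlace(*Pos.Character)
  While *Pos\c                                   ; stop at the terminating null
    If *Pos\c > 191                              ; same threshold as in the code above
      If FindMapElement(PlainChar(), Chr(*Pos\c))
        *Pos\c = PlainChar()                     ; overwrite with the plain char
      EndIf
    EndIf
    *Pos + SizeOf(Character)
  Wend
EndProcedure

Define Name$ = "Štramberk"
ConvertWithMapInPlace(@Name$)
Debug Name$
```

The Chr() call still creates a temporary string for every char above 191, so this is only a partial step away from string handling.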

Re: Convert all special chars in a text into regular letters

Posted: Fri Feb 23, 2018 9:59 pm
by Andre
After a lot of trial and error I got a slightly faster version of the ConvertSpecialChars() function. :-)

The following code is the same as before, but extended with the rewritten ConvertSpecialChars_NEW() function and a little speed test... :wink:
I got as result 1,292 ms with the old function, and 1,047 ms with the new one.

Further speed improvements are welcome. I'm sure, some experts can do this with Peek & Poke or some ASM functions... :mrgreen:

Code:

; -----------------------------------------------------------------------------------------------------------
; PB forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=52782
; by André
Procedure Init_SpecialCharsMap()
  ; This function needs to be called once (after program start / before first use of the converter function)
  ; to build the map with all special chars (as key) and the corresponding plain Latin chars (as value).
  ;
  ; Chars of the range Chr(192) to Chr(382):
  ; (chars which can't be converted into a correct 1-char are marked with 'x')
  Protected Org$  = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
  Protected Conv$ = "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYxsaaaaaaxceeeeiiiixnoooooxouuuuyxyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIixxJjKkkLlLlLlLlLlNnNnNnnNnOoOoOoXxRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZz"
  Protected a, len = Len(Org$)
  Global NewMap SpecialChars$()
  For a = 1 To len
    SpecialChars$(Mid(Org$, a, 1)) = Mid(Conv$, a, 1)
  Next  
EndProcedure

Procedure.s ConvertSpecialChars(String$)
  ; This function uses the previously built map SpecialChars$() for converting special chars
  ; into their corresponding plain latin chars...
  Protected ResultString$, char$, a, Len = Len(String$)
  For a = 1 To Len
    char$ = Mid(String$, a, 1)
    If Asc(char$) > 191    ; we will only try to convert chars which are Chr(192) and higher
      If FindMapElement(SpecialChars$(), char$)
        char$ = SpecialChars$()   ; we got a converted char
      EndIf
    EndIf
    ResultString$ + char$    ; we add the next char (converted if needed) to the result string
  Next
  ProcedureReturn ResultString$
EndProcedure

Procedure.s ConvertSpecialChars_NEW(*String.String, Len)
  ; This function uses the previously built map SpecialChars$() for converting special chars
  ; into their corresponding plain latin chars...
  Protected char$, a, *WorkString.String
  *WorkString = @*String
  For a = 1 To Len
    char$ = Mid(*WorkString\s, a, 1)
    If Asc(char$) > 191    ; we will only try to convert chars which are Chr(192) and higher
      If FindMapElement(SpecialChars$(), char$)
        ; we got a converted char:
        ReplaceString(*WorkString\s, char$, SpecialChars$(), #PB_String_InPlace, a)
      EndIf
    EndIf
  Next
  ;Debug *WorkString\s
EndProcedure

Procedure.s ConvertUmlautsnSpecialChars(String.s, ConvertUmlauts2TwoChars = #True)
  If ConvertUmlauts2TwoChars = #True
    String = ReplaceString(String, "ä", "ae")
    String = ReplaceString(String, "ö", "oe")
    String = ReplaceString(String, "ü", "ue")
    String = ReplaceString(String, "Ä", "Ae")
    String = ReplaceString(String, "Ö", "Oe")
    String = ReplaceString(String, "Ü", "Ue")
    String = ReplaceString(String, "ß", "ss")
    ; String = ReplaceString(String, "....", "....")
  EndIf
  
  ; ... and now convert all other special chars too:
  String = ConvertSpecialChars(String)
  ProcedureReturn String
EndProcedure

;- Example:
EnableExplicit

Init_SpecialCharsMap()
Define a$, time, a

; At first 4 "real world" examples (city names, for which also converting umlauts into 2-chars is needed):
a$ = "Neumünster"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

a$ = "Štramberk"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

a$ = "Žďár"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

a$ = "Osová Bítýška"
Debug "Org : " + a$ 
Debug "conv: " + ConvertUmlautsnSpecialChars(a$)

; Now the complete string with special chars (no umlauts converting needed):
a$ = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialChars(a$)

a$ = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
Debug "Org : " + a$ 
ConvertSpecialChars_NEW(@a$, Len(a$))
Debug "conv: " + a$

; And now a little speed test with 10,000 loops:     [deactivate the debugger!]
; -----------------------------------------------------------------------
; Variant 1  (string-like function):
time = ElapsedMilliseconds()
For a = 1 To 10000
  a$ = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
  a$ = ConvertSpecialChars(a$)
Next
time = ElapsedMilliseconds() - time
MessageRequester("Variant 1", "Time needed for 10,000 loops with the 'string-like' ConvertSpecialChars() routine: " + Str(time) + " ms", #PB_MessageRequester_Info)

; Variant 2  (in-place replace of special chars):
time = ElapsedMilliseconds()
For a = 1 To 10000
  a$ = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
  ConvertSpecialChars_NEW(@a$, Len(a$))
Next
time = ElapsedMilliseconds() - time
MessageRequester("Variant 2", "Time needed for 10,000 loops with the 'in-place replace' ConvertSpecialChars() routine: " + Str(time) + " ms", #PB_MessageRequester_Info)

Re: Convert all special chars in a text into regular letters

Posted: Sat Feb 24, 2018 8:01 am
by wilbert
Here's another variation Andre.
The procedure accepting a string pointer (ConvertSpecialCharsPtr) is the fastest one.

Code:

Procedure Init_SpecialChars()
  
  Protected Org.s, Conv.s, *Org.Unicode, *Conv.Long, o.i, c.i
  
  Org  = "À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ "
  Conv = "A A A A AeA A C E E E E I I I I D N O O O O Oex O U U U UeY x ssa a a a aea x c e e e e i i i i x n o o o o oex o u u u uey x y A a A a A a C c C c C c C c D d D d E e E e E e E e E e G g G g "
  Org  + "Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž "
  Conv + "G g G g H h H h I i I i I i I i I i x x J j K k k L l L l L l L l L l N n N n N n n N n O o O o O o X x R r R r R r S s S s S s S s T t T t T t U u U u U u U u U u U u W w Y y Y Z z Z z Z z "

  Global Dim SpecialChars.l(65535)
  For c = 1 To 65535
    SpecialChars(c) = c
  Next
  
  *Org = @Org : *Conv = @Conv
  While *Org\u
    o = *Org\u  : *Org  + 4
    c = *Conv\l : *Conv + 4
    If c >> 16 = 32
      c & $ffff
    EndIf
    SpecialChars(o) = c
  Wend
  
EndProcedure

Init_SpecialChars()

Procedure.s ConvertSpecialChars(String.s)
  Protected Dim CharArray.l(Len(String))
  Protected *Org.Unicode = @String, *Conv.Long = @CharArray()
  While *Org\u
    *Conv\l = SpecialChars(*Org\u)
    If *Conv\l >> 16
      *Conv + 4
    Else
      *Conv + 2
    EndIf
    *Org + 2
  Wend
  ProcedureReturn PeekS(@CharArray())
EndProcedure

Procedure.s ConvertSpecialCharsPtr(*String)
  Protected Dim CharArray.l(MemoryStringLength(*String))
  Protected *Org.Unicode = *String, *Conv.Long = @CharArray()
  While *Org\u
    *Conv\l = SpecialChars(*Org\u)
    If *Conv\l >> 16
      *Conv + 4
    Else
      *Conv + 2
    EndIf
    *Org + 2
  Wend
  ProcedureReturn PeekS(@CharArray())
EndProcedure




a$ = "Neumünster"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialChars(a$)

a$ = "Štramberk"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialCharsPtr(@a$)

a$ = "Žďár"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialChars(a$)

a$ = "Osová Bítýška"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialCharsPtr(@a$)
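
The init loop above reads each replacement as a Long, i.e. two 16-bit characters at once, and keeps only the low word when the second character is a space. A small sketch of that packing, assuming a Unicode build where strings use 2 bytes per character and a little-endian CPU (the variable names are made up for the example):

```purebasic
Define Pair$ = "Ae"                 ; two-char replacement, e.g. for 'Ä'
Define Packed.l = PeekL(@Pair$)     ; reads both 16-bit chars as one Long

Debug Hex(Packed)                   ; high word holds 'e' ($65), low word holds 'A' ($41)
Debug Chr(Packed & $FFFF)           ; first char
Debug Chr(Packed >> 16)             ; second char

Define Single$ = "a "               ; one-char replacement padded with a space
Define Entry.l = PeekL(@Single$)
If Entry >> 16 = 32                 ; second char is only the padding space
  Entry & $FFFF                     ; keep just the first char, as in the init loop
EndIf
Debug Chr(Entry)
```

This is also why the conversion procedures can test `>> 16` to decide whether to advance the output pointer by 2 or 4 bytes.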

Re: Convert all special chars in a text into regular letters

Posted: Sat Feb 24, 2018 11:42 pm
by Andre
Thank you very much, wilbert! :D

Your new code, run in the same speed test as mine above, took only 59 ms and 19 ms (debugger off), so it is up to 50 times faster. Wow! :mrgreen:

As there are >100,000 name strings to convert, I wouldn't complain if ASM variants are faster... 8)
(I would need both 32 and 64 bit variants, as the project should be compiled for both processor platforms and also on Windows + MacOS).

Re: Convert all special chars in a text into regular letters

Posted: Sun Feb 25, 2018 7:25 am
by wilbert
I don't know if you noticed but the code I posted does the one character (á > a) and two character conversions (ä > ae) in one pass.
So at the moment it behaves like your ConvertUmlautsnSpecialChars procedure with the optional argument set to #True.
The basic code was already optimized a lot so there's not that much speed improvement when converting to asm.
But if every bit helps, here's the asm version of my previous code. :)

Code:

Procedure Init_SpecialChars()
  
  Protected Org.s, Conv.s, *Org.Unicode, *Conv.Long, o.i, c.i
  
  Org  = "À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ "
  Conv = "A A A A AeA A C E E E E I I I I D N O O O O Oex O U U U UeY x ssa a a a aea x c e e e e i i i i x n o o o o oex o u u u uey x y A a A a A a C c C c C c C c D d D d E e E e E e E e E e G g G g "
  Org  + "Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž "
  Conv + "G g G g H h H h I i I i I i I i I i x x J j K k k L l L l L l L l L l N n N n N n n N n O o O o O o X x R r R r R r S s S s S s S s T t T t T t U u U u U u U u U u U u W w Y y Y Z z Z z Z z "

  Global Dim SpecialChars.l(65535)
  For c = 1 To 65535
    SpecialChars(c) = c
  Next
  
  *Org = @Org : *Conv = @Conv
  While *Org\u
    o = *Org\u  : *Org  + 4
    c = *Conv\l : *Conv + 4
    If c >> 16 = 32
      c & $ffff
    EndIf
    SpecialChars(o) = c
  Wend
  
EndProcedure

Init_SpecialChars()

Macro M_ConvertSpecialChars(reg_a, reg_b, reg_c, reg_d)
  !mov reg_a, [p.p_ConversionTable]
  !mov reg_c, [p.p_String]
  !mov reg_d, [p.a_OutputBuffer]
  !push reg_b
  !jmp .l1
  !.l0:
  !mov [reg_d], bx
  !add reg_c, 2
  !add reg_d, 2
  !.l1:
  !movzx ebx, word [reg_c]
  !test ebx, ebx
  !jz .l2
  !mov ebx, [reg_a + reg_b*4]
  !test ebx, 0xffff0000
  !jz .l0
  !mov [reg_d], ebx
  !add reg_c, 2
  !add reg_d, 4
  !jmp .l1
  !.l2:
  !pop reg_b
EndMacro

Procedure.s ConvertSpecialChars(String.s)
  Protected Dim OutputBuffer.l(Len(String))
  Protected *String = @String, *ConversionTable = @SpecialChars()
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    M_ConvertSpecialChars(rax, rbx, rcx, rdx)
  CompilerElse
    M_ConvertSpecialChars(eax, ebx, ecx, edx)
  CompilerEndIf
  ProcedureReturn PeekS(@OutputBuffer())
EndProcedure

Procedure.s ConvertSpecialCharsPtr(*String)
  Protected Dim OutputBuffer.l(MemoryStringLength(*String))
  Protected *ConversionTable = @SpecialChars()
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    M_ConvertSpecialChars(rax, rbx, rcx, rdx)
  CompilerElse
    M_ConvertSpecialChars(eax, ebx, ecx, edx)
  CompilerEndIf
  ProcedureReturn PeekS(@OutputBuffer())
EndProcedure




a$ = "Neumünster"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialChars(a$)

a$ = "Štramberk"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialCharsPtr(@a$)

a$ = "Žďár"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialChars(a$)

a$ = "Osová Bítýška"
Debug "Org : " + a$ 
Debug "conv: " + ConvertSpecialCharsPtr(@a$)

It's also possible to go for a more universal approach where you can use different conversion tables.

Code:

;- >> Character conversion procedures <<

Macro M_ConvertChars(reg_a, reg_b, reg_c, reg_d)
  !mov reg_a, [p.p_ConversionTable]
  !mov reg_c, [p.p_String]
  !mov reg_d, [p.a_OutputBuffer]
  !push reg_b
  !jmp .l1
  !.l0:
  !mov [reg_d], bx
  !add reg_c, 2
  !add reg_d, 2
  !.l1:
  !movzx ebx, word [reg_c]
  !test ebx, ebx
  !jz .l2
  !mov ebx, [reg_a + reg_b*4]
  !test ebx, 0xffff0000
  !jz .l0
  !mov [reg_d], ebx
  !add reg_c, 2
  !add reg_d, 4
  !jmp .l1
  !.l2:
  !pop reg_b
EndMacro

Procedure.s ConvertChars(String.s, *ConversionTable)
  Protected Dim OutputBuffer.l(Len(String))
  Protected *String = @String
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    M_ConvertChars(rax, rbx, rcx, rdx)
  CompilerElse
    M_ConvertChars(eax, ebx, ecx, edx)
  CompilerEndIf
  ProcedureReturn PeekS(@OutputBuffer())
EndProcedure

Procedure.s ConvertCharsPtr(*String, *ConversionTable)
  Protected Dim OutputBuffer.l(MemoryStringLength(*String))
  CompilerIf #PB_Compiler_Processor = #PB_Processor_x64
    M_ConvertChars(rax, rbx, rcx, rdx)
  CompilerElse
    M_ConvertChars(eax, ebx, ecx, edx)
  CompilerEndIf
  ProcedureReturn PeekS(@OutputBuffer())
EndProcedure


;- >> Init character conversion tables <<

Procedure Init_ConversionTables()
  
  Protected Org.s, Conv.s, *Org.Unicode, *Conv.Long, o.i, c.i
  
  Global Dim CT_SpecialChars.l(65535)
  Global Dim CT_Lowercase.l(65535)
  Global Dim CT_Uppercase.l(65535)
  
  For c = 1 To 65535
    CT_SpecialChars(c) = c
    CT_Lowercase(c) = c
    CT_Uppercase(c) = c
  Next
  
  ; Lowercase
  
  Org  = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z "
  Conv = "a b c d e f g h i j k l m n o p q r s t u v w x y z "
  
  *Org = @Org : *Conv = @Conv
  While *Org\u
    o = *Org\u  : *Org  + 4
    c = *Conv\l : *Conv + 4
    If c >> 16 = 32
      c & $ffff
    EndIf
    CT_Lowercase(o) = c
  Wend
  
  ; Uppercase
  
  Org  = "a b c d e f g h i j k l m n o p q r s t u v w x y z "
  Conv = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z "
  
  *Org = @Org : *Conv = @Conv
  While *Org\u
    o = *Org\u  : *Org  + 4
    c = *Conv\l : *Conv + 4
    If c >> 16 = 32
      c & $ffff
    EndIf
    CT_Uppercase(o) = c
  Wend
  
  ; Special chars
  
  Org  = "À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ "
  Conv = "A A A A AeA A C E E E E I I I I D N O O O O Oex O U U U UeY x ssa a a a aea x c e e e e i i i i x n o o o o oex o u u u uey x y A a A a A a C c C c C c C c D d D d E e E e E e E e E e G g G g "
  Org  + "Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı Ĳ ĳ Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ŉ Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž "
  Conv + "G g G g H h H h I i I i I i I i I i x x J j K k k L l L l L l L l L l N n N n N n n N n O o O o O o X x R r R r R r S s S s S s S s T t T t T t U u U u U u U u U u U u W w Y y Y Z z Z z Z z "
  
  *Org = @Org : *Conv = @Conv
  While *Org\u
    o = *Org\u  : *Org  + 4
    c = *Conv\l : *Conv + 4
    If c >> 16 = 32
      c & $ffff
    EndIf
    CT_SpecialChars(o) = c
  Wend
  
EndProcedure

Init_ConversionTables()




;- >> Test code <<

a$ = "Mixed Case"
Debug "Org : " + a$ 
Debug "conv: " + ConvertChars(a$, @CT_Lowercase())
Debug "conv: " + ConvertChars(a$, @CT_Uppercase())

a$ = "Neumünster"
Debug "Org : " + a$ 
Debug "conv: " + ConvertChars(a$, @CT_SpecialChars())

a$ = "Štramberk"
Debug "Org : " + a$ 
Debug "conv: " + ConvertCharsPtr(@a$, @CT_SpecialChars())

a$ = "Žďár"
Debug "Org : " + a$ 
Debug "conv: " + ConvertChars(a$, @CT_SpecialChars())

a$ = "Osová Bítýška"
Debug "Org : " + a$ 
Debug "conv: " + ConvertCharsPtr(@a$, @CT_SpecialChars())
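
For targets where the inline x86 assembly is not wanted, the table-based procedures could presumably also be written without it. The following is a sketch that combines the plain PureBasic loop from the earlier post with the *ConversionTable parameter; the name ConvertCharsPB() and the small demo table CT_Demo() are assumptions, and it expects a Unicode build:

```purebasic
Procedure.s ConvertCharsPB(*String, *ConversionTable)
  ; One Long (room for two 16-bit chars) per input char, plus the terminator
  Protected Dim OutputBuffer.l(MemoryStringLength(*String))
  Protected *Org.Unicode = *String
  Protected *Out.Long = @OutputBuffer()
  Protected *Entry.Long
  While *Org\u
    *Entry = *ConversionTable + *Org\u * SizeOf(Long)   ; table entry for this char
    *Out\l = *Entry\l
    If *Out\l >> 16
      *Out + 4                                          ; entry holds two chars
    Else
      *Out + 2                                          ; entry holds one char
    EndIf
    *Org + 2
  Wend
  ProcedureReturn PeekS(@OutputBuffer())
EndProcedure

; Demo table (hypothetical): identity mapping plus one two-char entry
Dim CT_Demo.l(65535)
Define i
For i = 1 To 65535
  CT_Demo(i) = i
Next
CT_Demo(Asc("ä")) = Asc("a") | (Asc("e") << 16)

Define Town$ = "Bären"
Debug ConvertCharsPB(@Town$, @CT_Demo())
```

It takes the same arguments as ConvertCharsPtr() above, so the CT_ tables built by Init_ConversionTables() could be passed in the same way.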

Re: Convert all special chars in a text into regular letters

Posted: Mon Feb 26, 2018 10:11 pm
by Andre
wilbert wrote: I don't know if you noticed but the code I posted does the one character (á > a) and two character conversions (ä > ae) in one pass.
So at the moment it behaves like your ConvertUmlautsnSpecialChars procedure with the optional argument set to #True.
The basic code was already optimized a lot so there's not that much speed improvement when converting to asm.
Thanks again, wilbert! :D

The 'one pass' conversion for one and two characters should be fine, as both are needed for correct sorting of large linked lists with names and similar. And for search routines it should also be OK, if both (search term and data to search in) are simply converted before searching.

I just need to adapt my project code, because the converted string is now returned instead of the previously used 'in-place' conversion. So this will take a bit (especially as I'm currently busy with another task). But I will report here how it works when it's done.