Convert/normalize special characters like "ê" to regular "e"

Posted: Thu Nov 12, 2015 3:31 am
by Keya
Hello! Does anyone have a procedure to convert special characters such as "èéêë" to their regular counterpart "e"? Just a simple single-byte (extended ASCII) one, that is.
I'm happy to make one myself of course (seems like it'd be fun to try in asm), but I just had a feeling it's already been done, and my searches were bleh!

Code: Select all

For i = 181 To 255                     ; 255 is the last single-byte character code
  s.s + Str(i) + ":" + Chr(i) + #TAB$
  If i % 10 = 0 : s + #CRLF$ : EndIf   ; new row after every multiple of 10
Next i
Debug s

;... =
; 181:µ	182:¶		183:·		184:¸		185:¹		186:º		187:»		188:¼		189:½		190:¾	
; 191:¿	192:À		193:Á		194:Â		195:Ã		196:Ä		197:Å		198:Æ		199:Ç		200:È	
; 201:É	202:Ê		203:Ë		204:Ì		205:Í		206:Î		207:Ï		208:Ð		209:Ñ		210:Ò	
; 211:Ó	212:Ô		213:Õ		214:Ö		215:×		216:Ø		217:Ù		218:Ú		219:Û		220:Ü	
; 221:Ý	222:Þ		223:ß		224:à		225:á		226:â		227:ã		228:ä		229:å		230:æ	
; 231:ç	232:è		233:é		234:ê		235:ë		236:ì		237:í		238:î		239:ï		240:ð	
; 241:ñ	242:ò		243:ó		244:ô		245:õ		246:ö		247:÷		248:ø		249:ù		250:ú	
; 251:û	252:ü		253:ý		254:þ		255:ÿ	

;My interpretation:
;181=u
;192-197=A
;198=(AE?)
;199=C
;200-203=E
;204-207=I
;208=D
;209=N
;210-214=O
;215=X
;216=O (or Q?)
;217-220=U
;221=Y
;222=(b?)
;223=B
;224-229=a
;230=(ae?)
;231=c
;232-235=e
;236-239=i
;240=(o?)
;241=n
;242-246=o
;249-252=u
;253=y
;254=(b?)
;255=y
I'm not too keen on treating "Æ" -> "AE" (2 bytes) though, as I can't replace in place then! It's nearly the only exception ("æ" is the other), so perhaps it should just -> "A"; it does come immediately after the run of six other special A's, after all.
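
A minimal sketch of the trade-off described above, assuming in-place replacement is not required: building a new string lets a single character expand to two. The names Expand() and NormalizeExpanding(), and the handful of codes mapped here, are illustrative assumptions and not code from this thread.

```purebasic
; Hypothetical helper: returns the ASCII replacement for one character code.
; Only a few codes are mapped, purely for illustration.
Procedure.s Expand(code)
  Select code
    Case 198        : ProcedureReturn "AE"   ; Æ expands to two characters
    Case 230        : ProcedureReturn "ae"   ; æ expands to two characters
    Case 200 To 203 : ProcedureReturn "E"    ; È É Ê Ë
    Case 232 To 235 : ProcedureReturn "e"    ; è é ê ë
  EndSelect
  ProcedureReturn Chr(code)                  ; everything else passes through
EndProcedure

; Builds a new string, so the result may be longer than the input.
Procedure.s NormalizeExpanding(Text$)
  Protected Result$, i
  For i = 1 To Len(Text$)
    Result$ + Expand(Asc(Mid(Text$, i, 1)))
  Next
  ProcedureReturn Result$
EndProcedure

Debug NormalizeExpanding(Chr(198) + "on cr" + Chr(234) + "me")   ; AEon creme
```

The cost is a fresh string per call, which is why a fixed one-to-one mapping is the better fit when thousands of strings must be converted quickly.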

{update} oh nice, i thought of searching for "Case 217" site:purebasic.fr, as i figured it might be part of the code, and sure enough came up with a valid (but just the one!) French thread :)
http://www.purebasic.fr/french/viewtopi ... =1&t=11888
None look particularly efficient though in regards to doing thousands of strings, so i still might try my luck at an asm version :)

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 4:40 am
by glomph

Code: Select all

Procedure.s convert(a$)
  a$=ReplaceString(a$,Chr(146),"'")
  a$=ReplaceString(a$,Chr(96),"'")
  a$=ReplaceString(a$,Chr(145),"'")
  a$=ReplaceString(a$,Chr(154),"s")   ; š is lowercase in Windows-1252
  a$=ReplaceString(a$,Chr(158),"z")   ; ž is lowercase in Windows-1252
  a$=ReplaceString(a$,Chr(156),"Ö")
  
  a$=ReplaceString(a$,Chr(192),"A")
  a$=ReplaceString(a$,Chr(193),"A")
  a$=ReplaceString(a$,Chr(194),"A")
  a$=ReplaceString(a$,Chr(195),"A")
  a$=ReplaceString(a$,Chr(197),"A")
  a$=ReplaceString(a$,Chr(198),"Ae")
  a$=ReplaceString(a$,Chr(199),"C")
  
  a$=ReplaceString(a$,Chr(200),"E")
  a$=ReplaceString(a$,Chr(201),"E")
  a$=ReplaceString(a$,Chr(202),"E")
  a$=ReplaceString(a$,Chr(203),"E")
  a$=ReplaceString(a$,Chr(204),"I")
  a$=ReplaceString(a$,Chr(205),"I")
  a$=ReplaceString(a$,Chr(206),"I")
  a$=ReplaceString(a$,Chr(207),"I")
  a$=ReplaceString(a$,Chr(208),"Dh")
  a$=ReplaceString(a$,Chr(209),"N")
  
  a$=ReplaceString(a$,Chr(210),"O")
  a$=ReplaceString(a$,Chr(211),"O")
  a$=ReplaceString(a$,Chr(212),"O")
  a$=ReplaceString(a$,Chr(213),"O")
  a$=ReplaceString(a$,Chr(216),"Ö")
  a$=ReplaceString(a$,Chr(217),"U")
  a$=ReplaceString(a$,Chr(218),"U")
  a$=ReplaceString(a$,Chr(219),"U")
  
  a$=ReplaceString(a$,Chr(221),"Y")
  a$=ReplaceString(a$,Chr(222),"Th")  ; Þ (capital thorn)
  a$=ReplaceString(a$,Chr(224),"a")
  a$=ReplaceString(a$,Chr(225),"a")
  a$=ReplaceString(a$,Chr(226),"a")
  a$=ReplaceString(a$,Chr(227),"a")
  a$=ReplaceString(a$,Chr(229),"a")
  
  a$=ReplaceString(a$,Chr(230),"ae")
  a$=ReplaceString(a$,Chr(231),"c")   ; ç
  a$=ReplaceString(a$,Chr(232),"e")
  a$=ReplaceString(a$,Chr(233),"e")
  a$=ReplaceString(a$,Chr(234),"e")
  a$=ReplaceString(a$,Chr(235),"e")
  a$=ReplaceString(a$,Chr(236),"i")
  a$=ReplaceString(a$,Chr(237),"i")
  a$=ReplaceString(a$,Chr(238),"i")
  a$=ReplaceString(a$,Chr(239),"i")
  
  a$=ReplaceString(a$,Chr(240),"o")
  a$=ReplaceString(a$,Chr(241),"n")   ; ñ
  a$=ReplaceString(a$,Chr(242),"o")
  a$=ReplaceString(a$,Chr(243),"o")
  a$=ReplaceString(a$,Chr(244),"o")
  a$=ReplaceString(a$,Chr(245),"o")
  
  a$=ReplaceString(a$,Chr(248),"ö")
  a$=ReplaceString(a$,Chr(249),"u")
  a$=ReplaceString(a$,Chr(250),"u")
  a$=ReplaceString(a$,Chr(251),"u")
  a$=ReplaceString(a$,Chr(254),"th")  ; þ (small thorn)
  a$=ReplaceString(a$,Chr(34),"'")
  a$=ReplaceString(a$,"("," ")

ProcedureReturn a$
EndProcedure
German ÄÖÜäöüß not done.
No warranty

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 6:42 am
by wilbert
Keya wrote:None look particularly efficient though in regards to doing thousands of strings, so i still might try my luck at an asm version :)
The fastest method is probably a lookup table.

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 6:58 am
by Keya
Thankyou very much glomph! :)
wilbert wrote:
Keya wrote:None look particularly efficient though in regards to doing thousands of strings, so i still might try my luck at an asm version :)
The fastest method is probably a lookup table.
By lookup table do you mean the XLATB instruction? I only used that for the first time about a month ago, so I'd like to try it again soon!
But my first try is the humble cmp, jmp, mov, x 1000 heehee :)
I'll try an XLATB version after that! It will be interesting to compare speeds,
but I think I might have to make "db" data for all 256 chars? Not that that's an issue. I wish all asm tasks were this easy lol
I will post my code for your personal amusement shortly :D

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 7:17 am
by Keya
Output of chars 180-255, where all the magic happens :)

Code: Select all

Before: ´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Output: ´u¶·¸¹º»¼½¾¿AAAAAAACEEEEIIIIDNOOOOOxOUUUUYbBaaaaaaaceeeeiiiionooooo÷ouuuuyby
It DOESN'T touch these 'extra special' characters however: ƒ†ŠŒŽšœžŸ¥§
The reason is that they're all below 181, so I was hoping to save execution time by skipping that big chunk. My next version, using a lookup table, will translate them all.

Tested on 5.40 LTS Win32 and Mac OSX 64. wilbert, I used your sneaky register macro for the hybrid support; really nice trick, makes life easy, thank you :) :)

I also divided the 180-255 range into halves to hopefully ~halve the number of cmp/jmp's on average, but I don't really know much about optimization (if that isn't obvious!) :)

Code: Select all

CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
  Macro rax : eax : EndMacro   ;thanks wilbert! "Sometimes the simple things in life..."
CompilerEndIf
  
  
Procedure NormalizeSpecialChars(*pbytes)   ;ptr to null-terminated string
  EnableASM        ;only uses eax and edx, no register preservation required
  mov rax, *pbytes
  
  !nextbyte:
     mov dl, [rax]   
   ! cmp dl, 181
   ! jae testspecialchar
   ! cmp dl, 0
   ! jz endproc
   CompilerIf #PB_Compiler_Unicode = 1
     add rax, 2
   CompilerElse
     inc rax
   CompilerEndIf
   ! jmp nextbyte
   
  !testspecialchar:
   ! cmp dl, 221
   ! jae Try2ndRange
   
  !Try1stRange:  
   
   ! cmp dl, 181
   ! jne Try192
   ! mov dl, 'u'
   ! jmp savenext
   
   !Try192:
   ! cmp dl, 192
   ! jnae endnextbyte
   ! cmp dl, 197
   ! ja Try198
   ! mov dl, 'A'
   ! jmp savenext
   
   !Try198:      
   ! cmp dl, 198
   ! jne Try199
   ! mov dl, 'A' 
   ! jmp savenext
   
   !Try199:
   ! cmp dl, 199
   ! jne Try200
   ! mov dl, 'C'
   ! jmp savenext
   
   !Try200:
   ! cmp dl, 204
   ! jae Try204
   ! mov dl, 'E'
   ! jmp savenext
   
   !Try204:
   ! cmp dl, 208
   ! jae Try208
   ! mov dl, 'I'
   ! jmp savenext
   
   !Try208:
   ! ja Try209
   ! mov dl, 'D'
   ! jmp savenext
   
   !Try209:
   ! cmp dl, 209
   ! jne Try210
   ! mov dl, 'N'
   ! jmp savenext
      
   !Try210:
   ! cmp dl, 215
   ! jae Try215
   ! mov dl, 'O'
   ! jmp savenext   
   
   !Try215:
   ! ja Try216
   ! mov dl, 'x'
   ! jmp savenext   
      
   !Try216:
   ! cmp dl, 216
   ! ja Try217
   ! mov dl, 'O'
   ! jmp savenext   
   
   !Try217: 
   ! mov dl, 'U'
   ! jmp savenext
   
   
  !Try2ndRange: 
   ! cmp dl, 221
   ! jne Try222
   ! mov dl, 'Y'
   ! jmp savenext
   
   !Try222:
   ! cmp dl, 222
   ! jne Try223
   ! mov dl, 'b'
   ! jmp savenext
   
   !Try223:
   ! cmp dl, 223
   ! jne Try224
   ! mov dl, 'B'
   ! jmp savenext
   
   !Try224:
   ! cmp dl, 230
   ! jae Try230
   ! mov dl, 'a'
   ! jmp savenext
   
   !Try230: 
   ! ja Try231
   ! mov dl, 'a' 
   ! jmp savenext
      
   !Try231:
   ! cmp dl, 232
   ! jae Try232
   ! mov dl, 'c'
   ! jmp savenext
   
   !Try232:
   ! cmp dl, 236
   ! jae Try236
   ! mov dl, 'e'
   ! jmp savenext
   
   !Try236:
   ! cmp dl, 240
   ! jae Try240
   ! mov dl, 'i'
   ! jmp savenext
   
   !Try240:
   ! ja Try241
   ! mov dl, 'o'
   ! jmp savenext
   
   !Try241:
   ! cmp dl, 241
   ! jne Try242
   ! mov dl, 'n'
   ! jmp savenext
   
   !Try242:
   ! cmp dl, 246
   ! jnbe Try248
   ! mov dl, 'o'
   ! jmp savenext
   
   !Try248:
   ! cmp dl, 248
   ! jne Try249
   ! mov dl, 'o'
   ! jmp savenext

   !Try249:
   ! cmp dl, 249
   ! jb endnextbyte
   ! cmp dl, 253
   ! jae Try253
   ! mov dl, 'u'
   ! jmp savenext
   
   !Try253:
   ! ja Try254
   ! mov dl, 'y'
   ! jmp savenext
   
   !Try254:
   ! cmp dl, 254
   ! ja Its255
   ! mov dl, 'b'
   ! jmp savenext
   !Its255:
   ! mov dl, 'y'
   
  !savenext:
     mov [rax], dl
  !endnextbyte:
   CompilerIf #PB_Compiler_Unicode = 1
     add rax, 2
   CompilerElse
     inc rax
   CompilerEndIf  
   ! jmp nextbyte
  !endproc:
  DisableASM
EndProcedure


For i = 180 To 255: sTest.s + Chr(i): Next i

Debug("Before: " + sTest)
NormalizeSpecialChars(@sTest)
Debug("Output: " + sTest)
btw is the above called a "jump table"?

I don't know if it's slow or fast; I look forward to trying some speed tests later.
OK, I'll try the XLATB lookup table version now :)

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 8:41 am
by infratec
Hi,

that's a lookup table: :wink:

Code: Select all

Procedure.a NormalizeASCII(AsciiByte.a)
  
  Protected Result.a
  
  DataSection
    NormStart:
    Data.a $B4, $B5, $B6, $B7, $B8, $B9, $BA, $BB, $BC, $BD, $BE, $BF, $41, $41, $41
    Data.a $41, $41, $41, $41, $43, $45, $45, $45, $45, $49, $49, $49, $49, $44, $4E
    Data.a $4F, $4F, $4F, $4F, $4F, $78, $4F, $55, $55, $55, $55, $59, $DE, $73, $61
    Data.a $61, $61, $61, $61, $61, $61, $63, $65, $65, $65, $65, $69, $69, $69, $69
    Data.a $F0, $6E, $6F, $6F, $6F, $6F, $6F, $F7, $6F, $75, $75, $75, $75, $79, $FE
    Data.a $79
  EndDataSection
  
  If AsciiByte > 179
    Result = PeekA(?NormStart + AsciiByte - 180)
  Else
    Result = AsciiByte
  EndIf
  
  ProcedureReturn Result
  
EndProcedure




Text$ = "´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
*Buffer = AllocateMemory(StringByteLength(Text$, #PB_Ascii))
If *Buffer
  PokeS(*Buffer, Text$, -1, #PB_Ascii|#PB_String_NoZero)
  
  For i = 0 To Len(Text$) - 1
    ;Debug Str(PeekA(*Buffer + i)) + ": " + Chr(NormalizeASCII(PeekA(*Buffer + i)))
    Norm$ + Chr(NormalizeASCII(PeekA(*Buffer + i)))
  Next i
  
  Debug Text$
  Debug Norm$
  
  FreeMemory(*Buffer)
EndIf
Now you can implement it in asm :mrgreen:

Bernd

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 9:02 am
by Keya
thanks infratec!! :)
I just noticed yours when posting mine.

Hmm, this translation table makes my jump-based version look a little silly, lol. Good fun though!

Code: Select all

CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
  Macro rax : eax : EndMacro   ;thanks wilbert!
  Macro rbx : ebx : EndMacro   
  Macro rdx : edx : EndMacro   
CompilerEndIf
  

Procedure NormalizeSpecialChars(*pbytes, *ptable)   ;ptr to null-terminated string
  EnableASM                 ;uses *bx (needs preserving), and *ax, *dx
  mov rax, *ptable  
  mov rdx, *pbytes
  push rbx
  mov rbx, rax
  !nextbyte:
    mov al, [rdx]
  ! cmp al, 0
  ! je endproc
  ! xlatb
  mov [rdx], al
  CompilerIf #PB_Compiler_Unicode = 1
    add rdx, 2
  CompilerElse
    inc rdx
  CompilerEndIf
  ! jmp nextbyte  
  !endproc:  
  pop rbx
  DisableASM
EndProcedure

For i = 33 To 255: sTest.s + Chr(i): Next i

Debug("Before: " + sTest)
NormalizeSpecialChars(@sTest, ?NormalizedChars)
Debug("After : " + sTest)
End


DataSection
 NormalizedChars:
 !db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
 !db ' !"#$%&',39,'()*+,-./0123456789:',59,'<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~' 
 !db 127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160
 !db 'i¢£¤¥¦§¨©ª«¬­®¯°±²³´u¶·¸¹º»¼½¾¿AAAAAAACEEEEIIIIDNOOOOOxOUUUUYbBaaaaaaaceeeeiiiionooooo÷ouuuuyby'
EndDataSection
I will try some speed tests later but my mind needs a break for now heehee, whew!!! :)

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 9:53 am
by wilbert
If you are going to support unicode, you might want to use a lookup table with 512 entries.
There's quite a few characters in range 256 - 511 that could be converted as well.
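
A rough sketch of what a 512-entry table could look like, assuming a Unicode (UCS-2) build where each character is compared as a whole code unit rather than by its low byte. The names Fold, InitFold() and FoldInPlace(), and the few Latin Extended-A mappings shown, are assumptions for illustration only; a real table would fill in many more entries.

```purebasic
Global Dim Fold.u(511)            ; one entry per code point 0..511

Procedure InitFold()
  Protected i
  For i = 0 To 511   : Fold(i) = i   : Next   ; identity by default
  For i = 200 To 203 : Fold(i) = 'E' : Next   ; È É Ê Ë
  For i = 232 To 235 : Fold(i) = 'e' : Next   ; è é ê ë
  Fold($100) = 'A' : Fold($101) = 'a'         ; Ā ā
  Fold($141) = 'L' : Fold($142) = 'l'         ; Ł ł
  Fold($160) = 'S' : Fold($161) = 's'         ; Š š
  Fold($17D) = 'Z' : Fold($17E) = 'z'         ; Ž ž
EndProcedure

; Converts a null-terminated string in place, one whole character at a time.
Procedure FoldInPlace(*c.Character)
  If *c = 0 : ProcedureReturn : EndIf         ; empty strings can have a null address
  While *c\c
    If *c\c < 512                             ; codes outside the table are left alone
      *c\c = Fold(*c\c)
    EndIf
    *c + SizeOf(Character)
  Wend
EndProcedure

InitFold()
Test$ = Chr($141) + "odz " + Chr(234)
FoldInPlace(@Test$)
Debug Test$
```

Testing the full character value also avoids touching code points above the table whose low byte happens to fall in the 181-255 range.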

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 9:59 am
by Keya
i updated the code for both my versions to support unicode, which was simply changing "inc rax" to:

Code: Select all

   CompilerIf #PB_Compiler_Unicode = 1
     add rax, 2
   CompilerElse
     inc rax
   CompilerEndIf
seems to be working fine! ?
I only want to convert the ones from the normal Ascii range, so nothing 256+ :)

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 2:01 pm
by Keya
Here are some speed tests, results in milliseconds. The input is just a length-255 string featuring all chars from 1-255, and the conversion function is called on it half a million (500,000) times.

Code: Select all

1=glomph's convert()
2=Fc_TraitementChaine() by boddhi & falsam
3=infratec's NormalizeASCII()
4=Keya's Jumps
5=Keya's XLatb

Win-32
1 = 19980  (4976 when i updated its use of ReplaceString() to use #PB_String_InPlace!)
2 = 186750     ;<-- anomaly
3 = 46106
4 = 441
5 = 385

Win-64
1 = 18393
2 = 24343
3 = 45047
4 = 438
5 = 369

Linux-32
1 = 11281
2 = 22504
3 = 20521
4 = 333
5 = 275

Linux-64
1 = 11491
2 = 20323
3 = 18835
4 = 336
5 = 290

OSX-64
1 = 16463
2 = 15866
3 = 35567
4 = 436
5 = 365
There seems to be perhaps a compiler issue uncovered in the Fc_TraitementChaine() function on 32-bit Windows? I ran it several times to confirm it's always dramatically slower on Win32 than on every other OS. But anyway that's not a problem for me, and I'm happy with the results. Thank you everyone for code and feedback! :)
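
For anyone wanting to repeat the measurement, here is a minimal sketch of a timing loop of the kind described in this post. It is an assumption about the setup, not the harness actually used: Candidate() is a hypothetical stand-in for the routine under test, and copying the source string before each call is a choice made here so that every call sees unconverted input (the copy time is included in the result).

```purebasic
#Repeats = 500000

; Hypothetical stand-in: call the conversion routine being measured from here.
Procedure Candidate(*text)
EndProcedure

Define Source$, Work$, i, t
For i = 1 To 255 : Source$ + Chr(i) : Next   ; all chars 1-255, as in the test above

t = ElapsedMilliseconds()
For i = 1 To #Repeats
  Work$ = Source$          ; fresh copy, so each call starts from unconverted text
  Candidate(@Work$)
Next
t = ElapsedMilliseconds() - t

; Run with the debugger disabled, otherwise the timings are distorted.
MessageRequester("Timing", Str(t) + " ms")
```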

Re: Convert/normalize special characters like "ê" to regular

Posted: Thu Nov 12, 2015 5:53 pm
by IdeasVacuum
This is a fascinating post, interesting solutions. Can I ask why you would need to do this?

Re: Convert/normalize special characters like "ê" to regular

Posted: Fri Nov 13, 2015 1:54 am
by Demivec
Depending on your uses (and the associated data) you may need to also deal with more than one code page's character mappings.

Re: Convert/normalize special characters like "ê" to regular

Posted: Fri Nov 13, 2015 4:33 am
by Keya
IdeasVacuum wrote:This is a fascinating post, interesting solutions. Can I ask why you would need to do this?
Hi IdeasVacuum, just to extend on what Demivec said, in my case I need to detect by comparison (for a simple example) words like "creme" and "crême" as being the same (and I've got lots of words to compare, so it needs to be fairly efficient). There are other approaches I can think of, such as string distance comparison, but then "crime" would be just as 'different/similar' to "creme" as "crême" is (so that's not really an option), so in-place character 'normalizing' seemed the most efficient approach for my needs. :)

In my case i've also gone a step further (not shown in above code - actually the code is identical, just slightly different lookup table where i replaced "abc.." with "ABC.."), so that in the same single pass it converts all characters to Uppercase also for case-insensitive comparison (as i need to confirm "Creme" and "creme" are also the same word), and the lookup table makes that really easy to convert both "ê" and "e" to "E". So i get all characters case-desensitized + normalized in one pass, two birds one quick stone :)
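
A small sketch of the single-pass idea described above, assuming a 256-entry table whose entries already hold the uppercase, accent-free result. The names Key, InitKey() and CompareKey() are hypothetical, and only the E family is filled in; the real table in this thread covers the whole 181-255 range.

```purebasic
Global Dim Key.a(255)

Procedure InitKey()
  Protected i
  For i = 0 To 255     : Key(i) = i      : Next   ; identity by default
  For i = 'a' To 'z'   : Key(i) = i - 32 : Next   ; a..z -> A..Z
  For i = 200 To 203   : Key(i) = 'E'    : Next   ; È É Ê Ë
  For i = 232 To 235   : Key(i) = 'E'    : Next   ; è é ê ë go straight to uppercase
EndProcedure

; Returns a case-folded, accent-folded copy that can be compared with "=".
Procedure.s CompareKey(Text$)
  Protected *c.Character
  If Text$ = "" : ProcedureReturn "" : EndIf
  *c = @Text$                       ; Text$ is a local copy, so editing it is safe
  While *c\c
    If *c\c < 256 : *c\c = Key(*c\c) : EndIf
    *c + SizeOf(Character)
  Wend
  ProcedureReturn Text$
EndProcedure

InitKey()
If CompareKey("Creme") = CompareKey("cr" + Chr(234) + "me")
  Debug "same word"
EndIf
```

Because both folds live in one table, the per-character work is still a single lookup.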

Re: Convert/normalize special characters like "ê" to regular

Posted: Fri Nov 13, 2015 7:26 am
by wilbert
Keya wrote:in my case i need to detect by comparison (for simple example) words like "creme" and "crême" as being the same (and ive got lots of words to compare so it needs to be fairly efficient).
Your approach is probably faster but sometimes the OS can help.
OSX for example is capable of doing a diacritic insensitive compare.
It looks like Windows might also be capable of this with CompareStringEx.