Convert/normalize special characters like "ê" to regular "e"

Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Post by Keya »

Hello! Does anyone have a procedure to convert special characters such as "èéêë" to their regular counterpart "e"? Just a simple ASCII one, that is.
I'm happy to make one myself of course (seems like it'd be fun to try in asm), but I had a feeling it's already been done, and my searches came up empty!

Code: Select all

For i = 181 To 255   ; Chr(256) is outside the single-byte range
  s.s + Str(i) + ":" + Chr(i) + #TAB$
  If i % 10 = 0 : s + #CRLF$ : EndIf   ; line break after every tenth code
Next i
Debug s

;... =
; 181:µ	182:¶		183:·		184:¸		185:¹		186:º		187:»		188:¼		189:½		190:¾	
; 191:¿	192:À		193:Á		194:Â		195:Ã		196:Ä		197:Å		198:Æ		199:Ç		200:È	
; 201:É	202:Ê		203:Ë		204:Ì		205:Í		206:Î		207:Ï		208:Ð		209:Ñ		210:Ò	
; 211:Ó	212:Ô		213:Õ		214:Ö		215:×		216:Ø		217:Ù		218:Ú		219:Û		220:Ü	
; 221:Ý	222:Þ		223:ß		224:à		225:á		226:â		227:ã		228:ä		229:å		230:æ	
; 231:ç	232:è		233:é		234:ê		235:ë		236:ì		237:í		238:î		239:ï		240:ð	
; 241:ñ	242:ò		243:ó		244:ô		245:õ		246:ö		247:÷		248:ø		249:ù		250:ú	
; 251:û	252:ü		253:ý		254:þ		255:ÿ	

;My interpretation:
;181=u
;192-197=A
;198=(AE?)
;199=C
;200-203=E
;204-207=I
;208=D
;209=N
;210-214=O
;215=X
;216=O (or Q?)
;217-220=U
;221=Y
;222=(b?)
;223=B
;224-229=a
;230=(ae?)
;231=c
;232-235=e
;236-239=i
;240=(o?)
;241=n
;242-246=o
;249-252=u
;253=y
;254=(b?)
;255=y
I'm not too keen on treating "Æ" -> "AE" (2 bytes) though, as I can't replace in place then! It's the only exception, so perhaps it should just become "A" ... it does come immediately after the run of 6 other special A's, after all.
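
For comparison, here is a minimal sketch of the non-in-place alternative, where a replacement may be longer than one character. It assumes a 256-entry table of strings and builds a new result string instead of overwriting the input. The names `Expand()`, `InitExpand()` and `NormalizeExpand()` are made up for illustration, and the "ß" -> "ss" entry is an assumption, not part of the interpretation above.

```purebasic
; Hypothetical sketch: a string table allows multi-character replacements.
Global Dim Expand.s(255)

Procedure InitExpand()
  Protected i
  For i = 0 To 255
    Expand(i) = Chr(i)                   ; identity by default
  Next
  Expand(198) = "AE" : Expand(230) = "ae"  ; the two-byte cases
  Expand(223) = "ss"                       ; assumption: sharp s expands to "ss"
EndProcedure

Procedure.s NormalizeExpand(Text$)
  Protected Result$, i, c
  For i = 1 To Len(Text$)
    c = Asc(Mid(Text$, i, 1))
    If c < 256
      Result$ + Expand(c)                ; may append one or two characters
    Else
      Result$ + Chr(c)                   ; outside the table: keep unchanged
    EndIf
  Next
  ProcedureReturn Result$
EndProcedure

InitExpand()
Debug NormalizeExpand("C" + Chr(230) + "sar")   ; Caesar
```

The price is a new string per call, so this is likely to be much slower than any in-place approach.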

{update} oh nice, i thought of searching for "Case 217" site:purebasic.fr, as i figured it might be part of the code, and sure enough came up with a valid (but just the one!) French thread :)
http://www.purebasic.fr/french/viewtopi ... =1&t=11888
None look particularly efficient though in regards to doing thousands of strings, so i still might try my luck at an asm version :)
Last edited by Keya on Thu Nov 12, 2015 10:36 am, edited 2 times in total.
glomph
User
Posts: 48
Joined: Tue Apr 27, 2010 1:43 am
Location: St. Elsewhere / Germany

Re: Convert/normalize special characters like "ê" to regular

Post by glomph »

Code: Select all

Procedure.s convert(a$)
  a$=ReplaceString(a$,Chr(146),"'")
  a$=ReplaceString(a$,Chr(96),"'")
  a$=ReplaceString(a$,Chr(145),"'")
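  a$=ReplaceString(a$,Chr(154),"s")   ; š is lower case in Windows-1252
  a$=ReplaceString(a$,Chr(158),"z")   ; ž is lower case in Windows-1252
  a$=ReplaceString(a$,Chr(156),"ö")   ; œ is lower case; mapped to ö like ø below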
  
  a$=ReplaceString(a$,Chr(192),"A")
  a$=ReplaceString(a$,Chr(193),"A")
  a$=ReplaceString(a$,Chr(194),"A")
  a$=ReplaceString(a$,Chr(195),"A")
  a$=ReplaceString(a$,Chr(197),"A")
  a$=ReplaceString(a$,Chr(198),"Ae")
  a$=ReplaceString(a$,Chr(199),"C")
  
  a$=ReplaceString(a$,Chr(200),"E")   ; È
  a$=ReplaceString(a$,Chr(201),"E")
  a$=ReplaceString(a$,Chr(202),"E")
  a$=ReplaceString(a$,Chr(203),"E")
  a$=ReplaceString(a$,Chr(204),"I")
  a$=ReplaceString(a$,Chr(205),"I")
  a$=ReplaceString(a$,Chr(206),"I")
  a$=ReplaceString(a$,Chr(207),"I")   ; Ï
  a$=ReplaceString(a$,Chr(208),"Dh")
  a$=ReplaceString(a$,Chr(209),"N")
  
  a$=ReplaceString(a$,Chr(210),"O")
  a$=ReplaceString(a$,Chr(211),"O")
  a$=ReplaceString(a$,Chr(212),"O")
  a$=ReplaceString(a$,Chr(213),"O")
  a$=ReplaceString(a$,Chr(216),"Ö")
  a$=ReplaceString(a$,Chr(217),"U")
  a$=ReplaceString(a$,Chr(218),"U")
  a$=ReplaceString(a$,Chr(219),"U")
  
  a$=ReplaceString(a$,Chr(221),"Y")
  a$=ReplaceString(a$,Chr(222),"th")
  a$=ReplaceString(a$,Chr(224),"a")
  a$=ReplaceString(a$,Chr(225),"a")
  a$=ReplaceString(a$,Chr(226),"a")
  a$=ReplaceString(a$,Chr(227),"a")
  a$=ReplaceString(a$,Chr(229),"a")
  
  a$=ReplaceString(a$,Chr(230),"ae")
  a$=ReplaceString(a$,Chr(231),"c")   ; ç
  a$=ReplaceString(a$,Chr(232),"e")
  a$=ReplaceString(a$,Chr(233),"e")
  a$=ReplaceString(a$,Chr(234),"e")
  a$=ReplaceString(a$,Chr(235),"e")
  a$=ReplaceString(a$,Chr(236),"i")
  a$=ReplaceString(a$,Chr(237),"i")
  a$=ReplaceString(a$,Chr(238),"i")
  a$=ReplaceString(a$,Chr(239),"i")
  
  a$=ReplaceString(a$,Chr(240),"o")
  a$=ReplaceString(a$,Chr(241),"n")   ; ñ
  a$=ReplaceString(a$,Chr(242),"o")
  a$=ReplaceString(a$,Chr(243),"o")
  a$=ReplaceString(a$,Chr(244),"o")
  a$=ReplaceString(a$,Chr(245),"o")
  
  a$=ReplaceString(a$,Chr(248),"ö")
  a$=ReplaceString(a$,Chr(249),"u")
  a$=ReplaceString(a$,Chr(250),"u")
  a$=ReplaceString(a$,Chr(251),"u")
  a$=ReplaceString(a$,Chr(254),"th")   ; þ is the lower-case form of Chr(222)
  a$=ReplaceString(a$,Chr(34),"'")
  a$=ReplaceString(a$,"("," ")

ProcedureReturn a$
EndProcedure
German ÄÖÜäöüß not done.
No warranty
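
Most of the replacements above swap one character for one character, so ReplaceString()'s `#PB_String_InPlace` mode may be an option for those. A minimal sketch, assuming only same-length replacements are used (the mode requires the search and replacement strings to have equal length, so entries like "Ae" or "th" cannot go through it). `ConvertInPlace()` is a made-up name and only a handful of characters are shown.

```purebasic
; Hypothetical sketch: same-length replacements done directly in the string's memory.
Procedure.s ConvertInPlace(a$)
  ; a$ is a local copy of the argument, so it is safe to overwrite it here.
  ; The return value of ReplaceString() is not used in this mode.
  ReplaceString(a$, Chr(232), "e", #PB_String_InPlace)
  ReplaceString(a$, Chr(233), "e", #PB_String_InPlace)
  ReplaceString(a$, Chr(234), "e", #PB_String_InPlace)
  ReplaceString(a$, Chr(235), "e", #PB_String_InPlace)
  ProcedureReturn a$
EndProcedure

Debug ConvertInPlace("cr" + Chr(234) + "me")   ; creme
```

Each call still scans the whole string once, so the total work grows with the number of characters handled.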
wilbert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Convert/normalize special characters like "ê" to regular

Post by wilbert »

Keya wrote:None look particularly efficient though in regards to doing thousands of strings, so i still might try my luck at an asm version :)
The fastest method is probably a lookup table.
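
As a rough illustration of the lookup-table idea in plain PureBasic (no asm), here is a minimal sketch. It assumes a 256-entry byte table indexed by character code, so each character costs one array read regardless of how many mappings exist. All names (`NormTable()`, `FillRange()`, `InitNormTable()`, `NormalizeInPlace()`) are hypothetical, and the mapping is only a partial one.

```purebasic
; Hypothetical sketch of a 256-entry lookup table.
Global Dim NormTable.a(255)

Procedure FillRange(First, Last, Replacement)
  Protected i
  For i = First To Last
    NormTable(i) = Replacement
  Next
EndProcedure

Procedure InitNormTable()
  Protected i
  For i = 0 To 255
    NormTable(i) = i            ; identity: most characters stay as they are
  Next
  FillRange(192, 198, 'A')
  FillRange(200, 203, 'E')
  FillRange(204, 207, 'I')
  FillRange(210, 214, 'O')
  FillRange(217, 220, 'U')
  FillRange(224, 230, 'a')
  FillRange(232, 235, 'e')
  FillRange(236, 239, 'i')
  FillRange(242, 246, 'o')
  FillRange(249, 252, 'u')
  NormTable(199) = 'C' : NormTable(231) = 'c'
  NormTable(209) = 'N' : NormTable(241) = 'n'
EndProcedure

Procedure NormalizeInPlace(*p.Character)
  If *p = 0 : ProcedureReturn : EndIf   ; an empty string can have a null address
  While *p\c
    If *p\c < 256                       ; leave anything outside the table alone
      *p\c = NormTable(*p\c)
    EndIf
    *p + SizeOf(Character)              ; 1 byte in ASCII builds, 2 in Unicode builds
  Wend
EndProcedure

InitNormTable()
Test$ = "cr" + Chr(232) + "me br" + Chr(251) + "l" + Chr(233) + "e"
NormalizeInPlace(@Test$)
Debug Test$   ; creme brulee
```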
Windows (x64)
Raspberry Pi OS (Arm64)
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Convert/normalize special characters like "ê" to regular

Post by Keya »

Thank you very much glomph! :)
wilbert wrote:
Keya wrote:None look particularly efficient though in regards to doing thousands of strings, so i still might try my luck at an asm version :)
The fastest method is probably a lookup table.
By lookup table do you mean the XLATB instruction? I only used that for the first time about a month ago, so I'd like to try it again soon!
But my first try is the humble cmp, jmp, mov, x 1000 heehee :)
I'll try an XLATB version after that! It will be interesting to compare speeds.
I think I might have to make "db" data for all 256 chars, though? Not that that's an issue. I wish all asm tasks were this easy lol
I will post my code for your personal amusement shortly :D
Last edited by Keya on Thu Nov 12, 2015 10:37 am, edited 2 times in total.
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Convert/normalize special characters like "ê" to regular

Post by Keya »

Output of chars 180-255, where all the magic happens :)

Code: Select all

Before: ´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Output: ´u¶·¸¹º»¼½¾¿AAAAAAACEEEEIIIIDNOOOOOxOUUUUYbBaaaaaaaceeeeiiiionooooo÷ouuuuyby
It DOESN'T touch these 'extra special' characters, however: ƒ†ŠŒŽšœžŸ¥§
The reason is that they're all <180, so I was hoping to save execution time by being able to ignore that big chunk. But with my next version using a lookup table I'll translate them all.

Tested on 5.40 LTS Win32 and Mac OSX 64. wilbert, I used your sneaky register macro for the hybrid support; it's a really nice trick that makes life easy, thank you :) :)

I also divided the 180-255 range into halves to hopefully ~halve the number of cmp/jmp's on average, but I don't really know much about optimization (if that isn't obvious!) :)

Code: Select all

CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
  Macro rax : eax : EndMacro   ;thanks wilbert! "Sometimes the simple things in life..."
CompilerEndIf
  
  
Procedure NormalizeSpecialChars(*pbytes)   ;ptr to null-terminated string
  EnableASM        ;only uses eax and edx, no register preservation required
  mov rax, *pbytes
  
  !nextbyte:
     mov dl, [rax]   
   ! cmp dl, 181
   ! jae testspecialchar
   ! cmp dl, 0
   ! jz endproc
   CompilerIf #PB_Compiler_Unicode = 1
     add rax, 2
   CompilerElse
     inc rax
   CompilerEndIf
   ! jmp nextbyte
   
  !testspecialchar:
   ! cmp dl, 221
   ! jae Try2ndRange
   
  !Try1stRange:  
   
   ! cmp dl, 181
   ! jne Try192
   ! mov dl, 'u'
   ! jmp savenext
   
   !Try192:
   ! cmp dl, 192
   ! jnae endnextbyte
   ! cmp dl, 197
   ! ja Try198
   ! mov dl, 'A'
   ! jmp savenext
   
   !Try198:      
   ! cmp dl, 198
   ! jne Try199
   ! mov dl, 'A' 
   ! jmp savenext
   
   !Try199:
   ! cmp dl, 199
   ! jne Try200
   ! mov dl, 'C'
   ! jmp savenext
   
   !Try200:
   ! cmp dl, 204
   ! jae Try204
   ! mov dl, 'E'
   ! jmp savenext
   
   !Try204:
   ! cmp dl, 208
   ! jae Try208
   ! mov dl, 'I'
   ! jmp savenext
   
   !Try208:
   ! ja Try209
   ! mov dl, 'D'
   ! jmp savenext
   
   !Try209:
   ! cmp dl, 209
   ! jne Try210
   ! mov dl, 'N'
   ! jmp savenext
      
   !Try210:
   ! cmp dl, 215
   ! jae Try215
   ! mov dl, 'O'
   ! jmp savenext   
   
   !Try215:
   ! ja Try216
   ! mov dl, 'x'
   ! jmp savenext   
      
   !Try216:
   ! cmp dl, 216
   ! ja Try217
   ! mov dl, 'O'
   ! jmp savenext   
   
   !Try217: 
   ! mov dl, 'U'
   ! jmp savenext
   
   
  !Try2ndRange: 
   ! cmp dl, 221
   ! jne Try222
   ! mov dl, 'Y'
   ! jmp savenext
   
   !Try222:
   ! cmp dl, 222
   ! jne Try223
   ! mov dl, 'b'
   ! jmp savenext
   
   !Try223:
   ! cmp dl, 223
   ! jne Try224
   ! mov dl, 'B'
   ! jmp savenext
   
   !Try224:
   ! cmp dl, 230
   ! jae Try230
   ! mov dl, 'a'
   ! jmp savenext
   
   !Try230: 
   ! ja Try231
   ! mov dl, 'a' 
   ! jmp savenext
      
   !Try231:
   ! cmp dl, 232
   ! jae Try232
   ! mov dl, 'c'
   ! jmp savenext
   
   !Try232:
   ! cmp dl, 236
   ! jae Try236
   ! mov dl, 'e'
   ! jmp savenext
   
   !Try236:
   ! cmp dl, 240
   ! jae Try240
   ! mov dl, 'i'
   ! jmp savenext
   
   !Try240:
   ! ja Try241
   ! mov dl, 'o'
   ! jmp savenext
   
   !Try241:
   ! cmp dl, 241
   ! jne Try242
   ! mov dl, 'n'
   ! jmp savenext
   
   !Try242:
   ! cmp dl, 246
   ! jnbe Try248
   ! mov dl, 'o'
   ! jmp savenext
   
   !Try248:
   ! cmp dl, 248
   ! jne Try249
   ! mov dl, 'o'
   ! jmp savenext

   !Try249:
   ! cmp dl, 249
   ! jb endnextbyte
   ! cmp dl, 253
   ! jae Try253
   ! mov dl, 'u'
   ! jmp savenext
   
   !Try253:
   ! ja Try254
   ! mov dl, 'y'
   ! jmp savenext
   
   !Try254:
   ! cmp dl, 254
   ! ja Its255
   ! mov dl, 'b'
   ! jmp savenext
   !Its255:
   ! mov dl, 'y'
   
  !savenext:
     mov [rax], dl
  !endnextbyte:
   CompilerIf #PB_Compiler_Unicode = 1
     add rax, 2
   CompilerElse
     inc rax
   CompilerEndIf  
   ! jmp nextbyte
  !endproc:
  DisableASM
EndProcedure


For i = 180 To 255: sTest.s + Chr(i): Next i

Debug("Before: " + sTest)
NormalizeSpecialChars(@sTest)
Debug("Output: " + sTest)
btw is the above called a "jump table"?

I don't know if it's slow or fast; I look forward to trying some speed tests later.
OK, I'll try the XLATB lookup table version now :)
Last edited by Keya on Thu Nov 12, 2015 10:37 am, edited 3 times in total.
infratec
Always Here
Posts: 7625
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: Convert/normalize special characters like "ê" to regular

Post by infratec »

Hi,

that's a lookup table: :wink:

Code: Select all

Procedure.a NormalizeASCII(AsciiByte.a)
  
  Protected Result.a
  
  DataSection
    NormStart:
    Data.a $B4, $B5, $B6, $B7, $B8, $B9, $BA, $BB, $BC, $BD, $BE, $BF, $41, $41, $41
    Data.a $41, $41, $41, $41, $43, $45, $45, $45, $45, $49, $49, $49, $49, $44, $4E   ; 204-207 map to $49 ("I")
    Data.a $4F, $4F, $4F, $4F, $4F, $78, $4F, $55, $55, $55, $55, $59, $DE, $73, $61
    Data.a $61, $61, $61, $61, $61, $61, $63, $65, $65, $65, $65, $69, $69, $69, $69
    Data.a $F0, $6E, $6F, $6F, $6F, $6F, $6F, $F7, $6F, $75, $75, $75, $75, $79, $FE
    Data.a $79
  EndDataSection
  
  If AsciiByte > 179
    Result = PeekA(?NormStart + AsciiByte - 180)
  Else
    Result = AsciiByte
  EndIf
  
  ProcedureReturn Result
  
EndProcedure




Text$ = "´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
*Buffer = AllocateMemory(StringByteLength(Text$, #PB_Ascii))
If *Buffer
  PokeS(*Buffer, Text$, -1, #PB_Ascii|#PB_String_NoZero)
  
  For i = 0 To Len(Text$) - 1
    ;Debug Str(PeekA(*Buffer + i)) + ": " + Chr(NormalizeASCII(PeekA(*Buffer + i)))
    Norm$ + Chr(NormalizeASCII(PeekA(*Buffer + i)))
  Next i
  
  Debug Text$
  Debug Norm$
  
  FreeMemory(*Buffer)
EndIf
Now you can implement it in asm :mrgreen:

Bernd
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Convert/normalize special characters like "ê" to regular

Post by Keya »

thanks infratec!! :)
I just noticed yours when posting mine.

Hmm, this translation table makes my jump-based version look a little silly, lol. Good fun though!

Code: Select all

CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
  Macro rax : eax : EndMacro   ;thanks wilbert!
  Macro rbx : ebx : EndMacro   
  Macro rdx : edx : EndMacro   
CompilerEndIf
  

Procedure NormalizeSpecialChars(*pbytes, *ptable)   ;ptr to null-terminated string
  EnableASM                 ;uses *bx (needs preserving), and *ax, *dx
  mov rax, *ptable  
  mov rdx, *pbytes
  push rbx
  mov rbx, rax
  !nextbyte:
    mov al, [rdx]
  ! cmp al, 0
  ! je endproc
  ! xlatb
  mov [rdx], al
  CompilerIf #PB_Compiler_Unicode = 1
    add rdx, 2
  CompilerElse
    inc rdx
  CompilerEndIf
  ! jmp nextbyte  
  !endproc:  
  pop rbx
  DisableASM
EndProcedure

For i = 33 To 255: sTest.s + Chr(i): Next i

Debug("Before: " + sTest)
NormalizeSpecialChars(@sTest, ?NormalizedChars)
Debug("After : " + sTest)
End


DataSection
 NormalizedChars:
 !db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
 !db ' !"#$%&',39,'()*+,-./0123456789:',59,'<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~' 
 !db 127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160
 !db 'i¢£¤¥¦§¨©ª«¬­®¯°±²³´u¶·¸¹º»¼½¾¿AAAAAAACEEEEIIIIDNOOOOOxOUUUUYbBaaaaaaaceeeeiiiionooooo÷ouuuuyby'
EndDataSection
I will try some speed tests later but my mind needs a break for now heehee, whew!!! :)
Last edited by Keya on Thu Nov 12, 2015 10:38 am, edited 2 times in total.
wilbert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Convert/normalize special characters like "ê" to regular

Post by wilbert »

If you are going to support unicode, you might want to use a lookup table with 512 entries.
There's quite a few characters in range 256 - 511 that could be converted as well.
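
A minimal sketch of what such a 512-entry table could look like, assuming a Unicode build (in an ASCII build no character code can reach 256). The names `WideTable()`, `InitWideTable()` and `NormalizeWide()` are hypothetical, and only a few Latin Extended-A code points are filled in as examples. The 0-255 mappings would be added the same way.

```purebasic
; Hypothetical sketch: table covering character codes 0-511.
Global Dim WideTable.u(511)

Procedure InitWideTable()
  Protected i
  For i = 0 To 511
    WideTable(i) = i                      ; identity by default
  Next
  For i = 256 To 260 Step 2               ; U+0100-U+0105: A/a with macron, breve, ogonek
    WideTable(i) = 'A' : WideTable(i + 1) = 'a'
  Next
  For i = 262 To 268 Step 2               ; U+0106-U+010D: C/c with acute, circumflex, dot, caron
    WideTable(i) = 'C' : WideTable(i + 1) = 'c'
  Next
  WideTable(321) = 'L' : WideTable(322) = 'l'   ; U+0141/U+0142: L/l with stroke
  WideTable(352) = 'S' : WideTable(353) = 's'   ; U+0160/U+0161: S/s with caron
  WideTable(381) = 'Z' : WideTable(382) = 'z'   ; U+017D/U+017E: Z/z with caron
EndProcedure

Procedure NormalizeWide(*p.Character)
  If *p = 0 : ProcedureReturn : EndIf     ; an empty string can have a null address
  While *p\c
    If *p\c < 512                         ; anything above the table is left alone
      *p\c = WideTable(*p\c)
    EndIf
    *p + SizeOf(Character)
  Wend
EndProcedure

InitWideTable()
Test$ = Chr(352) + "koda"                 ; S with caron + "koda"
NormalizeWide(@Test$)
Debug Test$                               ; Skoda (in a Unicode build)
```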
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Convert/normalize special characters like "ê" to regular

Post by Keya »

I updated the code for both my versions to support Unicode, which was simply changing "inc rax" to:

Code: Select all

   CompilerIf #PB_Compiler_Unicode = 1
     add rax, 2   ;characters are 2 bytes wide; only the low byte of each is examined
   CompilerElse
     inc rax
   CompilerEndIf
Seems to be working fine!
I only want to convert the ones from the normal ASCII range, so nothing 256+ :)
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Convert/normalize special characters like "ê" to regular

Post by Keya »

Here are some speed tests, with results in milliseconds. The input is just a 255-character string containing every char from 1 to 255, and the conversion function is called on it half a million (500,000) times.

Code: Select all

1=glomph's convert()
2=Fc_TraitementChaine() by boddhi & falsam
3=infratec's NormalizeASCII()
4=Keya's Jumps
5=Keya's XLatb

Win-32
1 = 19980  (4976 when i updated its use of ReplaceString() to use #PB_String_InPlace!)
2 = 186750     ;<-- anomaly
3 = 46106
4 = 441
5 = 385

Win-64
1 = 18393
2 = 24343
3 = 45047
4 = 438
5 = 369

Linux-32
1 = 11281
2 = 22504
3 = 20521
4 = 333
5 = 275

Linux-64
1 = 11491
2 = 20323
3 = 18835
4 = 336
5 = 290

OSX-64
1 = 16463
2 = 15866
3 = 35567
4 = 436
5 = 365
There seems to be perhaps a compiler issue uncovered in the Fc_TraitementChaine() function on 32-bit Windows? I ran it several times to confirm it's always dramatically slow on Win32 compared to all the other OSes. Anyway, that's not a problem for me, and I'm happy with the results. Thank you everyone for the code and feedback! :)
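
For anyone wanting to repeat this kind of measurement, here is a minimal sketch of a timing harness along the lines described above. It is not the code used for the numbers in this post. `DummyConvert()` is a hypothetical stand-in for whichever routine is being timed, and the results are only meaningful when compiled with the debugger disabled.

```purebasic
#Iterations = 500000

; Stand-in for the routine being timed (hypothetical): upper-cases a-z in place
Procedure DummyConvert(*p.Character)
  If *p = 0 : ProcedureReturn : EndIf
  While *p\c
    If *p\c >= 'a' And *p\c <= 'z' : *p\c - 32 : EndIf
    *p + SizeOf(Character)
  Wend
EndProcedure

Define Test$, i, n, Start.q, Elapsed.q

; Build the 255-character test string: every code from 1 to 255
For i = 1 To 255
  Test$ + Chr(i)
Next

Start = ElapsedMilliseconds()
For n = 1 To #Iterations
  DummyConvert(@Test$)
Next
Elapsed = ElapsedMilliseconds() - Start

If OpenConsole()
  PrintN(Str(#Iterations) + " calls: " + Str(Elapsed) + " ms")
  Input()   ; wait for Enter so the console stays open
EndIf
```

After the first call the string is already converted, which does not matter for a table lookup but could flatter routines that do less work on clean input.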
IdeasVacuum
Always Here
Posts: 6426
Joined: Fri Oct 23, 2009 2:33 am
Location: Wales, UK

Re: Convert/normalize special characters like "ê" to regular

Post by IdeasVacuum »

This is a fascinating post, interesting solutions. Can I ask why you would need to do this?
IdeasVacuum
If it sounds simple, you have not grasped the complexity.
Demivec
Addict
Posts: 4270
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Convert/normalize special characters like "ê" to regular

Post by Demivec »

Depending on your uses (and the associated data) you may need to also deal with more than one code page's character mappings.
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Convert/normalize special characters like "ê" to regular

Post by Keya »

IdeasVacuum wrote:This is a fascinating post, interesting solutions. Can I ask why you would need to do this?
Hi IdeasVacuum, just to extend on what Demivec said: in my case I need to detect by comparison (as a simple example) words like "creme" and "crême" as being the same, and I've got lots of words to compare, so it needs to be fairly efficient. There are other approaches I can think of, such as string distance comparison, but then "crime" would be just as 'different/similar' to "creme" as "crême" is (so that's not really an option). In-place character 'normalizing' therefore seemed the most efficient approach for my needs. :)

In my case I've also gone a step further (not shown in the above code; the code is actually identical, just with a slightly different lookup table where I replaced "abc.." with "ABC.."), so that the same single pass also converts all characters to uppercase for case-insensitive comparison (as I need to confirm "Creme" and "creme" are also the same word). The lookup table makes it really easy to convert both "ê" and "e" to "E". So I get all characters case-desensitized + normalized in one pass, two birds one quick stone :)
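
The combined case-folding and normalizing table could look roughly like this in plain PureBasic. This is a minimal sketch, not the asm version described above. `FoldTable()`, `InitFoldTable()` and `Fold()` are hypothetical names, and only the E family is filled in to keep it short.

```purebasic
; Hypothetical sketch: one table that upper-cases and strips accents together.
Global Dim FoldTable.a(255)

Procedure InitFoldTable()
  Protected i
  For i = 0 To 255 : FoldTable(i) = i : Next         ; identity by default
  For i = 'a' To 'z' : FoldTable(i) = i - 32 : Next  ; a-z -> A-Z
  For i = 200 To 203 : FoldTable(i) = 'E' : Next     ; È É Ê Ë -> E
  For i = 232 To 235 : FoldTable(i) = 'E' : Next     ; è é ê ë -> E
EndProcedure

Procedure.s Fold(Text$)
  Protected *p.Character = @Text$   ; Text$ is a local copy, safe to overwrite
  If *p = 0 : ProcedureReturn "" : EndIf
  While *p\c
    If *p\c < 256 : *p\c = FoldTable(*p\c) : EndIf
    *p + SizeOf(Character)
  Wend
  ProcedureReturn Text$
EndProcedure

InitFoldTable()
Debug Fold("Cr" + Chr(234) + "me")                   ; CREME
If Fold("Cr" + Chr(234) + "me") = Fold("creme")
  Debug "same word"
EndIf
```

Because both words are reduced to the same key, a plain string comparison is enough afterwards, and "crime" still folds to a different key.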
wilbert
PureBasic Expert
Posts: 3943
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: Convert/normalize special characters like "ê" to regular

Post by wilbert »

Keya wrote:in my case i need to detect by comparison (for simple example) words like "creme" and "crême" as being the same (and ive got lots of words to compare so it needs to be fairly efficient).
Your approach is probably faster, but sometimes the OS can help.
OS X, for example, is capable of doing a diacritic-insensitive compare.
It looks like Windows might also be capable of this with CompareStringEx.