Remove diacritic chars

Just starting out? Need help? Post your questions and find answers here.
User avatar
jacdelad
Addict
Addict
Posts: 2010
Joined: Wed Feb 03, 2021 12:46 pm
Location: Riesa

Remove diacritic chars

Post by jacdelad »

Hi #PB_All,
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so, but an elegant way. I found this PowerShell-Code, which works really great:

Code: Select all

function Remove-Diacritics {
    param ([String]$src = [String]::Empty)
    $normalized = $src.Normalize( [Text.NormalizationForm]::FormD )
    $sb = new-object Text.StringBuilder
    $normalized.ToCharArray() | ForEach-Object { 
        if( [Globalization.CharUnicodeInfo]::GetUnicodeCategory($PSItem) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$sb.Append($PSItem)
        }
    }
    $sb.ToString()
}
It basically separates the diacritics from the letters via the Normalize function and then creates an array, where only everything but the diacritics are stored ([Globalization.UnicodeCategory]::NonSpacingMark is left out). Then it combines the array back into a string.

Now, I cannot access these functions, of course. Before I start writing this for myself, has anyone already done this and wants to share?
Good morning, that's a nice tnetennba!

PureBasic 6.21/Windows 11 x64/Ryzen 7900X/32GB RAM/3TB SSD
Synology DS1821+/DX517, 130.9TB+50.8TB+2TB SSD
acreis
Enthusiast
Enthusiast
Posts: 204
Joined: Fri Jun 01, 2012 12:20 am

Re: Remove diacritic chars

Post by acreis »

breeze4me
Enthusiast
Enthusiast
Posts: 633
Joined: Thu Mar 09, 2006 9:24 am
Location: S. Kor

Re: Remove diacritic chars

Post by breeze4me »

Something like this?

Code: Select all

#NormalizationOther = 0
#NormalizationC = 1
#NormalizationD = 2
#NormalizationKC = 5
#NormalizationKD = 6

; https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-getstringtypew
Import "Kernel32.lib"
  GetStringTypeW.l(dwInfoType.l, *lpSrcStr, cchSrc.l, *lpCharType)
EndImport

; https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-normalizestring
Import "Normaliz.lib"
  NormalizeString.l(NormForm.l, SrcString.s, cwSrcLength.l, *lpDstString, cwDstLength.l)
EndImport


Structure ca
  c.c[0]
EndStructure

Procedure.s RemoveDiacritics(String$)
  Protected i, len = Len(String$)
  Protected *buffer.ca
  Protected Result$
  
  If len = 0 : ProcedureReturn "" : EndIf
  
  len = NormalizeString(#NormalizationD, String$, -1, 0, 0)
  If len > 0
    *buffer = AllocateMemory(len * SizeOf(Character))
    If *buffer
      len = NormalizeString(#NormalizationD, String$, -1, *buffer, len)
      If len > 1
        len - 1
        
        Protected Dim type.u(len)
        
        If GetStringTypeW(#CT_CTYPE3, *buffer, len, @type())
          
          For i = 0 To len
            If type(i) & (#C3_DIACRITIC | #C3_NONSPACING) = 0
              Result$ + Chr(*buffer\c[i])
            EndIf
          Next
          
        EndIf
      EndIf
      FreeMemory(*buffer)
    EndIf
  EndIf
  ProcedureReturn Result$
EndProcedure

Debug RemoveDiacritics("ñññ")
Debug RemoveDiacritics("ÅÅÅ")

User avatar
idle
Always Here
Always Here
Posts: 5915
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Remove diacritic chars

Post by idle »

you can use the UTF16a.pbi it does the striping
https://github.com/idle-PB/UTF16

but I like breezes solution few less lines of code
User avatar
jacdelad
Addict
Addict
Posts: 2010
Joined: Wed Feb 03, 2021 12:46 pm
Location: Riesa

Re: Remove diacritic chars

Post by jacdelad »

Ah yes, thanks all of you. This is very helpful for search functions!
Also, I think this is worthy a thread in Tipps&Tricks.
And also, I don't have to do this by myself. :mrgreen:
Good morning, that's a nice tnetennba!

PureBasic 6.21/Windows 11 x64/Ryzen 7900X/32GB RAM/3TB SSD
Synology DS1821+/DX517, 130.9TB+50.8TB+2TB SSD
SMaag
Enthusiast
Enthusiast
Posts: 325
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Re: Remove diacritic chars

Post by SMaag »

I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so
Why not a simple ReplcaceChar Function with a Select Case to exchange the Chars.

- it is simple and everyone can see how it works, everyone can modify for personal use
- it is cross plattform
- it is fast
User avatar
Mindphazer
Enthusiast
Enthusiast
Posts: 460
Joined: Mon Sep 10, 2012 10:41 am
Location: Savoie

Re: Remove diacritic chars

Post by Mindphazer »

I used such a function in one of my programs...
It's specific to french language, but can be easily adapted

Code: Select all

Procedure.s ReplaceAccent(Texte.s)
  Protected i, Char.s
  For i = 1 To Len(Texte)
    Char = Mid(Texte, i, 1)
    Select LCase(Char)
      Case "é", "è", "ê", "ë"
        ReplaceString(Texte, Char, "e", #PB_String_InPlace)
      Case "à", "ä", "â"
        ReplaceString(Texte, Char, "a", #PB_String_InPlace)
      Case "ù", "ü", "û"
        ReplaceString(Texte, Char, "u", #PB_String_InPlace)
      Case "ï", "î"
        ReplaceString(Texte, Char, "i", #PB_String_InPlace)
      Case "ô", "ö"
        ReplaceString(Texte, Char, "o", #PB_String_InPlace)
    EndSelect
  Next i
  ProcedureReturn Texte
EndProcedure  
MacBook Pro 16" M4 Pro - 24 Gb - MacOS 15.4.1 - Iphone 15 Pro Max - iPad at home
...and unfortunately... Windows at work...
AZJIO
Addict
Addict
Posts: 2191
Joined: Sun May 14, 2017 1:48 am

Re: Remove diacritic chars

Post by AZJIO »

Mindphazer wrote: Fri Feb 28, 2025 7:31 pm I used such a function in one of my programs...
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm

Code: Select all

For i = 1 To Len(Texte)
You calculate the length of the line on each iteration
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm

Code: Select all

ReplaceString(Texte, Char, "e", #PB_String_InPlace)
You can use the "i" variable as the "StartPosition" parameter to make a shift to the desired position, instead of searching from the beginning.
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm

Code: Select all

LCase(Char)
You lose the letter register (case).

I used character-by-character string parsing

Code: Select all

Procedure ReplaceAccent(*c.Character)
	If *c = 0 Or *c\c = 0
		ProcedureReturn 0
	EndIf
	
	While *c\c
		Select *c\c
			Case 'é', 'è', 'ê', 'ë'
				*c\c = 'e'
			Case 'à', 'ä', 'â'
				*c\c = 'a'
			Case 'ù', 'ü', 'û'
				*c\c = 'u'
			Case 'ï', 'î'
				*c\c = 'i'
			Case 'ô', 'ö'
				*c\c = 'o'
			Case 'É', 'È', 'Ê', 'Ë'
				*c\c = 'E'
			Case 'À', 'Ä', 'Â'
				*c\c = 'A'
			Case 'Ù', 'Ü', 'Û'
				*c\c = 'U'
			Case 'Ï', 'Î'
				*c\c = 'I'
			Case 'Ô', 'Ö'
				*c\c = 'O'
		EndSelect
		*c + SizeOf(Character)
	Wend
EndProcedure

tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
ReplaceAccent(@tmp)
Debug tmp
Can be specified by ASCII-Codes ranges

Code: Select all

Procedure ReplaceAccent(*c.Character)
	If *c = 0 Or *c\c = 0
		ProcedureReturn 0
	EndIf
	
	While *c\c
		Select *c\c
			Case 'è' To 'ë'	; 232-235
				*c\c = 'e'
			Case 'à' To 'å' ; 224-229
				*c\c = 'a'
			Case 'ù' To 'ü'	; 249-252
				*c\c = 'u'
			Case 'ì' To 'ï'	; 236-239
				*c\c = 'i'
			Case 'ò' To 'ö'	; 242-246
				*c\c = 'o'
			Case 'È' To 'Ë' ; 200-203
				*c\c = 'E'
			Case 'À' To 'Å' ; 192-197
				*c\c = 'A'
			Case 'Ù' To 'Ü' ; 217-220
				*c\c = 'U'
			Case 'Ì' To 'Ï' ; 204-207
				*c\c = 'I'
			Case 'Ò' To 'Ö' ; 210-214
				*c\c = 'O'
		EndSelect
		*c + SizeOf(Character)
	Wend
EndProcedure

Define tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
ReplaceAccent(@tmp)
Debug tmp
ASCII-Codes
SMaag
Enthusiast
Enthusiast
Posts: 325
Joined: Sat Jan 14, 2023 6:55 pm
Location: Bavaria/Germany

Re: Remove diacritic chars

Post by SMaag »

Code: Select all


Procedure.s ReplaceAccents(String$)
  Protected *c.Character
  
  *c = @String$  
  If *c 	
    While *c\c
      If *c\c >= 192   ; Accents start at 192 with 'À'           
        Select *c\c
          Case 192 To 198   ; 'A' with different accents
            *c\c = 'A'
          Case  224 To 230
            *c\c = 'a'
            
          Case 200 To 203   ; 'E' with different accents
            *c\c = 'E'           
          Case 232 To 235
            *c\c = 'e'
            
          Case 204 To 207   ; 'I' with different accents
            *c\c = 'I'
          Case 236 To 239
            *c\c = 'i'
            
          Case 210 To 214   ; 'O' with different accents
    				*c\c = 'O'
          Case 242 To 246
    				*c\c = 'o'
           
          Case 217 To 220   ; 'U' with different accents
    				*c\c = 'U'
          Case 249 To 252
    				*c\c = 'u'
    				
    		  Case 221          ; 'Y' with different accents
    			  *c\c = 'Y'
    			Case 253, 255
     				*c\c = 'y'
     				
     			Case 209          ;  'N' with different accents 
     				*c\c = 'N'
     			Case 241
      	    *c\c = 'n'
    			  
     		EndSelect
    	EndIf     	
    	*c + SizeOf(Character)
  	Wend
  EndIf
  ProcedureReturn String$
EndProcedure

tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
res$ = ReplaceAccents(tmp)
Debug res$
User avatar
idle
Always Here
Always Here
Posts: 5915
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Remove diacritic chars

Post by idle »

I don't know why people don't just use the uft16 module.
https://github.com/idle-PB/UTF16/blob/main/UTF16a.pb

Unicode is complicated and believe it or not this module is a shortcut to process UTF16 and yes it's long but that's so it's Fast which is the main goal for string processing.

UTF16 StrCmp(s3,s4,#CASESIMPLE) 84 ms for 1,000,000
PB CompareMemoryString(@s3,@s4,#PB_String_NoCase) 101 ms for 1,000,000

UTF16 StrLCase / StrUcase 63 ms for 1,000,000
PB LCase / UCase 410 ms for 1,000,000
User avatar
jacdelad
Addict
Addict
Posts: 2010
Joined: Wed Feb 03, 2021 12:46 pm
Location: Riesa

Re: Remove diacritic chars

Post by jacdelad »

SMaag wrote: Fri Feb 28, 2025 7:10 pm
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so
Why not a simple ReplcaceChar Function with a Select Case to exchange the Chars.

- it is simple and everyone can see how it works, everyone can modify for personal use
- it is cross plattform
- it is fast
Because it's a lot of chars to be replaced.
Good morning, that's a nice tnetennba!

PureBasic 6.21/Windows 11 x64/Ryzen 7900X/32GB RAM/3TB SSD
Synology DS1821+/DX517, 130.9TB+50.8TB+2TB SSD
User avatar
idle
Always Here
Always Here
Posts: 5915
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: Remove diacritic chars

Post by idle »

the module also does it in place without copying the string
so the ucase lcase and strip functions are considerably faster and branchless compared to using select
maybe I should just turn it into a user lib so the eyes don't burn from lots of code and lookup tables
I invite anyone to do it faster, go ahead and make my ascii day!

Code: Select all

Procedure StrStripAccents_(*in.Unicode)  ;changes the case of the string inplace 
    
    Protected *char.Unicode,tchar.u 
    *char = *in 
    While *char\u  
      If casemappingCD(*char\u)\cdl <> 0 
        *char\u = casemappingCD(*char\u)\cdl 
      EndIf 
      *char+2 
    Wend 
    
  EndProcedure  
  
  StrStripAccents = @StrStripAccents_() ;prototype so you pass in string without Address = no copy or need to parse the string for its length 

User avatar
jacdelad
Addict
Addict
Posts: 2010
Joined: Wed Feb 03, 2021 12:46 pm
Location: Riesa

Re: Remove diacritic chars

Post by jacdelad »

Hi idle,
after all this promotion I'll take a closer look especially at your module. :D
Good morning, that's a nice tnetennba!

PureBasic 6.21/Windows 11 x64/Ryzen 7900X/32GB RAM/3TB SSD
Synology DS1821+/DX517, 130.9TB+50.8TB+2TB SSD
User avatar
Zapman
Enthusiast
Enthusiast
Posts: 205
Joined: Tue Jan 07, 2020 7:27 pm

Re: Remove diacritic chars

Post by Zapman »

I don't know if it can help, but I've made that :

Code: Select all

  Procedure.s RemoveAccents(Text$)
    ; Function to remove accents from a string. By Zapman.
    If Text$
      Protected length = Len(Text$) * 2
      Protected char$, ct, Result$
      Protected NormalizedText$ = Space(Length)
      ;
      Length = FoldString_(#MAP_COMPOSITE, @Text$, - 1, @NormalizedText$, Length) - 1
      ;
      If Length > 0 And Length <> Len(Text$)
        For ct = 1 To Length
          char$ = Mid(NormalizedText$, ct, 1)
          If Asc(char$) < 128
            Result$ + char$
          EndIf
        Next
      Else
        Result$ = Text$
      EndIf
      ;
      ProcedureReturn Result$
    EndIf
  EndProcedure
The FoldString() function separates accented characters into two characters: the first without an accent and the second as a diacritic.
Then the string is examined character by character and the diacritics are removed.
User avatar
Zapman
Enthusiast
Enthusiast
Posts: 205
Joined: Tue Jan 07, 2020 7:27 pm

Re: Remove diacritic chars

Post by Zapman »

The preceeding version of my procedure works well for most of european countries, but it won't work for russian, for example.
Here is an improved version which should work for every language:

Code: Select all

Procedure.s RemoveAccents(Text$)
    ; Function to remove accents from a string. By Zapman.
    If Text$
      Protected length = Len(Text$) * 2
      Protected Ochar$, Nchar$, OPos, NPos, Result$
      Protected NormalizedText$ = Space(Length)
      ;
      ; FoldString_() will replace each accentuated character by a pair of characters
      ; as this: (NonAccentuatedCharacter) + (diacritic)
      Length = FoldString_(#MAP_COMPOSITE, @Text$, - 1, @NormalizedText$, Length) - 1
      ;
      ; Examine the result:
      If Length > 0 And Length <> Len(Text$)
        For NPos = 1 To Length
          OPos + 1
          Ochar$ = Mid(Text$, OPos, 1)
          Nchar$ = Mid(NormalizedText$, NPos, 1)
          Result$ + Nchar$
          If Ochar$ <> Nchar$
            ; If the character has been replaced, it means that the following character
            ; contains the diacritic. Jump over it:
            NPos + 1
          EndIf
        Next
      Else
        ; No replacement has been made.
        Result$ = Text$
      EndIf
      ;
      ProcedureReturn Result$
    EndIf
  EndProcedure
  
  Debug "Demo: ''Hétérogène'' --> ''" + RemoveAccents("Hétérogène") + "''."
Here is the Microsoft documentation about FoldString():
https://learn.microsoft.com/en-us/windo ... oldstringw
Post Reply