Page 1 of 2

Remove diacritic chars

Posted: Wed Feb 26, 2025 4:20 pm
by jacdelad
Hi #PB_All,
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so, but an elegant way. I found this PowerShell-Code, which works really great:

Code: Select all

function Remove-Diacritics {
    param ([String]$src = [String]::Empty)
    $normalized = $src.Normalize( [Text.NormalizationForm]::FormD )
    $sb = new-object Text.StringBuilder
    $normalized.ToCharArray() | ForEach-Object { 
        if( [Globalization.CharUnicodeInfo]::GetUnicodeCategory($PSItem) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$sb.Append($PSItem)
        }
    }
    $sb.ToString()
}
It basically separates the diacritics from the letters via the Normalize function and then creates an array, where only everything but the diacritics are stored ([Globalization.UnicodeCategory]::NonSpacingMark is left out). Then it combines the array back into a string.

Now, I cannot access these functions, of course. Before I start writing this for myself, has anyone already done this and wants to share?

Re: Remove diacritic chars

Posted: Wed Feb 26, 2025 5:08 pm
by acreis

Re: Remove diacritic chars

Posted: Wed Feb 26, 2025 6:13 pm
by breeze4me
Something like this?

Code: Select all

#NormalizationOther = 0
#NormalizationC = 1
#NormalizationD = 2
#NormalizationKC = 5
#NormalizationKD = 6

; https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-getstringtypew
Import "Kernel32.lib"
  GetStringTypeW.l(dwInfoType.l, *lpSrcStr, cchSrc.l, *lpCharType)
EndImport

; https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-normalizestring
Import "Normaliz.lib"
  NormalizeString.l(NormForm.l, SrcString.s, cwSrcLength.l, *lpDstString, cwDstLength.l)
EndImport


Structure ca
  c.c[0]
EndStructure

Procedure.s RemoveDiacritics(String$)
  Protected i, len = Len(String$)
  Protected *buffer.ca
  Protected Result$
  
  If len = 0 : ProcedureReturn "" : EndIf
  
  len = NormalizeString(#NormalizationD, String$, -1, 0, 0)
  If len > 0
    *buffer = AllocateMemory(len * SizeOf(Character))
    If *buffer
      len = NormalizeString(#NormalizationD, String$, -1, *buffer, len)
      If len > 1
        len - 1
        
        Protected Dim type.u(len)
        
        If GetStringTypeW(#CT_CTYPE3, *buffer, len, @type())
          
          For i = 0 To len
            If type(i) & (#C3_DIACRITIC | #C3_NONSPACING) = 0
              Result$ + Chr(*buffer\c[i])
            EndIf
          Next
          
        EndIf
      EndIf
      FreeMemory(*buffer)
    EndIf
  EndIf
  ProcedureReturn Result$
EndProcedure

Debug RemoveDiacritics("ñññ")
Debug RemoveDiacritics("ÅÅÅ")


Re: Remove diacritic chars

Posted: Wed Feb 26, 2025 8:21 pm
by idle
you can use the UTF16a.pbi it does the striping
https://github.com/idle-PB/UTF16

but I like breezes solution few less lines of code

Re: Remove diacritic chars

Posted: Wed Feb 26, 2025 8:51 pm
by jacdelad
Ah yes, thanks all of you. This is very helpful for search functions!
Also, I think this is worthy a thread in Tipps&Tricks.
And also, I don't have to do this by myself. :mrgreen:

Re: Remove diacritic chars

Posted: Fri Feb 28, 2025 7:10 pm
by SMaag
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so
Why not a simple ReplcaceChar Function with a Select Case to exchange the Chars.

- it is simple and everyone can see how it works, everyone can modify for personal use
- it is cross plattform
- it is fast

Re: Remove diacritic chars

Posted: Fri Feb 28, 2025 7:31 pm
by Mindphazer
I used such a function in one of my programs...
It's specific to french language, but can be easily adapted

Code: Select all

Procedure.s ReplaceAccent(Texte.s)
  Protected i, Char.s
  For i = 1 To Len(Texte)
    Char = Mid(Texte, i, 1)
    Select LCase(Char)
      Case "é", "è", "ê", "ë"
        ReplaceString(Texte, Char, "e", #PB_String_InPlace)
      Case "à", "ä", "â"
        ReplaceString(Texte, Char, "a", #PB_String_InPlace)
      Case "ù", "ü", "û"
        ReplaceString(Texte, Char, "u", #PB_String_InPlace)
      Case "ï", "î"
        ReplaceString(Texte, Char, "i", #PB_String_InPlace)
      Case "ô", "ö"
        ReplaceString(Texte, Char, "o", #PB_String_InPlace)
    EndSelect
  Next i
  ProcedureReturn Texte
EndProcedure  

Re: Remove diacritic chars

Posted: Fri Feb 28, 2025 7:53 pm
by AZJIO
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm I used such a function in one of my programs...
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm

Code: Select all

For i = 1 To Len(Texte)
You calculate the length of the line on each iteration
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm

Code: Select all

ReplaceString(Texte, Char, "e", #PB_String_InPlace)
You can use the "i" variable as the "StartPosition" parameter to make a shift to the desired position, instead of searching from the beginning.
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm

Code: Select all

LCase(Char)
You lose the letter register (case).

I used character-by-character string parsing

Code: Select all

Procedure ReplaceAccent(*c.Character)
	If *c = 0 Or *c\c = 0
		ProcedureReturn 0
	EndIf
	
	While *c\c
		Select *c\c
			Case 'é', 'è', 'ê', 'ë'
				*c\c = 'e'
			Case 'à', 'ä', 'â'
				*c\c = 'a'
			Case 'ù', 'ü', 'û'
				*c\c = 'u'
			Case 'ï', 'î'
				*c\c = 'i'
			Case 'ô', 'ö'
				*c\c = 'o'
			Case 'É', 'È', 'Ê', 'Ë'
				*c\c = 'E'
			Case 'À', 'Ä', 'Â'
				*c\c = 'A'
			Case 'Ù', 'Ü', 'Û'
				*c\c = 'U'
			Case 'Ï', 'Î'
				*c\c = 'I'
			Case 'Ô', 'Ö'
				*c\c = 'O'
		EndSelect
		*c + SizeOf(Character)
	Wend
EndProcedure

tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
ReplaceAccent(@tmp)
Debug tmp
Can be specified by ASCII-Codes ranges

Code: Select all

Procedure ReplaceAccent(*c.Character)
	If *c = 0 Or *c\c = 0
		ProcedureReturn 0
	EndIf
	
	While *c\c
		Select *c\c
			Case 'è' To 'ë'	; 232-235
				*c\c = 'e'
			Case 'à' To 'å' ; 224-229
				*c\c = 'a'
			Case 'ù' To 'ü'	; 249-252
				*c\c = 'u'
			Case 'ì' To 'ï'	; 236-239
				*c\c = 'i'
			Case 'ò' To 'ö'	; 242-246
				*c\c = 'o'
			Case 'È' To 'Ë' ; 200-203
				*c\c = 'E'
			Case 'À' To 'Å' ; 192-197
				*c\c = 'A'
			Case 'Ù' To 'Ü' ; 217-220
				*c\c = 'U'
			Case 'Ì' To 'Ï' ; 204-207
				*c\c = 'I'
			Case 'Ò' To 'Ö' ; 210-214
				*c\c = 'O'
		EndSelect
		*c + SizeOf(Character)
	Wend
EndProcedure

Define tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
ReplaceAccent(@tmp)
Debug tmp
ASCII-Codes

Re: Remove diacritic chars

Posted: Fri Feb 28, 2025 8:40 pm
by SMaag

Code: Select all


Procedure.s ReplaceAccents(String$)
  Protected *c.Character
  
  *c = @String$  
  If *c 	
    While *c\c
      If *c\c >= 192   ; Accents start at 192 with 'À'           
        Select *c\c
          Case 192 To 198   ; 'A' with different accents
            *c\c = 'A'
          Case  224 To 230
            *c\c = 'a'
            
          Case 200 To 203   ; 'E' with different accents
            *c\c = 'E'           
          Case 232 To 235
            *c\c = 'e'
            
          Case 204 To 207   ; 'I' with different accents
            *c\c = 'I'
          Case 236 To 239
            *c\c = 'i'
            
          Case 210 To 214   ; 'O' with different accents
    				*c\c = 'O'
          Case 242 To 246
    				*c\c = 'o'
           
          Case 217 To 220   ; 'U' with different accents
    				*c\c = 'U'
          Case 249 To 252
    				*c\c = 'u'
    				
    		  Case 221          ; 'Y' with different accents
    			  *c\c = 'Y'
    			Case 253, 255
     				*c\c = 'y'
     				
     			Case 209          ;  'N' with different accents 
     				*c\c = 'N'
     			Case 241
      	    *c\c = 'n'
    			  
     		EndSelect
    	EndIf     	
    	*c + SizeOf(Character)
  	Wend
  EndIf
  ProcedureReturn String$
EndProcedure

tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
res$ = ReplaceAccents(tmp)
Debug res$

Re: Remove diacritic chars

Posted: Fri Feb 28, 2025 10:29 pm
by idle
I don't know why people don't just use the uft16 module.
https://github.com/idle-PB/UTF16/blob/main/UTF16a.pb

Unicode is complicated and believe it or not this module is a shortcut to process UTF16 and yes it's long but that's so it's Fast which is the main goal for string processing.

UTF16 StrCmp(s3,s4,#CASESIMPLE) 84 ms for 1,000,000
PB CompareMemoryString(@s3,@s4,#PB_String_NoCase) 101 ms for 1,000,000

UTF16 StrLCase / StrUcase 63 ms for 1,000,000
PB LCase / UCase 410 ms for 1,000,000

Re: Remove diacritic chars

Posted: Sat Mar 01, 2025 10:12 pm
by jacdelad
SMaag wrote: Fri Feb 28, 2025 7:10 pm
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so
Why not a simple ReplcaceChar Function with a Select Case to exchange the Chars.

- it is simple and everyone can see how it works, everyone can modify for personal use
- it is cross plattform
- it is fast
Because it's a lot of chars to be replaced.

Re: Remove diacritic chars

Posted: Sat Mar 01, 2025 11:50 pm
by idle
the module also does it in place without copying the string
so the ucase lcase and strip functions are considerably faster and branchless compared to using select
maybe I should just turn it into a user lib so the eyes don't burn from lots of code and lookup tables
I invite anyone to do it faster, go ahead and make my ascii day!

Code: Select all

Procedure StrStripAccents_(*in.Unicode)  ;changes the case of the string inplace 
    
    Protected *char.Unicode,tchar.u 
    *char = *in 
    While *char\u  
      If casemappingCD(*char\u)\cdl <> 0 
        *char\u = casemappingCD(*char\u)\cdl 
      EndIf 
      *char+2 
    Wend 
    
  EndProcedure  
  
  StrStripAccents = @StrStripAccents_() ;prototype so you pass in string without Address = no copy or need to parse the string for its length 


Re: Remove diacritic chars

Posted: Sun Mar 02, 2025 7:09 am
by jacdelad
Hi idle,
after all this promotion I'll take a closer look especially at your module. :D

Re: Remove diacritic chars

Posted: Sun Mar 02, 2025 5:57 pm
by Zapman
I don't know if it can help, but I've made that :

Code: Select all

  Procedure.s RemoveAccents(Text$)
    ; Function to remove accents from a string. By Zapman.
    If Text$
      Protected length = Len(Text$) * 2
      Protected char$, ct, Result$
      Protected NormalizedText$ = Space(Length)
      ;
      Length = FoldString_(#MAP_COMPOSITE, @Text$, - 1, @NormalizedText$, Length) - 1
      ;
      If Length > 0 And Length <> Len(Text$)
        For ct = 1 To Length
          char$ = Mid(NormalizedText$, ct, 1)
          If Asc(char$) < 128
            Result$ + char$
          EndIf
        Next
      Else
        Result$ = Text$
      EndIf
      ;
      ProcedureReturn Result$
    EndIf
  EndProcedure
The FoldString() function separates accented characters into two characters: the first without an accent and the second as a diacritic.
Then the string is examined character by character and the diacritics are removed.

Re: Remove diacritic chars

Posted: Mon Mar 03, 2025 11:50 am
by Zapman
The preceeding version of my procedure works well for most of european countries, but it won't work for russian, for example.
Here is an improved version which should work for every language:

Code: Select all

Procedure.s RemoveAccents(Text$)
    ; Function to remove accents from a string. By Zapman.
    If Text$
      Protected length = Len(Text$) * 2
      Protected Ochar$, Nchar$, OPos, NPos, Result$
      Protected NormalizedText$ = Space(Length)
      ;
      ; FoldString_() will replace each accentuated character by a pair of characters
      ; as this: (NonAccentuatedCharacter) + (diacritic)
      Length = FoldString_(#MAP_COMPOSITE, @Text$, - 1, @NormalizedText$, Length) - 1
      ;
      ; Examine the result:
      If Length > 0 And Length <> Len(Text$)
        For NPos = 1 To Length
          OPos + 1
          Ochar$ = Mid(Text$, OPos, 1)
          Nchar$ = Mid(NormalizedText$, NPos, 1)
          Result$ + Nchar$
          If Ochar$ <> Nchar$
            ; If the character has been replaced, it means that the following character
            ; contains the diacritic. Jump over it:
            NPos + 1
          EndIf
        Next
      Else
        ; No replacement has been made.
        Result$ = Text$
      EndIf
      ;
      ProcedureReturn Result$
    EndIf
  EndProcedure
  
  Debug "Demo: ''Hétérogène'' --> ''" + RemoveAccents("Hétérogène") + "''."
Here is the Microsoft documentation about FoldString():
https://learn.microsoft.com/en-us/windo ... oldstringw