Page 1 of 2
Remove diacritic chars
Posted: Wed Feb 26, 2025 4:20 pm
by jacdelad
Hi #PB_All,
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so, but an elegant way. I found this PowerShell-Code, which works really great:
Code: Select all
function Remove-Diacritics {
param ([String]$src = [String]::Empty)
$normalized = $src.Normalize( [Text.NormalizationForm]::FormD )
$sb = new-object Text.StringBuilder
$normalized.ToCharArray() | ForEach-Object {
if( [Globalization.CharUnicodeInfo]::GetUnicodeCategory($PSItem) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
[void]$sb.Append($PSItem)
}
}
$sb.ToString()
}
It basically separates the diacritics from the letters via the Normalize function and then creates an array, where only everything but the diacritics are stored ([Globalization.UnicodeCategory]::NonSpacingMark is left out). Then it combines the array back into a string.
Now, I cannot access these functions, of course. Before I start writing this for myself, has anyone already done this and wants to share?
Re: Remove diacritic chars
Posted: Wed Feb 26, 2025 5:08 pm
by acreis
Re: Remove diacritic chars
Posted: Wed Feb 26, 2025 6:13 pm
by breeze4me
Something like this?
Code: Select all
#NormalizationOther = 0
#NormalizationC = 1
#NormalizationD = 2
#NormalizationKC = 5
#NormalizationKD = 6
; https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-getstringtypew
Import "Kernel32.lib"
GetStringTypeW.l(dwInfoType.l, *lpSrcStr, cchSrc.l, *lpCharType)
EndImport
; https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-normalizestring
Import "Normaliz.lib"
NormalizeString.l(NormForm.l, SrcString.s, cwSrcLength.l, *lpDstString, cwDstLength.l)
EndImport
Structure ca
c.c[0]
EndStructure
Procedure.s RemoveDiacritics(String$)
Protected i, len = Len(String$)
Protected *buffer.ca
Protected Result$
If len = 0 : ProcedureReturn "" : EndIf
len = NormalizeString(#NormalizationD, String$, -1, 0, 0)
If len > 0
*buffer = AllocateMemory(len * SizeOf(Character))
If *buffer
len = NormalizeString(#NormalizationD, String$, -1, *buffer, len)
If len > 1
len - 1
Protected Dim type.u(len)
If GetStringTypeW(#CT_CTYPE3, *buffer, len, @type())
For i = 0 To len
If type(i) & (#C3_DIACRITIC | #C3_NONSPACING) = 0
Result$ + Chr(*buffer\c[i])
EndIf
Next
EndIf
EndIf
FreeMemory(*buffer)
EndIf
EndIf
ProcedureReturn Result$
EndProcedure
Debug RemoveDiacritics("ñññ")
Debug RemoveDiacritics("ÅÅÅ")
Re: Remove diacritic chars
Posted: Wed Feb 26, 2025 8:21 pm
by idle
you can use the UTF16a.pbi it does the striping
https://github.com/idle-PB/UTF16
but I like breezes solution few less lines of code
Re: Remove diacritic chars
Posted: Wed Feb 26, 2025 8:51 pm
by jacdelad
Ah yes, thanks all of you. This is very helpful for search functions!
Also, I think this is worthy a thread in Tipps&Tricks.
And also, I don't have to do this by myself.

Re: Remove diacritic chars
Posted: Fri Feb 28, 2025 7:10 pm
by SMaag
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so
Why not a simple ReplcaceChar Function with a Select Case to exchange the Chars.
- it is simple and everyone can see how it works, everyone can modify for personal use
- it is cross plattform
- it is fast
Re: Remove diacritic chars
Posted: Fri Feb 28, 2025 7:31 pm
by Mindphazer
I used such a function in one of my programs...
It's specific to french language, but can be easily adapted
Code: Select all
Procedure.s ReplaceAccent(Texte.s)
Protected i, Char.s
For i = 1 To Len(Texte)
Char = Mid(Texte, i, 1)
Select LCase(Char)
Case "é", "è", "ê", "ë"
ReplaceString(Texte, Char, "e", #PB_String_InPlace)
Case "à", "ä", "â"
ReplaceString(Texte, Char, "a", #PB_String_InPlace)
Case "ù", "ü", "û"
ReplaceString(Texte, Char, "u", #PB_String_InPlace)
Case "ï", "î"
ReplaceString(Texte, Char, "i", #PB_String_InPlace)
Case "ô", "ö"
ReplaceString(Texte, Char, "o", #PB_String_InPlace)
EndSelect
Next i
ProcedureReturn Texte
EndProcedure
Re: Remove diacritic chars
Posted: Fri Feb 28, 2025 7:53 pm
by AZJIO
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm
I used such a function in one of my programs...
You calculate the length of the line on each iteration
Mindphazer wrote: Fri Feb 28, 2025 7:31 pm
Code: Select all
ReplaceString(Texte, Char, "e", #PB_String_InPlace)
You can use the "i" variable as the "StartPosition" parameter to make a shift to the desired position, instead of searching from the beginning.
You lose the letter register (case).
I used character-by-character string parsing
Code: Select all
Procedure ReplaceAccent(*c.Character)
If *c = 0 Or *c\c = 0
ProcedureReturn 0
EndIf
While *c\c
Select *c\c
Case 'é', 'è', 'ê', 'ë'
*c\c = 'e'
Case 'à', 'ä', 'â'
*c\c = 'a'
Case 'ù', 'ü', 'û'
*c\c = 'u'
Case 'ï', 'î'
*c\c = 'i'
Case 'ô', 'ö'
*c\c = 'o'
Case 'É', 'È', 'Ê', 'Ë'
*c\c = 'E'
Case 'À', 'Ä', 'Â'
*c\c = 'A'
Case 'Ù', 'Ü', 'Û'
*c\c = 'U'
Case 'Ï', 'Î'
*c\c = 'I'
Case 'Ô', 'Ö'
*c\c = 'O'
EndSelect
*c + SizeOf(Character)
Wend
EndProcedure
tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
ReplaceAccent(@tmp)
Debug tmp
Can be specified by
ASCII-Codes ranges
Code: Select all
Procedure ReplaceAccent(*c.Character)
If *c = 0 Or *c\c = 0
ProcedureReturn 0
EndIf
While *c\c
Select *c\c
Case 'è' To 'ë' ; 232-235
*c\c = 'e'
Case 'à' To 'å' ; 224-229
*c\c = 'a'
Case 'ù' To 'ü' ; 249-252
*c\c = 'u'
Case 'ì' To 'ï' ; 236-239
*c\c = 'i'
Case 'ò' To 'ö' ; 242-246
*c\c = 'o'
Case 'È' To 'Ë' ; 200-203
*c\c = 'E'
Case 'À' To 'Å' ; 192-197
*c\c = 'A'
Case 'Ù' To 'Ü' ; 217-220
*c\c = 'U'
Case 'Ì' To 'Ï' ; 204-207
*c\c = 'I'
Case 'Ò' To 'Ö' ; 210-214
*c\c = 'O'
EndSelect
*c + SizeOf(Character)
Wend
EndProcedure
Define tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
ReplaceAccent(@tmp)
Debug tmp
ASCII-Codes
Re: Remove diacritic chars
Posted: Fri Feb 28, 2025 8:40 pm
by SMaag
Code: Select all
Procedure.s ReplaceAccents(String$)
Protected *c.Character
*c = @String$
If *c
While *c\c
If *c\c >= 192 ; Accents start at 192 with 'À'
Select *c\c
Case 192 To 198 ; 'A' with different accents
*c\c = 'A'
Case 224 To 230
*c\c = 'a'
Case 200 To 203 ; 'E' with different accents
*c\c = 'E'
Case 232 To 235
*c\c = 'e'
Case 204 To 207 ; 'I' with different accents
*c\c = 'I'
Case 236 To 239
*c\c = 'i'
Case 210 To 214 ; 'O' with different accents
*c\c = 'O'
Case 242 To 246
*c\c = 'o'
Case 217 To 220 ; 'U' with different accents
*c\c = 'U'
Case 249 To 252
*c\c = 'u'
Case 221 ; 'Y' with different accents
*c\c = 'Y'
Case 253, 255
*c\c = 'y'
Case 209 ; 'N' with different accents
*c\c = 'N'
Case 241
*c\c = 'n'
EndSelect
EndIf
*c + SizeOf(Character)
Wend
EndIf
ProcedureReturn String$
EndProcedure
tmp.s="dféöке'É', 'È', 'Ê', 'Ë"
res$ = ReplaceAccents(tmp)
Debug res$
Re: Remove diacritic chars
Posted: Fri Feb 28, 2025 10:29 pm
by idle
I don't know why people don't just use the uft16 module.
https://github.com/idle-PB/UTF16/blob/main/UTF16a.pb
Unicode is complicated and believe it or not this module is a shortcut to process UTF16 and yes it's long but that's so it's Fast which is the main goal for string processing.
UTF16 StrCmp(s3,s4,#CASESIMPLE) 84 ms for 1,000,000
PB CompareMemoryString(@s3,@s4,#PB_String_NoCase) 101 ms for 1,000,000
UTF16 StrLCase / StrUcase 63 ms for 1,000,000
PB LCase / UCase 410 ms for 1,000,000
Re: Remove diacritic chars
Posted: Sat Mar 01, 2025 10:12 pm
by jacdelad
SMaag wrote: Fri Feb 28, 2025 7:10 pm
I want to remove all dacritic chars from a string, e.g. á becomes a and so on. Now, I don't want to use a table or list to do so
Why not a simple ReplcaceChar Function with a Select Case to exchange the Chars.
- it is simple and everyone can see how it works, everyone can modify for personal use
- it is cross plattform
- it is fast
Because it's a lot of chars to be replaced.
Re: Remove diacritic chars
Posted: Sat Mar 01, 2025 11:50 pm
by idle
the module also does it in place without copying the string
so the ucase lcase and strip functions are considerably faster and branchless compared to using select
maybe I should just turn it into a user lib so the eyes don't burn from lots of code and lookup tables
I invite anyone to do it faster, go ahead and make my ascii day!
Code: Select all
Procedure StrStripAccents_(*in.Unicode) ;changes the case of the string inplace
Protected *char.Unicode,tchar.u
*char = *in
While *char\u
If casemappingCD(*char\u)\cdl <> 0
*char\u = casemappingCD(*char\u)\cdl
EndIf
*char+2
Wend
EndProcedure
StrStripAccents = @StrStripAccents_() ;prototype so you pass in string without Address = no copy or need to parse the string for its length
Re: Remove diacritic chars
Posted: Sun Mar 02, 2025 7:09 am
by jacdelad
Hi idle,
after all this promotion I'll take a closer look especially at your module.

Re: Remove diacritic chars
Posted: Sun Mar 02, 2025 5:57 pm
by Zapman
I don't know if it can help, but I've made that :
Code: Select all
Procedure.s RemoveAccents(Text$)
; Function to remove accents from a string. By Zapman.
If Text$
Protected length = Len(Text$) * 2
Protected char$, ct, Result$
Protected NormalizedText$ = Space(Length)
;
Length = FoldString_(#MAP_COMPOSITE, @Text$, - 1, @NormalizedText$, Length) - 1
;
If Length > 0 And Length <> Len(Text$)
For ct = 1 To Length
char$ = Mid(NormalizedText$, ct, 1)
If Asc(char$) < 128
Result$ + char$
EndIf
Next
Else
Result$ = Text$
EndIf
;
ProcedureReturn Result$
EndIf
EndProcedure
The FoldString() function separates accented characters into two characters: the first without an accent and the second as a diacritic.
Then the string is examined character by character and the diacritics are removed.
Re: Remove diacritic chars
Posted: Mon Mar 03, 2025 11:50 am
by Zapman
The preceeding version of my procedure works well for most of european countries, but it won't work for russian, for example.
Here is an improved version which should work for every language:
Code: Select all
Procedure.s RemoveAccents(Text$)
; Function to remove accents from a string. By Zapman.
If Text$
Protected length = Len(Text$) * 2
Protected Ochar$, Nchar$, OPos, NPos, Result$
Protected NormalizedText$ = Space(Length)
;
; FoldString_() will replace each accentuated character by a pair of characters
; as this: (NonAccentuatedCharacter) + (diacritic)
Length = FoldString_(#MAP_COMPOSITE, @Text$, - 1, @NormalizedText$, Length) - 1
;
; Examine the result:
If Length > 0 And Length <> Len(Text$)
For NPos = 1 To Length
OPos + 1
Ochar$ = Mid(Text$, OPos, 1)
Nchar$ = Mid(NormalizedText$, NPos, 1)
Result$ + Nchar$
If Ochar$ <> Nchar$
; If the character has been replaced, it means that the following character
; contains the diacritic. Jump over it:
NPos + 1
EndIf
Next
Else
; No replacement has been made.
Result$ = Text$
EndIf
;
ProcedureReturn Result$
EndIf
EndProcedure
Debug "Demo: ''Hétérogène'' --> ''" + RemoveAccents("Hétérogène") + "''."
Here is the Microsoft documentation about FoldString():
https://learn.microsoft.com/en-us/windo ... oldstringw