
Conversion of Codepages

Posted: Sat Nov 18, 2023 9:21 pm
by miskox
Hello all!

I would like to convert a string in CP1250 to CP437 so that diacritic characters are replaced with their 'basic' characters:

For example:

Code:

Š would become S
Č would become C
Ü would become U
.
.
.
the same goes for lowercase characters
I guess the correct MS API would be one of MultiByteToWideChar and WideCharToMultiByte.
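
A minimal Windows-only sketch of that API route: decode the CP1250 bytes to UTF-16 with MultiByteToWideChar, then encode to CP437 with WideCharToMultiByte. The procedure name `CP1250ToCP437` and the sample bytes are illustrative assumptions. With flags set to 0 the encoder may use best-fit substitutions, but whether a given letter (for example Š) degrades to its base letter depends on the system's CP437 table, so the result should be checked on the target machine.

```purebasic
; Windows-only sketch: CP1250 bytes -> UTF-16 -> CP437 bytes via the Win32 API.
; CP1250ToCP437 is a hypothetical helper name, not part of PureBasic.
CompilerIf #PB_Compiler_OS = #PB_OS_Windows

  Procedure.i CP1250ToCP437(*src, srcLen, *dst, dstLen)
    Protected *wide, wideLen, written
    ; A single-byte code page yields at most one UTF-16 unit per input byte.
    *wide = AllocateMemory(srcLen * 2 + 2)
    If *wide = 0 : ProcedureReturn 0 : EndIf
    wideLen = MultiByteToWideChar_(1250, 0, *src, srcLen, *wide, srcLen)
    ; Flags = 0 leaves best-fit mapping enabled (assumed to drop the diacritics).
    written = WideCharToMultiByte_(437, 0, *wide, wideLen, *dst, dstLen, #Null, #Null)
    FreeMemory(*wide)
    ProcedureReturn written ; number of bytes stored in *dst, 0 on failure
  EndProcedure

  DataSection
    Sample:
    Data.a $C8, $8A, $8E, $E8, $9A, $9E ; "ČŠŽčšž" as raw CP1250 bytes
  EndDataSection

  Define *out = AllocateMemory(16)
  Define n = CP1250ToCP437(?Sample, 6, *out, 16)
  Debug PeekS(*out, n, #PB_Ascii)
  FreeMemory(*out)

CompilerEndIf
```

Working on raw byte buffers rather than PureBasic strings avoids any implicit conversion, since current PureBasic strings are stored as UTF-16 internally.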

Code:

; input file has CP1250 characters in it and would be opened as #PB_Ascii
input_string$="ČŠŽčšž"
; ...conversion code here...
output_string$="CSZcsz"
Any ideas?

Thanks.
Saso

Re: Conversion of Codepages

Posted: Sun Nov 19, 2023 6:20 am
by Demivec
No need to use API. You could instead do a simple lookup and replace process between the corresponding code page values.

It would only take an array of 256 bytes that is indexed with the source code page value, and it would contain the destination code page value you want to translate to.

Code:

;Sample code outline of conversion process
Dim CP1250_437.a(255) ;holds translation equivalents; .a is unsigned (0..255)

Structure byte_array
  a.a[0]
EndStructure

Define *buffer.byte_array ;buffer ptr
Define index

;Initialize array with the desired CP437 value for each of the 256 CP1250 values
;This can be done with values from a data section or file for instance.
;i.e. CP1250 value $8A would become CP437 value $53
CP1250_437($8A) = $53 ;Š -> S
CP1250_437($C8) = $43 ;Č -> C
CP1250_437($DC) = $55 ;Ü -> U


;Read file string data into a memory buffer

;For each byte in the buffer replace it with the translated value
;(.b is signed, so bytes above $7F would give a negative index; .a avoids that)
*buffer\a[index] = CP1250_437(*buffer\a[index])
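
The outline above leaves the file handling open. A minimal sketch of the whole round trip might look like the following; the procedure name `TranslateFile` and the file names are hypothetical, and the table is filled as an identity map with only the three example overrides, so a real table would need all 256 entries set.

```purebasic
; Hypothetical helper: pass every byte of a file through a 256-entry table.
Procedure.i TranslateFile(inFile$, outFile$, Array table.a(1))
  Protected hIn, hOut, size, i, ok = #False
  Protected *buf

  hIn = ReadFile(#PB_Any, inFile$)
  If hIn
    size = Lof(hIn)
    If size > 0
      *buf = AllocateMemory(size)
      If *buf
        ReadData(hIn, *buf, size)
        For i = 0 To size - 1
          ; PeekA returns 0..255, so it is always a valid table index
          PokeA(*buf + i, table(PeekA(*buf + i)))
        Next
        hOut = CreateFile(#PB_Any, outFile$)
        If hOut
          WriteData(hOut, *buf, size)
          CloseFile(hOut)
          ok = #True
        EndIf
        FreeMemory(*buf)
      EndIf
    EndIf
    CloseFile(hIn)
  EndIf
  ProcedureReturn ok
EndProcedure

; Usage: identity table plus a few overrides
Dim CP1250_437.a(255)
Define i
For i = 0 To 255 : CP1250_437(i) = i : Next
CP1250_437($8A) = 'S' ; Š
CP1250_437($C8) = 'C' ; Č
CP1250_437($DC) = 'U' ; Ü

If TranslateFile("input.txt", "output.txt", CP1250_437())
  Debug "done"
EndIf
```

Reading the file with ReadData rather than ReadString keeps the bytes untouched, so the table sees exactly the CP1250 values that are on disk.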

Re: Conversion of Codepages

Posted: Sun Nov 19, 2023 6:40 pm
by miskox
Thanks. Will check this solution.

Saso

Re: Conversion of Codepages

Posted: Sun Nov 19, 2023 10:20 pm
by juergenkulow
cp1250 to Unicode table
cp437_DOSLatinUS to Unicode table

Code:

; Generate CP1250_437 from Unicode CP1250.TXT and CP437.TXT
Dim CP1250.Unicode(255)
File=OpenFile(#PB_Any,"/tmp/CP1250.TXT",#PB_Ascii) ; Please Download and adapt.
While Not Eof(File)
  s.s=ReadString(File)
  first=Asc(Left(s,1))
  If first>='0' And first<='9'
    ; Debug Mid(s,3,2)+":"+Mid(s,8,4)
    If "    "=Mid(s,8,4)
      CP1250(Val("$"+Mid(s,3,2)))\u=' '
    Else
      CP1250(Val("$"+Mid(s,3,2)))\u=Val("$"+Mid(s,8,4))
    EndIf   
  EndIf   
Wend
Dim CP437.Unicode(255)
File=OpenFile(#PB_Any,"/tmp/CP437.TXT",#PB_Ascii) ; Please Download and adapt.
While Not Eof(File)
  s.s=ReadString(File)
  first=Asc(Left(s,1))
  If first>='0' And first<='9'
    ; Debug Mid(s,3,2)+":"+Mid(s,8,4)
    If "    "=Mid(s,8,4)
      CP437(Val("$"+Mid(s,3,2)))\u=' '
    Else
      CP437(Val("$"+Mid(s,3,2)))\u=Val("$"+Mid(s,8,4))
    EndIf   
  EndIf   
Wend

t.s=""
Dim CP1250_437.a(255) ;.a 0..255
For i=0 To 255
  If CP1250(i)\u=CP437(i)\u
    CP1250_437(i)=i
    If i>35 : t+Chr(CP437(i)\u) :EndIf 
  Else
    j=128
    While j<=255 And CP1250(i)\u<>CP437(j)\u 
      j+1
    Wend 
    If j<=255
      CP1250_437(i)=j
      t+Chr(CP437(j)\u)
    Else
      CP1250_437(i)='X'
      t+"X"
    EndIf   
  EndIf 
Next   
  
; Debug t
s="Dim CP1250_437.a(255) "+#CRLF$
s+"For i=0 to 127 : CP1250_437(i)=i : Next "+#CRLF$
For i=128 To 255
  If CP1250_437(i)<>'X'
    s+"CP1250_437("+Str(i)+")="+CP1250_437(i)+" ; " +Chr(CP1250(i)\u)+#CRLF$
  Else
    s+"CP1250_437("+Str(i)+")='X' ; "+Chr(CP1250(i)\u)+#CRLF$
  EndIf  
Next 

Debug s

DataSection
  CP1250:
  Data.s " !"+Chr(34)+"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€ ‚ „…†‡ ‰Š‹ŚŤŽŹ ‘’“”•–— ™š›śťžź ˇ˘Ł¤Ą¦§¨©Ş«¬­®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙"
  CP437:
  Data.s " !"+Chr(34)+"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ "  
  CP1250_437: 
  Data.s " !"+Chr(34)+"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX«¬XXX°±XXXµX·XXX»XXXXXXXXÄXXÇXÉXXXXXXXXXXXXÖXXXXXÜXXßXáâXäXXçXéXëXíîXXXXóôXö÷XXúXüXXX"
  
EndDataSection

Code:

; Modified generated code 
Dim CP1250_437.a(255) ; 
For i=0 To 127 : CP1250_437(i)=i : Next 
CP1250_437(128)='X' ; €
CP1250_437(129)='X' ;  
CP1250_437(130)='X' ; ‚
CP1250_437(131)='X' ;  
CP1250_437(132)='X' ; „
CP1250_437(133)='X' ; …
CP1250_437(134)='X' ; †
CP1250_437(135)='X' ; ‡
CP1250_437(136)='X' ;  
CP1250_437(137)='X' ; ‰
CP1250_437(138)='S' ; Š
CP1250_437(139)='X' ; ‹
CP1250_437(140)='S' ; Ś
CP1250_437(141)='T' ; Ť
CP1250_437(142)='Z' ; Ž
CP1250_437(143)='Z' ; Ź
CP1250_437(144)='X' ;  
CP1250_437(145)='X' ; ‘
CP1250_437(146)='X' ; ’
CP1250_437(147)='X' ; “
CP1250_437(148)='X' ; ”
CP1250_437(149)='X' ; •
CP1250_437(150)='X' ; –
CP1250_437(151)='X' ; —
CP1250_437(152)='X' ;  
CP1250_437(153)='X' ; ™
CP1250_437(154)='s' ; š
CP1250_437(155)='X' ; ›
CP1250_437(156)='s' ; ś
CP1250_437(157)='t' ; ť
CP1250_437(158)='z' ; ž
CP1250_437(159)='z' ; ź
CP1250_437(160)=255 ;  
CP1250_437(161)='X' ; ˇ
CP1250_437(162)='X' ; ˘
CP1250_437(163)='L' ; Ł
CP1250_437(164)='X' ; ¤
CP1250_437(165)='A' ; Ą
CP1250_437(166)='X' ; ¦
CP1250_437(167)='X' ; §
CP1250_437(168)='X' ; ¨
CP1250_437(169)='X' ; ©
CP1250_437(170)='S' ; Ş
CP1250_437(171)=174 ; «
CP1250_437(172)=170 ; ¬
CP1250_437(173)='X' ; ­
CP1250_437(174)='X' ; ®
CP1250_437(175)='Z' ; Ż
CP1250_437(176)=248 ; °
CP1250_437(177)=241 ; ±
CP1250_437(178)='X' ; ˛
CP1250_437(179)='l' ; ł
CP1250_437(180)='X' ; ´
CP1250_437(181)=230 ; µ
CP1250_437(182)='X' ; ¶
CP1250_437(183)=250 ; ·
CP1250_437(184)='X' ; ¸
CP1250_437(185)='a' ; ą
CP1250_437(186)='s' ; ş
CP1250_437(187)=175 ; »
CP1250_437(188)='L' ; Ľ
CP1250_437(189)='X' ; ˝
CP1250_437(190)='l' ; ľ
CP1250_437(191)='z' ; ż
CP1250_437(192)='R' ; Ŕ
CP1250_437(193)='A' ; Á
CP1250_437(194)='A' ; Â
CP1250_437(195)='A' ; Ă
CP1250_437(196)=142 ; Ä
CP1250_437(197)='L' ; Ĺ
CP1250_437(198)='C' ; Ć
CP1250_437(199)=128 ; Ç
CP1250_437(200)='C' ; Č
CP1250_437(201)=144 ; É
CP1250_437(202)='E' ; Ę
CP1250_437(203)='E' ; Ë
CP1250_437(204)='E' ; Ě
CP1250_437(205)='I' ; Í
CP1250_437(206)='I' ; Î
CP1250_437(207)='D' ; Ď
CP1250_437(208)='D' ; Đ
CP1250_437(209)='N' ; Ń
CP1250_437(210)='N' ; Ň
CP1250_437(211)='O' ; Ó
CP1250_437(212)='O' ; Ô
CP1250_437(213)='O' ; Ő
CP1250_437(214)=153 ; Ö
CP1250_437(215)='x' ; ×
CP1250_437(216)='R' ; Ř
CP1250_437(217)='U' ; Ů
CP1250_437(218)='U' ; Ú
CP1250_437(219)='U' ; Ű
CP1250_437(220)=154 ; Ü
CP1250_437(221)='Y' ; Ý
CP1250_437(222)='T' ; Ţ
CP1250_437(223)=225 ; ß
CP1250_437(224)='r' ; ŕ
CP1250_437(225)=160 ; á
CP1250_437(226)=131 ; â
CP1250_437(227)='a' ; ă
CP1250_437(228)=132 ; ä
CP1250_437(229)='l' ; ĺ
CP1250_437(230)='c' ; ć
CP1250_437(231)=135 ; ç
CP1250_437(232)='c' ; č
CP1250_437(233)=130 ; é
CP1250_437(234)='e' ; ę
CP1250_437(235)=137 ; ë
CP1250_437(236)='e' ; ě
CP1250_437(237)=161 ; í
CP1250_437(238)=140 ; î
CP1250_437(239)='d' ; ď
CP1250_437(240)='d' ; đ
CP1250_437(241)='n' ; ń
CP1250_437(242)='n' ; ň
CP1250_437(243)=162 ; ó
CP1250_437(244)=147 ; ô
CP1250_437(245)='o' ; ő
CP1250_437(246)=148 ; ö
CP1250_437(247)=246 ; ÷
CP1250_437(248)='r' ; ř
CP1250_437(249)='u' ; ů
CP1250_437(250)=163 ; ú
CP1250_437(251)='u' ; ű
CP1250_437(252)=129 ; ü
CP1250_437(253)='y' ; ý
CP1250_437(254)='t' ; ţ
CP1250_437(255)='X' ; ˙

Structure Ascii_array : a.a[0] : EndStructure
Define *buffer.Ascii_array 
*buffer.Ascii_array=AllocateMemory(256)
For index=0 To 255 
  *buffer\a[index]=index
  *buffer\a[index] = CP1250_437(*buffer\a[index] )
Next   
ShowMemoryViewer(*buffer,256)

Code:

000000000146D9D8  00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F  ................
000000000146D9E8  10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F  ................
000000000146D9F8  20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F   !"#$%&'()*+,-./
000000000146DA08  30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F  0123456789:;<=>?
000000000146DA18  40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F  @ABCDEFGHIJKLMNO
000000000146DA28  50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F  PQRSTUVWXYZ[\]^_
000000000146DA38  60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F  `abcdefghijklmno
000000000146DA48  70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F  pqrstuvwxyz{|}~
000000000146DA58  58 58 58 58 58 58 58 58 58 58 53 58 53 54 5A 5A  XXXXXXXXXXSXSTZZ
000000000146DA68  58 58 58 58 58 58 58 58 58 58 73 58 73 74 7A 7A  XXXXXXXXXXsXstzz
000000000146DA78  FF 58 58 4C 58 41 58 58 58 58 53 AE AA 58 58 5A  ÿXXLXAXXXXS®ªXXZ
000000000146DA88  F8 F1 58 6C 58 E6 58 FA 58 61 73 AF 4C 58 6C 7A  øñXlXæXúXas¯LXlz
000000000146DA98  52 41 41 41 8E 4C 43 80 43 90 45 45 45 49 49 44  RAAAŽLC€CEEEIID
000000000146DAA8  44 4E 4E 4F 4F 4F 99 78 52 55 55 55 9A 59 54 E1  DNNOOO™xRUUUšYTá
000000000146DAB8  72 A0 83 61 84 6C 63 87 63 82 65 89 65 A1 8C 64  r ƒa„lc‡c‚e‰e¡Œd
000000000146DAC8  64 6E 6E A2 93 6F 94 F6 72 75 A3 75 81 79 74 58  dnn¢“o”öru£uytX

Re: Conversion of Codepages

Posted: Mon Nov 20, 2023 1:57 am
by idle
For a one-off case that's a suitable solution, but if you ever need to do it for other code pages it might get tedious.

I've added a function that strips accents off characters to the UTF16 module.

It's in the file UTF16a.pb:

https://github.com/idle-PB/UTF16

Note that the function works in place.

Code:

XIncludeFile "UTF16a.pb"
UseModule UTF16
input_string$="ČŠŽčšž"
StrStripAccents(input_string$) 
Debug input_string$ ; prints CSZcsz

Re: Conversion of Codepages

Posted: Mon Nov 20, 2023 3:14 pm
by AZJIO
I had trouble opening a CP1251 ANSI file on Linux because the English encoding was used. Now I understand how to replace the bytes 0 to 255 with UTF-8 characters.
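
As a rough sketch of that idea, assuming a Unicode build of PureBasic (5.50 or later): each CP1251 byte is looked up in a 256-entry table of Unicode code points and appended to a normal string, which can then be written out as UTF-8. The procedure name `DecodeBytes` is hypothetical, and only the Cyrillic letters are filled in here; every other byte above $7F is mapped to "?" as a placeholder.

```purebasic
; Sketch: decode CP1251 bytes into a PureBasic (Unicode) string via a lookup table.
Dim CP1251.u(255)
Define i
For i = 0 To 127 : CP1251(i) = i : Next                  ; ASCII is unchanged
For i = 128 To 255 : CP1251(i) = '?' : Next              ; placeholder for unmapped bytes
For i = $C0 To $FF : CP1251(i) = $0410 + i - $C0 : Next  ; А..я (U+0410..U+044F)
CP1251($A8) = $0401 ; Ё
CP1251($B8) = $0451 ; ё

; Hypothetical helper: build a string from raw bytes using the table
Procedure.s DecodeBytes(*buf, length, Array table.u(1))
  Protected i, result$
  For i = 0 To length - 1
    result$ + Chr(table(PeekA(*buf + i)))
  Next
  ProcedureReturn result$
EndProcedure

DataSection
  Privet:
  Data.a $CF, $F0, $E8, $E2, $E5, $F2 ; "Привет" in CP1251
EndDataSection

Define text$ = DecodeBytes(?Privet, 6, CP1251())
Debug text$ ; Привет
; WriteString(file, text$, #PB_UTF8) would then store the text as UTF-8
```

The same table layout works for any single-byte code page; only the entries for 128 to 255 change.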

Re: Conversion of Codepages

Posted: Mon Nov 20, 2023 7:02 pm
by miskox
Thank you for all the answers. I will check/study them. Maybe over the weekend.

Thanks again.
Saso