Convert an ASCII file with accented characters to UTF-8

Posted: Tue Feb 09, 2016 10:29 pm
by loulou2522
Can someone help me convert a file in ASCII encoding that contains accented characters like "é" to UTF-8 encoding, preserving characters like é?
Thanks in advance

Re: Convert an ASCII file with accented characters to UTF-8

Posted: Wed Feb 10, 2016 2:07 am
by Demivec
Just to clarify, even though the file is not in a Unicode encoding, it definitely isn't in ASCII either. ASCII only covers values 0 to 127. If it is a single-byte encoding, it is probably an ANSI or OEM codepage.

If you know the current encoding you would simply make a substitution table. You would read a byte at a time, look up the equivalent Unicode value, and write the new values to a new file as UTF-8.

So if the byte was 233 (one of the possible encodings for "é") you would substitute the Unicode character value $E9 and write this out in UTF-8 encoding (the two bytes $C3 $A9).
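As a quick check of that single substitution, here is a minimal sketch that encodes the Unicode character $E9 as UTF-8 in memory and shows the resulting bytes. It assumes a Unicode-mode compile; the buffer name and size are arbitrary choices for the illustration.

```purebasic
; Compile as Unicode
; Encode U+00E9 ("é") as UTF-8 into a small buffer and inspect the bytes.
*buffer = AllocateMemory(8)             ; hypothetical scratch buffer, more than large enough
If *buffer
  PokeS(*buffer, Chr($E9), -1, #PB_UTF8) ; write the character as UTF-8
  Debug Hex(PeekA(*buffer))              ; C3
  Debug Hex(PeekA(*buffer + 1))          ; A9
  FreeMemory(*buffer)
EndIf
```

The single source byte 233 becomes two bytes in the output file, which is why the file has to be converted rather than just relabelled.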

Here is a simple program layout:

Code:

;Compile as Unicode

Dim unicodeSubstitutions$(255) ;an array of 256 strings indexed by the codepage byte value (0 to 255)

;
;Fill the substitution array with the appropriate data for the codepage/encoding in question.
;
;You will have to write this part, something like this should work though...
;
; DataSection
;  chardata:
;  Data.u  ....,....,....,.... ;x 256
; EndDataSection
;
; Restore chardata
; For i = 0 To 255: Read.u charValue: unicodeSubstitutions$(i) = Chr(charValue): Next i

sourceFileID = 0
outputFileID = 1

;Stop if either file cannot be opened
If ReadFile(sourceFileID, "SourceFile.txt") = 0
  End
EndIf
If CreateFile(outputFileID, "OutputFile.txt", #PB_UTF8) = 0 ;all output to this file will be in UTF-8 format
  CloseFile(sourceFileID)
  End
EndIf

;Read one byte at a time and write its Unicode equivalent; table entries left empty write nothing
While Not Eof(sourceFileID)
  sourceCharByte = ReadAsciiCharacter(sourceFileID)
  WriteString(outputFileID, unicodeSubstitutions$(sourceCharByte))
Wend

CloseFile(sourceFileID)
CloseFile(outputFileID)

If you don't know the current encoding, you can look at Wikipedia to check the various possible codepages and eliminate the ones that don't match up. You can also find the corresponding Unicode values for each codepage there.

Here are some links:

A popular Windows codepage, ANSI 1252.
A list of OEM codepages.
A list of many codepages including IBM EBCDIC, various ISO formats, and other ANSI and OEM pages.

Re: Convert an ASCII file with accented characters to UTF-8

Posted: Wed Feb 10, 2016 8:21 am
by loulou2522
Hi Demivec,
Thanks for your answer.
Can you give me an example of how to fill in

Code:

; DataSection
;  chardata:
;  Data.u  ....,....,....,.... ;x 256
; EndDataSection

because I don't understand what to put in the data area.
Thanks in advance

Re: Convert an ASCII file with accented characters to UTF-8

Posted: Wed Feb 10, 2016 10:57 am
by srod
If you are on Windows and you know the ANSI codepage, then you could use MultiByteToWideChar_() to convert to Unicode. From there, you can use PokeS(.., -1, #PB_UTF8) to write in UTF-8 format (or WriteString() etc. if writing to a file). Make sure you compile in Unicode mode, but read the original string from the file as raw data so as not to destroy the ANSI encoding.
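A minimal sketch of that approach, under some assumptions: Windows only, compiled in Unicode mode, a source file in codepage 1252, and placeholder file names. The whole file is read as raw bytes, converted in one call, and written back out as UTF-8. It relies on the fact that a single-byte codepage never produces more wide characters than it has bytes; a file containing NUL bytes would be cut short at the first one when the buffer is read back as a string.

```purebasic
; Compile as Unicode (Windows only)
#SourceCodePage = 1252 ; assumption: change this to the codepage of your file

If ReadFile(0, "SourceFile.txt")
  length = Lof(0)
  If length > 0
    *raw  = AllocateMemory(length)           ; raw bytes, untouched by any string conversion
    *wide = AllocateMemory((length + 1) * 2) ; room for the UTF-16 text plus a terminating null
    If *raw And *wide
      ReadData(0, *raw, length)
      ; Returns the number of UTF-16 characters written to *wide
      chars = MultiByteToWideChar_(#SourceCodePage, 0, *raw, length, *wide, length)
      text$ = PeekS(*wide, chars, #PB_Unicode)
      If CreateFile(1, "OutputFile.txt")
        WriteString(1, text$, #PB_UTF8)      ; write the converted text as UTF-8
        CloseFile(1)
      EndIf
    EndIf
    If *raw  : FreeMemory(*raw)  : EndIf
    If *wide : FreeMemory(*wide) : EndIf
  EndIf
  CloseFile(0)
EndIf
```

Compared with a hand-written substitution table, this leaves the mapping to Windows, so switching codepages only means changing one constant.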

Re: Convert an ASCII file with accented characters to UTF-8

Posted: Wed Feb 10, 2016 11:06 am
by Demivec
loulou2522 wrote:
Hi Demivec,
Thanks for your answer.
Can you give me an example of how to fill in

Code:

; DataSection
;  chardata:
;  Data.u  ....,....,....,.... ;x 256
; EndDataSection

because I don't understand what to put in the data area.
Thanks in advance

If you were going to use the Windows ANSI 1252 codepage you would fill it in like this:

Code:

DataSection
  chardata:
  Data.u $0000, $0001, $0002, $0003, $0004, $0005, $0006, $0007, $0008, $0009, $000A, $000B, $000C, $000D, $000E, $000F
  Data.u $0010, $0011, $0012, $0013, $0014, $0015, $0016, $0017, $0018, $0019, $001A, $001B, $001C, $001D, $001E, $001F
  Data.u $0020, $0021, $0022, $0023, $0024, $0025, $0026, $0027, $0028, $0029, $002A, $002B, $002C, $002D, $002E, $002F
  Data.u $0030, $0031, $0032, $0033, $0034, $0035, $0036, $0037, $0038, $0039, $003A, $003B, $003C, $003D, $003E, $003F
  Data.u $0040, $0041, $0042, $0043, $0044, $0045, $0046, $0047, $0048, $0049, $004A, $004B, $004C, $004D, $004E, $004F
  Data.u $0050, $0051, $0052, $0053, $0054, $0055, $0056, $0057, $0058, $0059, $005A, $005B, $005C, $005D, $005E, $005F
  Data.u $0060, $0061, $0062, $0063, $0064, $0065, $0066, $0067, $0068, $0069, $006A, $006B, $006C, $006D, $006E, $006F
  Data.u $0070, $0071, $0072, $0073, $0074, $0075, $0076, $0077, $0078, $0079, $007A, $007B, $007C, $007D, $007E, $007F
  Data.u $20AC, $0000, $201A, $0192, $201E, $2026, $2020, $2021, $02C6, $2030, $0160, $2039, $0152, $0000, $017D, $0000
  Data.u $0000, $2018, $2019, $201C, $201D, $2022, $2013, $2014, $02DC, $2122, $0161, $203A, $0153, $0000, $017E, $0178
  Data.u $00A0, $00A1, $00A2, $00A3, $00A4, $00A5, $00A6, $00A7, $00A8, $00A9, $00AA, $00AB, $00AC, $00AD, $00AE, $00AF
  Data.u $00B0, $00B1, $00B2, $00B3, $00B4, $00B5, $00B6, $00B7, $00B8, $00B9, $00BA, $00BB, $00BC, $00BD, $00BE, $00BF
  Data.u $00C0, $00C1, $00C2, $00C3, $00C4, $00C5, $00C6, $00C7, $00C8, $00C9, $00CA, $00CB, $00CC, $00CD, $00CE, $00CF
  Data.u $00D0, $00D1, $00D2, $00D3, $00D4, $00D5, $00D6, $00D7, $00D8, $00D9, $00DA, $00DB, $00DC, $00DD, $00DE, $00DF
  Data.u $00E0, $00E1, $00E2, $00E3, $00E4, $00E5, $00E6, $00E7, $00E8, $00E9, $00EA, $00EB, $00EC, $00ED, $00EE, $00EF
  Data.u $00F0, $00F1, $00F2, $00F3, $00F4, $00F5, $00F6, $00F7, $00F8, $00F9, $00FA, $00FB, $00FC, $00FD, $00FE, $00FF
EndDataSection

If you were going to fill it in for another codepage, you would look up the Unicode value for each character in that codepage and write those values instead. You would write all 256 values in order, starting with the value for byte 0 (hex $00) and ending with the value for byte 255 (hex $FF).
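For codepage 1252 specifically, only bytes $80 to $9F differ from their Unicode code points, so the table can be shortened. The following is a sketch of that variation rather than a replacement for the full table: it fills the array with identity values first and then overwrites the 32 special entries. The label name cp1252High is made up for this example.

```purebasic
; Compile as Unicode
Dim unicodeSubstitutions$(255)
Define i, charValue.u

; In codepage 1252, bytes $00-$7F and $A0-$FF have the same value as their Unicode code point
For i = 0 To 255
  unicodeSubstitutions$(i) = Chr(i)
Next i

; Overwrite the 32 entries that differ ($0000 marks bytes that are undefined in the codepage)
Restore cp1252High
For i = $80 To $9F
  Read.u charValue
  unicodeSubstitutions$(i) = Chr(charValue)
Next i

Debug Hex(Asc(unicodeSubstitutions$($80))) ; 20AC (euro sign)
Debug Hex(Asc(unicodeSubstitutions$($E9))) ; E9 (é)

DataSection
  cp1252High:
  Data.u $20AC, $0000, $201A, $0192, $201E, $2026, $2020, $2021, $02C6, $2030, $0160, $2039, $0152, $0000, $017D, $0000
  Data.u $0000, $2018, $2019, $201C, $201D, $2022, $2013, $2014, $02DC, $2122, $0161, $203A, $0153, $0000, $017E, $0178
EndDataSection
```

This shortcut does not carry over to OEM codepages, where most of the upper half differs from Unicode, so those still need a full 256-value table.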


Do you know which codepage your file is encoded with?

Re: Convert an ASCII file with accented characters to UTF-8

Posted: Wed Feb 10, 2016 12:55 pm
by loulou2522
Thanks Demivec,
That works perfectly with French CSV files exported from Excel.
Best regards