Convert an ASCII file with accented characters to UTF-8


Post by loulou2522 »

Can someone help me convert a file in ASCII encoding that contains accented characters such as "é" to UTF-8 encoding, preserving characters like é?
Thanks in advance

Post by Demivec »

Just to clarify, even though the file is not in a Unicode encoding, it definitely isn't in an ASCII encoding either. ASCII only covers values 0 to 127. If it is a single-byte encoding, it is probably an ANSI or OEM codepage.

If you know the current encoding, you would simply make a substitution table. You would read one byte at a time, look up the equivalent Unicode value, and write the new values out to a new file as UTF-8.

So if the byte was 233 (one of the possible encodings for "é") you would substitute the Unicode character value $E9 and write this out in UTF-8 encoding.
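
As a quick check of that mapping, here is a minimal sketch (an illustration only, assuming a Unicode compile and an arbitrary 8-byte scratch buffer) that encodes the character value $E9 as UTF-8 with PokeS() and displays the bytes that result:

```purebasic
;Compile as Unicode
*buffer = AllocateMemory(8)                ;scratch buffer, zero-filled by default
If *buffer
  PokeS(*buffer, Chr($E9), -1, #PB_UTF8)   ;write "é" (U+00E9) as a UTF-8 string
  Debug Hex(PeekA(*buffer))                ;first byte of the UTF-8 sequence
  Debug Hex(PeekA(*buffer + 1))            ;second byte of the UTF-8 sequence
  FreeMemory(*buffer)
EndIf
```

U+00E9 is stored as the two-byte sequence $C3 $A9 in UTF-8, so the converted file gains one byte for each accented character of this kind.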

Here is a simple program layout:

Code:

;Compile as Unicode

Dim unicodeSubstitutions$(255) ;this is an array of 256 values indexed by the codepage character values

;
;Fill the substitution array with the appropriate data for the codepage/encoding in question.
;
;You will have to write this part, something like this should work though...
;
; DataSection
;  chardata:
;  Data.u  ....,....,....,.... ;x 256
; EndDataSection
;
; Restore chardata
; For i = 0 to 255: Read.u charValue: unicodeSubstitutions$(i) = Chr(charValue): Next i

sourceFileID = 0
outputFileID = 1


If ReadFile(sourceFileID, "SourceFile.txt")
  If CreateFile(outputFileID, "OutputFile.txt", #PB_UTF8) ;all output to this file will be in UTF-8 format
    While Not Eof(sourceFileID)
      sourceCharByte = ReadAsciiCharacter(sourceFileID) ;raw byte value, 0 to 255
      WriteString(outputFileID, unicodeSubstitutions$(sourceCharByte)) ;look up the replacement string for this byte
    Wend
    CloseFile(outputFileID)
  EndIf
  CloseFile(sourceFileID)
EndIf
If you don't know the current encoding, you can look at Wikipedia to check the various possible codepages and eliminate the ones that don't match up. You can also find the corresponding Unicode values for each codepage there.

Here are some links:

A popular Windows codepage, ANSI 1252.
A list of OEM codepages.
A list of many codepages including IBM EBCDIC , various ISO formats, and other ANSI and OEM pages.

Post by loulou2522 »

Hi Demivec,
Thanks for your answer.
Can you give me an example of how to fill in this part:

Code:

; DataSection
;  chardata:
;  Data.u  ....,....,....,.... ;x 256
; EndDataSection
I don't understand what to put in the data area.
Thanks in advance

Post by srod »

If you are on Windows and you know the ANSI codepage, then you could use MultiByteToWideChar_() to convert to Unicode. From there, you can use PokeS(.., -1, #PB_UTF8) to write in UTF-8 format (or WriteString() etc. if writing to a file). Make sure you compile in Unicode mode, but read the original string from the file as raw data so as not to destroy the ANSI encoding.
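
A minimal sketch of that approach, for illustration only. It assumes Windows, a Unicode compile, a source file encoded as Windows-1252 (the #SourceCodepage constant is a placeholder I introduced; change it to your codepage), and the same file names used earlier in the thread:

```purebasic
;Windows only, compile as Unicode
#SourceCodepage = 1252 ;assumption: the file is Windows-1252, change as needed

If ReadFile(0, "SourceFile.txt")
  size = Lof(0)
  If size > 0
    *raw = AllocateMemory(size)
    If *raw
      ReadData(0, *raw, size) ;raw bytes, no string conversion
      ;first call only returns the number of wide characters needed
      chars = MultiByteToWideChar_(#SourceCodepage, 0, *raw, size, 0, 0)
      If chars > 0
        *wide = AllocateMemory((chars + 1) * 2) ;UTF-16 buffer plus terminating null
        If *wide
          MultiByteToWideChar_(#SourceCodepage, 0, *raw, size, *wide, chars)
          If CreateFile(1, "OutputFile.txt")
            WriteString(1, PeekS(*wide, chars, #PB_Unicode), #PB_UTF8)
            CloseFile(1)
          EndIf
          FreeMemory(*wide)
        EndIf
      EndIf
      FreeMemory(*raw)
    EndIf
  EndIf
  CloseFile(0)
EndIf
```

This reads the whole file into memory at once, which is fine for typical text files but would need chunking for very large ones.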
I may look like a mule, but I'm not a complete ass.

Post by Demivec »

If you were going to use the Windows ANSI 1252 codepage you would fill it in like this:

Code:

DataSection
  chardata:
  Data.u $0000, $0001, $0002, $0003, $0004, $0005, $0006, $0007, $0008, $0009, $000A, $000B, $000C, $000D, $000E, $000F
  Data.u $0010, $0011, $0012, $0013, $0014, $0015, $0016, $0017, $0018, $0019, $001A, $001B, $001C, $001D, $001E, $001F
  Data.u $0020, $0021, $0022, $0023, $0024, $0025, $0026, $0027, $0028, $0029, $002A, $002B, $002C, $002D, $002E, $002F
  Data.u $0030, $0031, $0032, $0033, $0034, $0035, $0036, $0037, $0038, $0039, $003A, $003B, $003C, $003D, $003E, $003F
  Data.u $0040, $0041, $0042, $0043, $0044, $0045, $0046, $0047, $0048, $0049, $004A, $004B, $004C, $004D, $004E, $004F
  Data.u $0050, $0051, $0052, $0053, $0054, $0055, $0056, $0057, $0058, $0059, $005A, $005B, $005C, $005D, $005E, $005F
  Data.u $0060, $0061, $0062, $0063, $0064, $0065, $0066, $0067, $0068, $0069, $006A, $006B, $006C, $006D, $006E, $006F
  Data.u $0070, $0071, $0072, $0073, $0074, $0075, $0076, $0077, $0078, $0079, $007A, $007B, $007C, $007D, $007E, $007F
  Data.u $20AC, $0000, $201A, $0192, $201E, $2026, $2020, $2021, $02C6, $2030, $0160, $2039, $0152, $0000, $017D, $0000
  Data.u $0000, $2018, $2019, $201C, $201D, $2022, $2013, $2014, $02DC, $2122, $0161, $203A, $0153, $0000, $017E, $0178
  Data.u $00A0, $00A1, $00A2, $00A3, $00A4, $00A5, $00A6, $00A7, $00A8, $00A9, $00AA, $00AB, $00AC, $00AD, $00AE, $00AF
  Data.u $00B0, $00B1, $00B2, $00B3, $00B4, $00B5, $00B6, $00B7, $00B8, $00B9, $00BA, $00BB, $00BC, $00BD, $00BE, $00BF
  Data.u $00C0, $00C1, $00C2, $00C3, $00C4, $00C5, $00C6, $00C7, $00C8, $00C9, $00CA, $00CB, $00CC, $00CD, $00CE, $00CF
  Data.u $00D0, $00D1, $00D2, $00D3, $00D4, $00D5, $00D6, $00D7, $00D8, $00D9, $00DA, $00DB, $00DC, $00DD, $00DE, $00DF
  Data.u $00E0, $00E1, $00E2, $00E3, $00E4, $00E5, $00E6, $00E7, $00E8, $00E9, $00EA, $00EB, $00EC, $00ED, $00EE, $00EF
  Data.u $00F0, $00F1, $00F2, $00F3, $00F4, $00F5, $00F6, $00F7, $00F8, $00F9, $00FA, $00FB, $00FC, $00FD, $00FE, $00FF
EndDataSection
If you were going to fill it in for another codepage, you would look up the Unicode value for each character in that codepage and write those values instead. You would write all 256 values in order, starting with the value for byte $00 and ending with the value for byte 255 (hex $FF).
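
For one common case the table does not need to be typed out at all. ISO-8859-1 (Latin-1) maps each byte value directly to the Unicode code point with the same number, so, assuming that is your file's encoding, a loop can fill the same array used in the program layout earlier in the thread. This is only a sketch for that one codepage:

```purebasic
;Compile as Unicode
;Assumption: the source file is ISO-8859-1 (Latin-1), not Windows-1252
Dim unicodeSubstitutions$(255)
For i = 0 To 255
  unicodeSubstitutions$(i) = Chr(i) ;byte value equals the Unicode code point
Next i
```

Windows-1252 differs from ISO-8859-1 only in the range $80 to $9F, which is why those two rows of the table above are the only ones that are not simply the byte value repeated.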


Do you know which codepage your file is encoded with?

Post by loulou2522 »

Thanks Demivec,
that works perfectly with French CSV files exported from Excel.
Best Regards