Can someone help me convert a file in ASCII encoding with accented characters like "é" to UTF-8 encoding, preserving characters like é?
Thanks in advance
Convert an ascii files with accentuation in Utf-8
-
- Enthusiast
- Posts: 542
- Joined: Tue Oct 14, 2014 12:09 pm
Re: Convert an ascii files with accentuation in Utf-8
Just to clarify, even though the file is not in a Unicode encoding, it definitely isn't in an ASCII encoding either. ASCII only covers values 0 to 127. If it is a single-byte encoding, it is probably an ANSI or OEM codepage.
If you don't know the current encoding, you can look at Wikipedia to check the various codepages that are possible and eliminate the ones that don't match up. You can also find the corresponding Unicode values for each codepage there.
If you know the current encoding, you would simply make a substitution table. You would read one byte at a time, look up the equivalent Unicode value, and write the new value to a new file as UTF-8.
So if the byte was 233 (one of the possible encodings for "é"), you would substitute the Unicode character value $E9 and write this out in UTF-8 encoding.
Here is a simple program layout:
Code: Select all
;Compile as Unicode
Dim unicodeSubstitutions$(255) ;this is an array of 256 values indexed by the codepage character values
;
;Fill the substitution array with the appropriate data for the codepage/encoding in question.
;
;You will have to write this part, something like this should work though...
;
; DataSection
; chardata:
; Data.u ....,....,....,.... ;x 256
; EndDataSection
;
; Restore chardata
; For i = 0 to 255: Read.u charValue: unicodeSubstitutions$(i) = Chr(charValue): Next i
sourceFileID = 0
outputFileID = 1
ReadFile(sourceFileID, "SourceFile.txt")
CreateFile(outputFileID, "OutputFile.txt", #PB_UTF8) ;all output to this file will be in UTF-8 format
While Not Eof(sourceFileID)
  sourceCharByte = ReadAsciiCharacter(sourceFileID) ;read one raw byte (0 to 255)
  WriteString(outputFileID, unicodeSubstitutions$(sourceCharByte)) ;array name must match the Dim above
Wend
CloseFile(sourceFileID)
CloseFile(outputFileID)
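As a quick check of what "write this out in UTF-8" means for a single character, here is a minimal sketch (compiled as Unicode). It encodes the Unicode value $E9 into a small buffer and prints the resulting bytes. The buffer size and the variable name are only illustrative.
Code: Select all
; Encode the single character U+00E9 ("é") as UTF-8 and inspect the bytes
*buffer = AllocateMemory(8) ; more than enough for one character plus the terminating null
If *buffer
  PokeS(*buffer, Chr($E9), -1, #PB_UTF8)
  Debug StringByteLength(Chr($E9), #PB_UTF8) ; 2 (bytes needed in UTF-8)
  Debug Hex(PeekA(*buffer))                  ; C3
  Debug Hex(PeekA(*buffer + 1))              ; A9
  FreeMemory(*buffer)
EndIf
So the single source byte 233 ends up as the two bytes $C3 $A9 in the output file.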
Here are some links:
A popular Windows codepage, ANSI 1252.
A list of OEM codepages.
A list of many codepages including IBM EBCDIC , various ISO formats, and other ANSI and OEM pages.
Re: Convert an ascii files with accentuation in Utf-8
Hi Denivec,
Thanks for your answer.
Can you give me an example of how to fill
Code: Select all
; DataSection
; chardata:
; Data.u ....,....,....,.... ;x 256
; EndDataSection
Because I don't understand what to put in the data area.
Thanks in advance
Re: Convert an ascii files with accentuation in Utf-8
If you are on Windows and you know the ANSI codepage, you could use MultiByteToWideChar_() to convert to Unicode. From there, you can use PokeS(.., -1, #PB_UTF8) to write in UTF-8 format (or WriteString() etc. if writing to a file). Make sure you compile in Unicode mode, but read the original text from the file as raw data so as not to destroy the ANSI encoding.
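A rough sketch of that approach might look like this (Windows only, compiled as Unicode). It assumes the source file is in codepage 1252 and reuses the file names from the earlier example. The constant #SourceCodepage and the variable names are made up for illustration, and error handling is kept to a minimum.
Code: Select all
#SourceCodepage = 1252 ; assumption: the file is Windows-1252, change it to match your file

If ReadFile(0, "SourceFile.txt")
  length = Lof(0)
  If length > 0
    *raw = AllocateMemory(length)
    If *raw
      ReadData(0, *raw, length) ; raw bytes, no string conversion yet
      ; First call with no output buffer: returns the number of UTF-16 characters needed
      chars = MultiByteToWideChar_(#SourceCodepage, 0, *raw, length, 0, 0)
      If chars > 0
        text$ = Space(chars) ; reserve room for the converted text
        MultiByteToWideChar_(#SourceCodepage, 0, *raw, length, @text$, chars)
        If CreateFile(1, "OutputFile.txt")
          WriteString(1, text$, #PB_UTF8) ; encode the Unicode string as UTF-8
          CloseFile(1)
        EndIf
      EndIf
      FreeMemory(*raw)
    EndIf
  EndIf
  CloseFile(0)
EndIf
This loads the whole file into memory, so it suits ordinary text files rather than very large ones. A file containing null bytes would be cut off at the first one when written as a string.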
Re: Convert an ascii files with accentuation in Utf-8
loulou2522 wrote: Can you give me an example of how to fill the DataSection? I don't understand what to put in the data area.
If you were going to use the Windows ANSI 1252 codepage you would fill it in like this:
Code: Select all
DataSection
chardata:
Data.u $0000, $0001, $0002, $0003, $0004, $0005, $0006, $0007, $0008, $0009, $000A, $000B, $000C, $000D, $000E, $000F
Data.u $0010, $0011, $0012, $0013, $0014, $0015, $0016, $0017, $0018, $0019, $001A, $001B, $001C, $001D, $001E, $001F
Data.u $0020, $0021, $0022, $0023, $0024, $0025, $0026, $0027, $0028, $0029, $002A, $002B, $002C, $002D, $002E, $002F
Data.u $0030, $0031, $0032, $0033, $0034, $0035, $0036, $0037, $0038, $0039, $003A, $003B, $003C, $003D, $003E, $003F
Data.u $0040, $0041, $0042, $0043, $0044, $0045, $0046, $0047, $0048, $0049, $004A, $004B, $004C, $004D, $004E, $004F
Data.u $0050, $0051, $0052, $0053, $0054, $0055, $0056, $0057, $0058, $0059, $005A, $005B, $005C, $005D, $005E, $005F
Data.u $0060, $0061, $0062, $0063, $0064, $0065, $0066, $0067, $0068, $0069, $006A, $006B, $006C, $006D, $006E, $006F
Data.u $0070, $0071, $0072, $0073, $0074, $0075, $0076, $0077, $0078, $0079, $007A, $007B, $007C, $007D, $007E, $007F
Data.u $20AC, $0000, $201A, $0192, $201E, $2026, $2020, $2021, $02C6, $2030, $0160, $2039, $0152, $0000, $017D, $0000
Data.u $0000, $2018, $2019, $201C, $201D, $2022, $2013, $2014, $02DC, $2122, $0161, $203A, $0153, $0000, $017E, $0178
Data.u $00A0, $00A1, $00A2, $00A3, $00A4, $00A5, $00A6, $00A7, $00A8, $00A9, $00AA, $00AB, $00AC, $00AD, $00AE, $00AF
Data.u $00B0, $00B1, $00B2, $00B3, $00B4, $00B5, $00B6, $00B7, $00B8, $00B9, $00BA, $00BB, $00BC, $00BD, $00BE, $00BF
Data.u $00C0, $00C1, $00C2, $00C3, $00C4, $00C5, $00C6, $00C7, $00C8, $00C9, $00CA, $00CB, $00CC, $00CD, $00CE, $00CF
Data.u $00D0, $00D1, $00D2, $00D3, $00D4, $00D5, $00D6, $00D7, $00D8, $00D9, $00DA, $00DB, $00DC, $00DD, $00DE, $00DF
Data.u $00E0, $00E1, $00E2, $00E3, $00E4, $00E5, $00E6, $00E7, $00E8, $00E9, $00EA, $00EB, $00EC, $00ED, $00EE, $00EF
Data.u $00F0, $00F1, $00F2, $00F3, $00F4, $00F5, $00F6, $00F7, $00F8, $00F9, $00FA, $00FB, $00FC, $00FD, $00FE, $00FF
EndDataSection
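For Windows-1252 specifically, only the bytes 128 to 159 differ from their Unicode values, so the table can be shortened. The following is a sketch under that assumption (compile as Unicode). The label cp1252High is just a name chosen for this example, and the 32 values are the same as the $80 to $9F rows of the full table.
Code: Select all
Dim unicodeSubstitutions$(255)

; Outside 128..159, a Windows-1252 byte value equals its Unicode value
For i = 0 To 255
  unicodeSubstitutions$(i) = Chr(i)
Next i

; Override the 32 positions that differ
Restore cp1252High
For i = 128 To 159
  Read.u charValue
  unicodeSubstitutions$(i) = Chr(charValue)
Next i

Debug Hex(Asc(unicodeSubstitutions$(233))) ; E9   (é)
Debug Hex(Asc(unicodeSubstitutions$(128))) ; 20AC (euro sign)

DataSection
  cp1252High:
  Data.u $20AC, $0000, $201A, $0192, $201E, $2026, $2020, $2021, $02C6, $2030, $0160, $2039, $0152, $0000, $017D, $0000
  Data.u $0000, $2018, $2019, $201C, $201D, $2022, $2013, $2014, $02DC, $2122, $0161, $203A, $0153, $0000, $017E, $0178
EndDataSection
The entries left at $0000 are positions that codepage 1252 does not define. As far as I know Chr(0) gives an empty string, so such bytes (and any null bytes in the source) would simply be dropped from the output.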
Do you know which codepage your file is encoded with?
Re: Convert an ascii files with accentuation in Utf-8
Thanks Denivec,
that works perfectly with French CSV files exported from Excel.
Best Regards