ReadFile issue with Unicode and UTF-8

Post by Samuel »

Any Unicode or UTF-8 text file created by Notepad seems to have some sort of extra character at the beginning of the file.
When these files are opened in PureBasic, the first line is read incorrectly.

At first I assumed this was a problem with Notepad, but I have also tested Unicode files created with WordPad, and the result was the same.

This problem doesn't exist if the files were created by PureBasic, which I can live with in most cases. But now I have a text file that will at times be edited by an end user,
and after they edit it the file is no longer readable because of the first line.

I'm running Windows 7 Professional.

Any help on why this is happening would be greatly appreciated.

Here are the Notepad files that I have been testing.
ReadFile.zip

And the code I've been using to read them.

Code:

Define.s Result$

If ReadFile(0, "Unicode File.txt")
  Debug "UNICODE FILE"
  While Eof(0) = 0
    
    Result$ = ReadString(0, #PB_Unicode)
    Debug Result$
    If Result$ = "Line 1"
      Debug "Line 1 is OK"
    EndIf
    
  Wend
  CloseFile(0)
Else
  MessageRequester("Information","Couldn't open the Unicode File!")
EndIf


Debug ""
If ReadFile(0, "ANSI File.txt")
  Debug "ANSI FILE"
  While Eof(0) = 0
    
    Result$ = ReadString(0, #PB_Ascii)
    Debug Result$
    If Result$ = "Line 1"
      Debug "Line 1 is OK"
    EndIf
    
  Wend
  CloseFile(0)
Else
  MessageRequester("Information","Couldn't open the ANSI File!")
EndIf


Debug ""
If ReadFile(0, "UTF8 File.txt")
  Debug "UTF8 FILE"
  While Eof(0) = 0
    
    Result$ = ReadString(0, #PB_UTF8)
    Debug Result$
    If Result$ = "Line 1"
      Debug "Line 1 is OK"
    EndIf
    
  Wend
  CloseFile(0)
Else
  MessageRequester("Information","Couldn't open the UTF8 File!")
EndIf

Post by Demivec »

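Samuel wrote: Any help on why this is happening would be greatly appreciated.
It is the byte order mark (BOM), a short byte sequence that Notepad writes at the start of Unicode and UTF-8 files to identify the encoding.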
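
To see the BOM directly, the first few bytes of each file can be dumped in hex. The following is only a minimal sketch: the helper name DumpLeadingBytes is made up for illustration, and it assumes the test files from the first post are in the current directory. A UTF-8 BOM is the byte sequence EF BB BF, and a UTF-16 little-endian BOM is FF FE.

```purebasic
; Hypothetical helper (not from the thread): show the first three bytes in hex.
Procedure DumpLeadingBytes(filename.s)
  Protected i
  If ReadFile(0, filename)
    For i = 1 To 3
      If Eof(0)
        Break                       ; file is shorter than three bytes
      EndIf
      ; ReadAsciiCharacter() returns one unsigned byte (0-255)
      Debug RSet(Hex(ReadAsciiCharacter(0)), 2, "0")
    Next
    CloseFile(0)
  Else
    Debug "Couldn't open file '" + filename + "'"
  EndIf
EndProcedure

DumpLeadingBytes("UTF8 File.txt")
DumpLeadingBytes("Unicode File.txt")
```

If those bytes show up, they are what ReadString() was returning as part of the first line.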

Here is a modification of your sample code that sees if there is a BOM at the beginning by using ReadStringFormat():

Code:

Procedure TestForBOM(fileID)
  Protected result = ReadStringFormat(fileID)
  Select result
    Case #PB_Ascii
      Debug "Ascii file, no BOM detected."
    Case #PB_Unicode
      Debug "Unicode file, UTF-16 (little endian) BOM detected."
    Case #PB_UTF8
      Debug "UTF-8 file, UTF-8 BOM detected."
    Case #PB_UTF16BE, #PB_UTF32, #PB_UTF32BE
      Debug "File is not in a supported format, a non-PB BOM string format detected."
      result = #PB_Ascii ;use this for all non-standard types
  EndSelect
  ProcedureReturn result
EndProcedure

Procedure ExamineTextFile(filename.s)
  Protected stringFormat
  Debug ""
  If ReadFile(0, filename)
    stringFormat = TestForBOM(0)
    While Eof(0) = 0
      
      Result$ = ReadString(0, stringFormat)
      Debug Result$
      If Result$ = "Line 1"
        Debug "Line 1 is OK"
      EndIf
      
    Wend
    CloseFile(0)
  Else
    MessageRequester("Information","Couldn't open file '" + filename + "'!")
  EndIf
  
EndProcedure

ExamineTextFile("Unicode File.txt")
ExamineTextFile("ANSI File.txt")
ExamineTextFile("UTF8 File.txt")
It should show the BOM type if one is detected (only three types are tested for), and it should pass all of your 'Line 1 is OK' tests.

Post by Samuel »

Just tried your code and now everything seems to be back to normal.
Thanks for the help. :D

Post by Samuel »

After further testing I found one issue with your code.
Since PureBasic doesn't add a BOM to the beginning of its files, when you open one with your TestForBOM(fileID) procedure it assumes it's an ASCII file, but it could very well be one of the other types.

Would you happen to have any ideas on how to identify the type of a file that PureBasic created?

Here's an example that shows the problem I have. Since it assumed the file was ASCII, it's only collecting the first letter of the string.

Code:

Procedure TestForBOM(fileID)
  Protected result = ReadStringFormat(fileID)
  Select result
    Case #PB_Ascii
      Debug "Ascii file, no BOM detected."
    Case #PB_Unicode
      Debug "Unicode file, UTF-16 (little endian) BOM detected."
    Case #PB_UTF8
      Debug "UTF-8 file, UTF-8 BOM detected."
    Case #PB_UTF16BE, #PB_UTF32, #PB_UTF32BE
      Debug "File is not in a supported format, a non-PB BOM string format detected."
      result = #PB_Ascii ;use this for all non-standard types
  EndSelect
  ProcedureReturn result
EndProcedure

Procedure ExamineTextFile(filename.s)
  Protected stringFormat
  Debug ""
  If ReadFile(0, filename)
    stringFormat = TestForBOM(0)
    While Eof(0) = 0
     
      Result$ = ReadString(0, stringFormat)
      Debug Result$
      If Result$ = "Line 1"
        Debug "Line 1 is OK"
      EndIf
     
    Wend
    CloseFile(0)
  Else
    MessageRequester("Information","Couldn't open file '" + filename + "'!")
  EndIf
 
EndProcedure

If CreateFile(0, "PB Unicode File.txt", #PB_Unicode)
  WriteStringN(0, "Line 1")
  WriteStringN(0, "Line 2")
  WriteStringN(0, "Line 3")
  CloseFile(0)
Else
  MessageRequester("Information","Can't create PB Unicode File.")
EndIf

ExamineTextFile("PB Unicode File.txt")
;ExamineTextFile("Unicode File.txt")
;ExamineTextFile("ANSI File.txt")
;ExamineTextFile("UTF8 File.txt")

Post by Little John »

Samuel wrote: Since PureBasic doesn't add a BOM to the beginning of its files.
What do you mean by "its files"?
If you mean files created by your PureBasic code, then it's up to you to write a suitable BOM at the beginning of a file by using WriteStringFormat(), if wanted.
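
A minimal sketch of that approach, assuming the same file name as in the example above: WriteStringFormat() writes the BOM right after the file is created, and ReadStringFormat() later detects it and moves the read position past it.

```purebasic
Define format

; Write a UTF-16 file that starts with a BOM.
If CreateFile(0, "PB Unicode File.txt")
  WriteStringFormat(0, #PB_Unicode)        ; writes the BOM for this format
  WriteStringN(0, "Line 1", #PB_Unicode)
  WriteStringN(0, "Line 2", #PB_Unicode)
  WriteStringN(0, "Line 3", #PB_Unicode)
  CloseFile(0)
Else
  MessageRequester("Information", "Can't create PB Unicode File.")
EndIf

; Read it back, letting the BOM decide the string format.
If ReadFile(0, "PB Unicode File.txt")
  format = ReadStringFormat(0)             ; detects the BOM and skips it
  While Eof(0) = 0
    Debug ReadString(0, format)
  Wend
  CloseFile(0)
EndIf
```

The string format is passed explicitly to WriteStringN() here so the sketch does not depend on the file's default format.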

Post by Samuel »

Yep, you're right. I wasn't setting up the BOM for my files.
Thank you both for the help.

Post by Demivec »

Samuel wrote: Would you happen to have any ideas on how to identify the type of a file that PureBasic created?
First, a comment about BOMs. You don't need a BOM if you already know what type of file to expect. The BOM designates what kind of file it is when more than one type is allowed, or when the file comes from a third-party source (i.e. not your own).

As you have already found out, if it's your file you can write a BOM to help set things straight.

As a programming challenge, I'd say it wouldn't be too difficult to also detect either Unicode or UTF-8 encoding by examining characters from the file. As long as the file contains nothing besides text, I think you would stand a good chance of determining its encoding correctly.

For Unicode, if more than 25% of the first 200 bytes of the file are nulls, and those nulls consistently fall on either the odd or the even numbered bytes, you are probably looking at a Unicode (UTF-16) file. You can refine those percentages with a little testing on sample content from your target files.
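
A rough sketch of this null-counting heuristic might look like the following. The procedure name LooksLikeUTF16 and the 90% "one side" threshold are assumptions made for illustration; only the 200-byte window and the 25% figure come from the description above.

```purebasic
; Hypothetical helper: returns #True if the start of the file looks like
; UTF-16 text without a BOM, based on where the null bytes fall.
Procedure LooksLikeUTF16(filename.s)
  Protected *buf, length, i, evenNulls, oddNulls, totalNulls, result = #False
  If ReadFile(0, filename)
    *buf = AllocateMemory(200)
    If *buf
      length = ReadData(0, *buf, 200)      ; number of bytes actually read
      For i = 0 To length - 1
        If PeekA(*buf + i) = 0
          If i & 1
            oddNulls = oddNulls + 1
          Else
            evenNulls = evenNulls + 1
          EndIf
        EndIf
      Next
      totalNulls = evenNulls + oddNulls
      ; More than 25% nulls overall ...
      If length > 0 And totalNulls * 4 > length
        ; ... and at least 90% of them at odd or at even offsets (assumed threshold).
        If oddNulls * 10 >= totalNulls * 9 Or evenNulls * 10 >= totalNulls * 9
          result = #True
        EndIf
      EndIf
      FreeMemory(*buf)
    EndIf
    CloseFile(0)
  EndIf
  ProcedureReturn result
EndProcedure

Debug LooksLikeUTF16("PB Unicode File.txt")
```

For mostly Latin text, nulls at odd offsets suggest little-endian and nulls at even offsets suggest big-endian. Text in other scripts may contain few nulls, so this is a guess rather than a proof.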

If you wanted to detect a UTF-8 file, you could first try to eliminate Unicode encoding as a possibility. Then you would look for the patterns that make up multi-byte UTF-8 characters, since single-byte UTF-8 is the same as ASCII.
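
The multi-byte pattern check could be sketched roughly as below. The procedure name CountUTF8Sequences and its return convention are assumptions for illustration. It only checks that each lead byte is followed by the right number of continuation bytes (10xxxxxx); it does not reject overlong or out-of-range encodings.

```purebasic
; Hypothetical helper. Scans a memory buffer and returns:
;   -1  if a byte sequence cannot be UTF-8
;    0  if every byte is plain ASCII (ASCII and UTF-8 are identical here)
;   >0  the number of well-formed multi-byte sequences found
Procedure CountUTF8Sequences(*buf, length)
  Protected i = 0, k, b, follow, count = 0
  While i < length
    b = PeekA(*buf + i)
    If b < $80
      follow = 0                    ; single byte, same as ASCII
    ElseIf (b & $E0) = $C0
      follow = 1                    ; 110xxxxx: one continuation byte
    ElseIf (b & $F0) = $E0
      follow = 2                    ; 1110xxxx: two continuation bytes
    ElseIf (b & $F8) = $F0
      follow = 3                    ; 11110xxx: three continuation bytes
    Else
      ProcedureReturn -1            ; stray continuation byte or invalid lead byte
    EndIf
    If i + follow >= length
      Break                         ; sequence cut off by the end of the buffer
    EndIf
    For k = 1 To follow
      If (PeekA(*buf + i + k) & $C0) <> $80
        ProcedureReturn -1          ; expected a 10xxxxxx continuation byte
      EndIf
    Next
    If follow > 0
      count = count + 1
    EndIf
    i = i + follow + 1
  Wend
  ProcedureReturn count
EndProcedure

; Usage: load a whole file into memory and scan it.
Define *mem, size
If ReadFile(0, "UTF8 File.txt")
  size = Lof(0)
  If size > 0
    *mem = AllocateMemory(size)
    If *mem
      size = ReadData(0, *mem, size)
      Debug CountUTF8Sequences(*mem, size)
      FreeMemory(*mem)
    EndIf
  EndIf
  CloseFile(0)
EndIf
```

A positive result on a file that is not UTF-16 makes UTF-8 likely. A result of 0 leaves the choice between ASCII and UTF-8 open, which is harmless because the bytes mean the same thing in both.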


I'm tempted to write some code to demonstrate how this kind of detection should work. Maybe later. :)

Post by Little John »

Demivec wrote: As a programming challenge, I'd say it wouldn't be too difficult to also detect either Unicode or UTF-8 encoding by examining characters from the file. As long as the file contains nothing besides text, I think you would stand a good chance of determining its encoding correctly.
I think a function for detecting the format of a text file (ASCII, UTF-8 etc.) without a BOM would be pretty useful, since those files often don't have a BOM (and according to Wikipedia, the Unicode Standard does not recommend its use in UTF-8).
Demivec wrote: I'm tempted to write some code to demonstrate how this kind of detection should work. Maybe later. :)
How much later? :D

Post by Demivec »

Little John wrote:
Demivec wrote: I'm tempted to write some code to demonstrate how this kind of detection should work. Maybe later. :)
How much later? :D
I'll work to find some time, perhaps by this weekend or within a week from now. :)

When I do I'll post it in the Tips n' Tricks forum and invite comments there as well as improvements.

Post by Amilcar Matos »

Merry Christmas Demivec! :)
Just a small frame for your BOM procedure.

Code:

;{- Program header
;==Code Header Comment==============================
;        Name/title: BOMBOMDetector.pb
;   Executable name: BOMBOMDetector.exe
;           Version: 1.0
;            Author: Demivec
;     Collaborators: Amílcar Matos Pérez
;    Translation by: 
;       Create date: 03/Dec/2015
; Previous releases: 
;Most recent update: 
; Release date/hour: 
;  Operating system: Windows  [X]GUI
;  Compiler version: PureBasic 5.31 (x86)
;         Copyright: (C)2015 AMP All rights reserved.
;           License: 
;         Libraries: 
;             Forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=64180&sid=ec0e244cfabf06876bfd82d3f709cc1c
;  Tested platforms: Windows
;       Explanation: To detect the file encoding by examining the file content.
; ==================================================
;.......10........20........30........40........50........60........70........80
;}

;{ Declare procedures
Declare.s BrowseProcedure()
Declare.l ClearWindowDataEntryFields (Window_BOMBOM)
Declare OpenWindow_BOMBOM (x = 0, y = 0, width = 600, height = 400)
Declare.l Window_Events (Event_BOMBOM)
;}

;{ Variable exposure stmts
Global Window_BOMBOM
Global BrowseButton_BOMBOM
Global ClipBoardButton_BOMBOM
Global ClearButton_BOMBOM
Global DetectButton_BOMBOM
Global ExitButton_BOMBOM
Global FileNameStr
Global ResultsEditor
Global Text_0
Global Text_1
;}

Enumeration FormFont
  #Font_Window_BOMBOM_0
EndEnumeration

LoadFont(#Font_Window_BOMBOM_0,"Consolas", 14)

OpenWindow_BOMBOM()
ClearWindowDataEntryFields(Window_BOMBOM)
  
Repeat
  Event_BOMBOM = WaitWindowEvent()
  Quit_BOMBOM  = Window_Events(Event_BOMBOM)
Until Quit_BOMBOM = 0

End 

Procedure.l Window_Events(Event_BOMBOM)
  
  Select Event_BOMBOM
    Case #PB_Event_CloseWindow
      ProcedureReturn #False
           
    Case #PB_Event_Gadget
      Select EventGadget()        
          
        Case ExitButton_BOMBOM        ;{- Exit
          ProcedureReturn #False      ;}        
          
        Case ClearButton_BOMBOM       ;{- Clear Data Entry Fields
          ClearWindowDataEntryFields(Window_BOMBOM)
          ProcedureReturn #True       ;}     
          
        Case BrowseButton_BOMBOM      ;{- Browse for file to test.
          SetGadgetText(FileNameStr, BrowseProcedure())
          ProcedureReturn #True       ;}     
          
        Case ClipBoardButton_BOMBOM   ;{- Copy the results to the clipboard.
          SetClipboardText(GetGadgetText(ResultsEditor))
          ProcedureReturn #True       ;}     
          
        Case DetectButton_BOMBOM      ;{- Detect BOM 
          ; TODO Insert here the BOM detector procedure.
          ProcedureReturn #True       ;}     
          
      EndSelect
  EndSelect
  ProcedureReturn #True
EndProcedure

Procedure.l ClearWindowDataEntryFields(Window_BOMBOM)
  ;{- Procedure explanation
  ; To blank the screen data entry fields.
  ;}
  
  SetGadgetText(FileNameStr  , #NULL$)  
  SetGadgetText(ResultsEditor, #NULL$)  
  SetActiveGadget(FileNameStr)
  
  ProcedureReturn #True
  
EndProcedure     ; ClearWindowDataEntryFields(Window_BOMBOM)

Procedure.s BrowseProcedure()
  ;{- Procedure explanation
  ; To ease the file selection task.
  ;}
  
  ;{- Protected variables
  Protected StandardFile$
  Protected File$
  Protected Pattern$
  ;}
  
  StandardFile$ = "C:\"   ; set initial file+path to display
                          ; With next string we will set the search patterns ("|" as separator) for file displaying:
                          ;  1st: "Text (*.txt)" as name, ".txt" and ".bat" as allowed extension
                          ;  2nd: "PureBasic (*.pb)" as name, ".pb" as allowed extension
                          ;  3rd: "All files (*.*)" as name, "*.*" as allowed extension, valid for all files
  Pattern$ = "Text (*.txt)|*.txt;*.bat|PureBasic (*.pb)|*.pb|All files (*.*)|*.*"
  Pattern  = 0    ; use the first of the three possible patterns as standard
  File$    = OpenFileRequester("Please choose file to test", StandardFile$, Pattern$, Pattern)
  
  ProcedureReturn File$
  
EndProcedure     ; BrowseProcedure

; PB Forms Code
; This code is automatically generated by the FormDesigner.
; Manual modification is possible to adjust existing commands, but anything else will be dropped when the code is compiled.
; Event procedures needs to be put in another source file.
;
Procedure OpenWindow_BOMBOM(x = 0, y = 0, width = 600, height = 400)
  
  Window_BOMBOM = OpenWindow(#PB_Any, x, y, width, height, "BOM-BOM Detector", #PB_Window_SystemMenu | #PB_Window_MinimizeGadget | #PB_Window_ScreenCentered)
  
  CreateStatusBar(0, WindowID(Window_BOMBOM))
  AddStatusBarField(150)
  StatusBarText(0, 0, "(c) 2015 Demivec")
  
  ExitButton_BOMBOM      = ButtonGadget(#PB_Any, 510, 330, 60 , 30, "Exit"  )  
  ClearButton_BOMBOM     = ButtonGadget(#PB_Any, 280, 330, 100, 30, "Clear" )  
  DetectButton_BOMBOM    = ButtonGadget(#PB_Any, 400, 330, 100, 30, "Detect")  
  BrowseButton_BOMBOM    = ButtonGadget(#PB_Any, 510,  40,  60, 30, "Browse")  
  ClipBoardButton_BOMBOM = ButtonGadget(#PB_Any, 510, 280,  60, 30, "Copy To Clipboard", #PB_Button_MultiLine)  
  
  Text_0 = TextGadget(#PB_Any, 10, 40, 60, 15, "Filename")  
  Text_1 = TextGadget(#PB_Any, 10, 90, 50, 15, "Results" )  
  FileNameStr   = StringGadget(#PB_Any, 70, 40 , 430, 30 , #NULL$)
  ResultsEditor = EditorGadget(#PB_Any, 20, 110, 480, 200)    
  SetGadgetFont(FileNameStr  , FontID(#Font_Window_BOMBOM_0))  
  SetGadgetFont(ResultsEditor, FontID(#Font_Window_BOMBOM_0))  
  
EndProcedure

Post by Little_man »

With code of: "Addict"


Debug #PB_Unicode
;By Purebasic 4.61, x86 ---> 9.
;By Purebasic 5.11, x86 ---> 25.
;By Purebasic 5.30, x86 ---> 25.