Detecting Text File Encoding without BOM

Share your advanced PureBasic knowledge/code with the community.
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Detecting Text File Encoding without BOM

Post by Demivec »

Here is a detection routine for detecting the encoding (unicode types or ASCII) and byte order for text files.

It may need some improvements. If there are any suggestions or comments I'll see what can be done to reasonably incorporate them.

The code does not detect codepages for extended ASCII character codes.

It will attempt to determine the same encoding types as those returned by PureBasic's ReadStringFormat() command which looks for a BOM at the beginning of a file.

I will post the code in the next two messages and leave a blank message for any additional things that may be needed to be posted at the beginning of the thread in the future.

Please test the code and let me know if certain things are useful or useless. It's good to know either way. :)
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Detecting Text File Encoding without BOM

Post by Demivec »

First a module include called AutoDetectTextEncoding.pbi:

Code: Select all

;{- Program header
;==Code Header Comment==============================
;        Name/title: AutoDetectTextEncoding.pbi
;           Version: 1.0
;            Author: Demivec
;       Create date: 24/Dec/2015
;  Operating system: Windows  [X]GUI
;  Compiler version: PureBasic 5.41 (x64)
;           License: Free to use/abuse/modify.
;                   // THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
;                   // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
;                   // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
;                   // ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
;                   // FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
;                   // DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
;                   // OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
;                   // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
;                   // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
;                   // OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
;                   // SUCH DAMAGE.
;             Forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=64180&sid=ec0e244cfabf06876bfd82d3f709cc1c
;                    http://www.purebasic.fr/english/viewtopic.php?f=12&t=64385
;  Tested platforms: Windows
;       Explanation: To detect the file encoding by examining the file content.
;                    Includes detection of Unicode (UTF-8, UTF-16, UTF-32) big and little endian encodings.
;                    UTF-16 also includes UCS-2.  Does not include detection of code page for ASCII.
; ==================================================
;}

;module DetectTextEncoding
DeclareModule dte
  EnableExplicit
  
  ;flags for use with detect encoding functions
  EnumerationBinary
    #disallowNulls     ;don't allow Nulls to be in ASCII or a unicode encoding (UTF-8, UTF-16, and UTF-32)
    #enforceProperSurrogateCodes ;surrogate codes need to be paired in a proper lead/tail fashion
    #restrictSurrogateCodesToUTF16 ;allow surrogate codes only in UTF-16 encodings
  EndEnumeration
  
  ;constants for custom decoding results, usually errors
  #err_detectionProcessFailedToStart = -1 ;error given for memmory or filename errors in detection process
  #err_emptyFile = -2 ;error given for an empty file
  #encodingUndetermined = 0
  
  #maxLength = $10000 ;64k is the maximum to read from a file for detection purposes
  
  Declare.i isNativePureBasicFormat(type) ;Returns #True if 'type' is a name PureBasic format
  Declare.s textForDetectedStringFormat(type) ;translates a detection result into text
  
  ;Each of the detection functions returns a constant for the type if successful, #encodingUndetermined if
  ;not successful, or #err_detectionProcessFailedToStart if an error was encountered.
  Declare.i detectTextEncodingInBuffer(*buffer, length, flags = 0) ;returns type
  Declare.i detectTextEncodingInFile(filename.s, length = -1, flags = 0) ;returns type
  Declare detectTextEncodingOfBuffers(Array *buffers(1), Array results(1), length, flags = 0) ;returns results in an array
  Declare detectTextEncodingOfFiles(Array filenames.s(1), Array results(1), length = -1, flags = 0) ;returns results in an array
EndDeclareModule

Module dte
  Enumeration
    #bigEndian = 0
    #littleEndian = 1
  EndEnumeration
  
  Procedure isNativePureBasicFormat(type)  ;Returns #True if 'type' is a name PureBasic format
    Protected result
    Select type
      Case #PB_Ascii, #PB_Unicode, #PB_UTF8
        result = #True
      Default ;includes #PB_UTF16BE, #PB_UTF32, #PB_UTF32BE
        result = #False
    EndSelect
    ProcedureReturn result
  EndProcedure
  
  Procedure.s textForDetectedStringFormat(type) ;translates a detection result into text
    Protected result.s
    Select type
      Case #PB_Ascii
        result = "ASCII"
      Case #PB_Unicode
        result = "Unicode"
      Case #PB_UTF8
        result = "UTF-8"
      Case #PB_UTF16BE
        result = "UTF-16BE"
      Case #PB_UTF32
        result = "UTF-32"
      Case #PB_UTF32BE
        result = "UTF-32BE"
      Case #encodingUndetermined
        result = "Undetermined"
      Case #err_emptyFile
        result = "File is empty"
      Case #err_detectionProcessFailedToStart
        result = "Detection process failed to start"
    EndSelect
    ProcedureReturn result
  EndProcedure
  
  Procedure countHigherValueIndex(Array s(1), Array n(1))
    ;increment the index of s() that corresponds to the element of n() that is the highest
    If n(0) > n(1)
      s(0) + 1
    ElseIf n(1) > n(0)
      s(1) + 1
    EndIf
  EndProcedure
  
  Procedure.i detectTextEncodingInBuffer(*buffer, length, flags = 0)
    ;The return value will be one of the BOM for string format types listed
    ;in PureBasic's ReadStringFormat() function or it will return 0
    ;for an undetermined format.
    Protected *bufferPtr.Ascii = *buffer, *endPtr = *buffer + length - 1
    Protected isNotASCII, isNotUTF8, isNotUTF16, isNotUTF32
    Protected remainingUTF8_ByteCount, nonASCII_UTF8_Count, workingByte
    Protected byteParity_2, UTF8_surrogateCheckStatus
    Protected byteParity_4, UTF16_commonCharacterCount, i, result
    Protected disallowNulls = flags & #disallowNulls
    Protected enforceSurrogates = flags & #enforceProperSurrogateCodes
    Protected restrictSurrogates = flags & #restrictSurrogateCodesToUTF16
    
    Dim UTF16_nullCount(1)              ;count of nulls in high bytes for each endianness
    Dim isUTF16_EndiannessEliminated(1) ;boolean values
    Dim UTF16_surrogateEndianness(1)    ;count of surrogates for UTF-16 endianness detection
    Dim UTF16_leadSurrogateAddress(1)   ;buffer address at which leadSurrogate was found for each endianness
    Dim UTF16_codePoint.u(1)            ;complete code point interpretation for each endianness
    Dim UTF16_commonCharacterCount(1)   ;count of common code points (i.e. for Space, Tab, CR, LF) for each endiannness
    Dim UTF16_EndiannessStatisticsEval(2) ;sum of endianness probability, indexes {0,1} relate to each endianness
    Dim UTF32_codePoint.l(1)              ;complete code point interpretation for each endianness
    Dim isUTF32_EndiannessEliminated(1)   ;boolean values
    Dim UTF32_surrogateEndianness(1)      ;count of surrogates for UTF-32 endianness detection
    Dim UTF32_leadSurrogateAddress(1)     ;buffer address at which leadSurrogate was found for each endianness
    
    If Not *buffer
      ProcedureReturn #err_detectionProcessFailedToStart
    EndIf
    
    While *bufferPtr <= *endPtr
      If isNotASCII = #False
        If *bufferPtr\a = 0 And disallowNulls
          isNotASCII = #True
        EndIf   
      EndIf
      
      If isNotUTF8 = #False
        ;This encoding is ruled out if a mismatch of surrogate pairs is found or a Null is found
        ;when it is not allowed.
        
        ;Invalid code points (according to Wikipedia)
        ;According to the UTF-8 definition (RFC 3629) the high and low surrogate
        ;halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode
        ;values, and their UTF-8 encoding should be treated as an invalid byte
        ;sequence.
        ;
        ;Whether an actual application should do this is debatable, as it makes it
        ;impossible to store invalid UTF-16 (that is, UTF-16 with unpaired
        ;surrogate halves) in a UTF-8 string. This is necessary to store unchecked
        ;UTF-16 such as Windows filenames as UTF-8. It is also incompatible with
        ;CESU-8 encoding (described below).
        ;
        ;Many programs added UTF-8 conversions for UCS-2 data and did not alter
        ;this UTF-8 conversion when UCS-2 was replaced with the surrogate-pair
        ;using UTF-16. In such programs each half of a UTF-16 surrogate pair is
        ;encoded as its own 3-byte UTF-8 encoding, resulting in 6-byte sequences
        ;rather than 4 bytes for characters outside the Basic Multilingual Plane.
        ;Oracle and MySQL databases use this, as well as Java and Tcl as described
        ;below, and probably many Windows programs where the programmers were
        ;unaware of the complexities of UTF-16. Although this non-optimal encoding
        ;is generally not deliberate, a supposed benefit is that it preserves
        ;UTF-16 binary sorting order when CESU-8 is binary sorted.
        
        Select *bufferPtr\a & %11000000
          Case %00000000 ;ASCII byte
            If remainingUTF8_ByteCount > 0 ;in the middle of a multi-byte code point
              isNotUTF8 = #True
              remainingUTF8_ByteCount = 0 ;start over
            EndIf
            
            If *bufferPtr\a = 0 And disallowNulls
              isNotUTF8 = #True
            EndIf
          Case %10000000 ;continuation byte of a multi-byte code point
            If remainingUTF8_ByteCount = 0
              isNotUTF8 = #True ;error, not looking for any more bytes of a multi-byte code point
            Else
              remainingUTF8_ByteCount - 1
              If enforceSurrogates
                If remainingUTF8_ByteCount = 1 And UTF8_surrogateCheckStatus <> 0
                  ;compare second byte of multi-byte code point to check lead/tail ordering of surrogates
                  Select *bufferPtr\a & %10110000
                    Case %10100000 ;lead surrogate
                      If UTF8_surrogateCheckStatus <> 1 
                        isNotUTF8 = #True ;error, tail surrogate should come after previous lead surrogate
                      Else
                        UTF8_surrogateCheckStatus = 2
                      EndIf
                    Case %10110000 ;tail surrogate
                      If UTF8_surrogateCheckStatus <> 3
                        isNotUTF8 = #True ;error, tail surrogate came before lead surrogate
                      Else
                        UTF8_surrogateCheckStatus = 0 ;proper pairing, reset
                      EndIf
                  EndSelect
                EndIf
              EndIf
              If remainingUTF8_ByteCount = 0
                nonASCII_UTF8_Count + 1
              EndIf
            EndIf
          Case %11000000 ;start byte of multi-byte code point (1 - 3 more bytes)
            If remainingUTF8_ByteCount
              isNotUTF8 = #True ;error, still looking for more bytes of a multi-byte code point
            Else
              ;calculate number of remaining bytes
              workingByte = *bufferPtr\a << 1
              remainingUTF8_ByteCount = 0
              While workingByte & %10000000 > 0
                remainingUTF8_ByteCount + 1
                workingByte << 1
              Wend
              
              ;reject UTF-8 sequences (>4 bytes) and also ones that carry no payload in the first byte
              If remainingUTF8_ByteCount > 3 Or workingByte = 0
                isNotUTF8 = #True ;error for an overlong sequence
              EndIf
              
              If *bufferPtr\a = $ED ;byte starts a sequence for a surrogate encoding 
                If restrictSurrogates
                  isNotUTF8 = #True ;error, invalid
                EndIf
                
                UTF8_surrogateCheckStatus + 1 ;advance status to check next byte for lead/tail code
              EndIf
            EndIf
        EndSelect
      EndIf
      
      If isNotUTF16 = #False
        byteParity_2 = (*bufferPtr - *buffer) % 2 ;byte parity of the offset will equal (1) on a completed code point boundary
        If *bufferPtr\a = 0
          UTF16_nullCount(byteParity_2) + 1
        EndIf
        
        ;build complete code points to test
        UTF16_codePoint(#bigEndian) = UTF16_codePoint(#bigEndian) << 8 + *bufferPtr\a
        UTF16_codePoint(#littleEndian) = UTF16_codePoint(#littleEndian) >> 8 + *bufferPtr\a << 8
        
        If byteParity_2 = 1 ;we're on a code point boundary (byte parity of the offset is 'odd')
          For i = #bigEndian To #littleEndian
            If isUTF16_EndiannessEliminated(i) = #False
              Select UTF16_codePoint(i)
                Case $0020, $000A, $000D, $0009 ;space, LF, CR, Tab
                  UTF16_commonCharacterCount(i) + 1
                Case $0000          ;#Null
                  If disallowNulls
                    isNotUTF16 = #True
                  EndIf               
                Case $D800 To $DBFF ;lead surrogate 
                  If enforceSurrogates
                    If UTF16_leadSurrogateAddress(i)
                      isUTF16_EndiannessEliminated(i) = #True
                      If isUTF16_EndiannessEliminated(i ! 1)
                        isNotUTF16 = #True
                      EndIf
                    EndIf  
                  EndIf
                Case $DC00 To $DFFF ;tail surrogate
                  If enforceSurrogates
                    If UTF16_leadSurrogateAddress(i) = *bufferPtr - SizeOf(Unicode)
                      UTF16_surrogateEndianness(i) + 1
                      UTF16_leadSurrogateAddress(i) = 0 ;reset value
                    Else
                      isUTF16_EndiannessEliminated(i) = #True
                      If isUTF16_EndiannessEliminated(i ! 1)
                        isNotUTF16 = #True
                      EndIf
                    EndIf
                  EndIf
              EndSelect
            EndIf
          Next
        EndIf
      EndIf
      
      If isNotUTF32 = #False
        ;Because surrogate code points are not included in the set of Unicode scalar values,
        ;UTF-32 code units in the range $0000D800..$0000DFFF are ill-formed.
        ;Any UTF-32 code unit greater than $0010FFFF is ill-formed.
        ;
        ;UTF-32 is forbidden from storing the non-character code points that are illegal for
        ;interchange, such as 0xFFFF, 0xFFFE, and the all the surrogates.
        ;UTF is a transport encoding, not an internal one.
        ;
        ;According to stackoverflow.com:
        ;But UTF-32 is easy to detect even without a BOM. This is because the
        ;Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units
        ;always have the pattern 00 {0x|10} xx xx (for BE) or xx xx {0x|10} 00 (for
        ;LE). If the data has a length that's a multiple of 4, and follows one of
        ;these patterns, you can safely assume it's UTF-32. False positives are
        ;nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
        
        byteParity_4 = (*bufferPtr - *buffer) % 4 ;byte parity of the offset will be 3 on a completed code point boundary
        
        Select byteParity_4
          Case 0
            If *bufferPtr\a <> 0 And isUTF32_EndiannessEliminated(#bigEndian) = #False
              isUTF32_EndiannessEliminated(#bigEndian) = #True
              If isUTF32_EndiannessEliminated(#littleEndian)
                isNotUTF32 = #True
              EndIf
            EndIf
          Case 1
            If *bufferPtr\a > $10 And isUTF32_EndiannessEliminated(#bigEndian) = #False
              isUTF32_EndiannessEliminated(#bigEndian) = #True
              If isUTF32_EndiannessEliminated(#littleEndian)
                isNotUTF32 = #True
              EndIf
            EndIf
            
          Case 2
            If *bufferPtr\a > $10 And isUTF32_EndiannessEliminated(#littleEndian) = #False
              isUTF32_EndiannessEliminated(#littleEndian) = #True
              If isUTF32_EndiannessEliminated(#bigEndian)
                isNotUTF32 = #True
              EndIf
            EndIf
          Case 3
            If *bufferPtr\a <> 0 And isUTF32_EndiannessEliminated(#littleEndian) = #False
              isUTF32_EndiannessEliminated(#littleEndian) = #True
              If isUTF32_EndiannessEliminated(#bigEndian)
                isNotUTF32 = #True
              EndIf
            EndIf
        EndSelect
        
        ;build complete code points to test
        UTF32_codePoint(#bigEndian) = UTF32_codePoint(#bigEndian) << 8 + *bufferPtr\a
        UTF32_codePoint(#littleEndian) = (UTF32_codePoint(#littleEndian) >> 8) & $00FFFFFF
        UTF32_codePoint(#littleEndian) + *bufferPtr\a << 24
        
        If byteParity_4 = 3 ;we're on a code point boundary
          For i = #bigEndian To #littleEndian
            If isUTF32_EndiannessEliminated(i) = #False
              Select UTF32_codePoint(i)
                Case $00000000      ;#Null
                  If disallowNulls
                    isNotUTF32 = #True
                  EndIf               
                Case $0000D800 To $0000DFFF ;lead surrogate and tail surrogates for UTF16
                  If restrictSurrogates
                    isUTF32_EndiannessEliminated(i) = #True ;error, found a surrogate code
                    If isUTF32_EndiannessEliminated(i ! 1)
                      isNotUTF32 = #True
                    EndIf
                  ElseIf enforceSurrogates
                    ;need to test for lead/tail ordering and unmatched pairs
                    If UTF32_codePoint(i) <$0000DC00 ;lead surrogate
                      If UTF32_leadSurrogateAddress(i)
                        isUTF32_EndiannessEliminated(i) = #True
                        If isUTF32_EndiannessEliminated(i ! 1)
                          isNotUTF32 = #True
                        EndIf
                      EndIf  
                    Else                             ;tail surrogate
                      If UTF32_leadSurrogateAddress(i) = *bufferPtr - SizeOf(Unicode)
                        UTF32_surrogateEndianness(i) + 1
                        UTF32_leadSurrogateAddress(i) = 0 ;reset value
                      Else
                        isUTF32_EndiannessEliminated(i) = #True
                        If isUTF32_EndiannessEliminated(i ! 1)
                          isNotUTF32 = #True
                        EndIf
                      EndIf
                    EndIf
                  EndIf
                Case $0000FFFE To $0000FFFF ;forbidden non-character code points
                  isUTF32_EndiannessEliminated(i) = #True
                  If isUTF32_EndiannessEliminated(i ! 1)
                    isNotUTF32 = #True
                  EndIf
              EndSelect
            EndIf
          Next
        EndIf
      EndIf
      
      *bufferPtr + SizeOf(Ascii)
    Wend

    ;Examine statics and determine most likely encoding type.
    ;Detection order is UTF-8, UTF-32 (both endians), UTF-16 (both endians), ASCII, else undetermined (a.k.a binary).
    countHigherValueIndex(UTF16_EndiannessStatisticsEval(), UTF16_nullCount())
    countHigherValueIndex(UTF16_EndiannessStatisticsEval(), UTF16_commonCharacterCount())
    countHigherValueIndex(UTF16_EndiannessStatisticsEval(), UTF16_surrogateEndianness())
    If isNotUTF8 = #False And nonASCII_UTF8_Count > 0
      result = #PB_UTF8
    ElseIf isNotUTF32 = #False
      If isUTF32_EndiannessEliminated(#bigEndian) = #False
        result = #PB_UTF32BE ;default
      Else
        result = #PB_UTF32 ;little endian
      EndIf
    ElseIf isNotUTF16 = #False And (UTF16_EndiannessStatisticsEval(#bigEndian) + UTF16_EndiannessStatisticsEval(#littleEndian)) > 0; And (UTF16_commonCharacterCount / length) > 0.05 ;percentage is arbitrary
      
      If isUTF16_EndiannessEliminated(#bigEndian) = #True
        result = #PB_Unicode
      ElseIf isUTF16_EndiannessEliminated(#littleEndian) = #True
        result = #PB_UTF16BE
      Else
        ;still undecided, examine statistics to settle it
        If UTF16_EndiannessStatisticsEval(#bigEndian) < UTF16_EndiannessStatisticsEval(#littleEndian)
          result = #PB_Unicode    
        Else
          result = #PB_UTF16BE ;default
        EndIf
      EndIf
      
    ElseIf isNotASCII = #False
      result = #PB_Ascii
    Else
      result = #encodingUndetermined ;is either binary type (because of detected nulls) or is non-determined
    EndIf
    
    ProcedureReturn result
  EndProcedure
  
  Procedure.i detectTextEncodingInFile(filename.s, length = -1, flags = 0)
    ;If detection process was not able to start successfully,
    ;the value -1 will be returned.  If detection was attempted
    ;the return value will be the same as those for detectTextEncodingInBuffer().
    Protected *buffer, result, fileID, lof
    
    result = #err_detectionProcessFailedToStart ;this will be changed if successful
    fileID = ReadFile(#PB_Any, filename)
    If fileID
      lof = Lof(fileID)
      If lof <> 0 
        If lof < length Or length = -1
          length = lof
        EndIf
        
        *buffer = AllocateMemory(length)
        If *buffer
          If ReadData(fileID, *buffer, length)
            result = detectTextEncodingInBuffer(*buffer, length, flags)
          EndIf
        EndIf
        
        FreeMemory(*buffer)
      Else
        result = #err_emptyFile
      EndIf
      
      CloseFile(fileID)
    EndIf
    
    ProcedureReturn result
  EndProcedure
  
  Procedure detectTextEncodingOfBuffers(Array *buffers(1), Array results(1), length, flags = 0)
    Protected bufferCount = ArraySize(*buffers()), i
    Dim results(bufferCount)
    
    For i = 0 To bufferCount
      If *buffers(i)
        results(i) = detectTextEncodingInBuffer(*buffers(i), length, flags)
      EndIf
    Next
  EndProcedure
  
  Procedure detectTextEncodingOfFiles(Array filenames.s(1), Array results(1), length = -1, flags = 0)
    Protected fileCount = ArraySize(filenames()), i
    Dim results(fileCount)
    
    For i = 0 To fileCount
      If filenames(i) <> ""
        results(i) = detectTextEncodingInFile(filenames(i), length, flags)
      EndIf
    Next
  EndProcedure
   
EndModule

@Edit: Remove some unneeded comments at the end of the file. :oops:
@Edit: Updated with modification to check for empty files and return a specific error result. Thanks to RASHAD for reminding me. I noticed this in my tests but hadn't yet addressed it. A related issue would be for files that are too short or an incorrect length. This would pertain to all encodings that require more than one byte per codepoint. I'll see what can be done.
Last edited by Demivec on Mon Dec 28, 2015 9:12 pm, edited 4 times in total.
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Detecting Text File Encoding without BOM

Post by Demivec »

Here is some demonstration code to test out the previous module include. I had some help from Amilcar Matos Pérez who helped layout the format of the code.

File EBODetector.pb:

Code: Select all

;{- Program header
;==Code Header Comment==============================
;        Name/title: EBODetector.pb
;   Executable name: EBODetector.exe
;           Version: 1.01
;            Author: Demivec
;     Collaborators: Amílcar Matos Pérez
;    Translation by: 
;       Create date: 24/Dec/2015
; Previous releases: 
;Most recent update: 
; Release date/hour: 
;  Operating system: Windows  [X]GUI
;  Compiler version: PureBasic 5.41 (x64)
;         Copyright: (C)2015 AMP All rights reserved.
;           License: Free to use/abuse/modify.
;                   // THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
;                   // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
;                   // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
;                   // ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
;                   // FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
;                   // DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
;                   // OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
;                   // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
;                   // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
;                   // OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
;                   // SUCH DAMAGE.
;         Libraries: 
;             Forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=64180&sid=ec0e244cfabf06876bfd82d3f709cc1c
;                    http://www.purebasic.fr/english/viewtopic.php?f=12&t=64385
;  Tested platforms: Windows
;       Explanation: Demonstration of how to use dte module to detect the text file encoding
;                    by examining the file content.
; ==================================================
; v1.1 Added detection of zero-length files and set them to generate an error result.
;.......10........20........30........40........50........60........70........80
;}
EnableExplicit
XIncludeFile "AutoDetectTextEncoding.pbi"

;{ Declare procedures
Declare.i SetCursorPosition(StringGadgetID, CursorPosition) ;credit to Shardik for this multi-platform code
Declare.i BrowseProcedure(Array filenames.s(1))
Declare.i ClearWindowDataEntryFields (Window_EBO)
Declare OpenWindow_EBO (x = 0, y = 0, width = 600, height = 440)
Declare.i Window_Events (Event_EBO)
;}

;{ Variable exposure stmts
Global Window_EBO

Global Frame_0, FilenameStr_EBO, BrowseButton_EBO
Global Frame_1, DetectButton_EBO, EnforceSurrogatesCheckBox_EBO
Global RestrictSurrogatesCheckBox_EBO, DisallowNullsCheckBox_EBO
Global Frame_2, ResultsEditor_EBO,ClipBoardButton_EBO, ClearButton_EBO
Global ExitButton_EBO
;}

Enumeration FormFont
  #Font_Window_EBO_0
EndEnumeration

LoadFont(#Font_Window_EBO_0,"Consolas", 10)

OpenWindow_EBO()
ClearWindowDataEntryFields(Window_EBO)

Define Event_EBO, Quit_EBO
Repeat
  Event_EBO = WaitWindowEvent()
  Quit_EBO  = Window_Events(Event_EBO)
Until Quit_EBO = 0

End 

Procedure.i Window_Events(Event_EBO)
  Static multipleFilesSelected = 0 ;contains number of selected files if >1 or zero
  Static PreviousContentsOfFilenameStr$ = ""
  Static Dim filenames.s(0)
  Protected results, flags, i, j, count, files$
  Dim encodingResults(0)
  
  Select Event_EBO
    Case #PB_Event_CloseWindow
      ProcedureReturn #False
      
    Case #PB_Event_GadgetDrop      ;{- Accept file list via a GadgetDrop
      If EventGadget() = FilenameStr_EBO
        files$ = EventDropFiles()
        count  = CountString(files$, Chr(10)) + 1
        Dim filenames.s(count - 1)
        For i = 1 To count
          filenames(i - 1) = StringField(files$, i, Chr(10))
        Next
        
        If count = 1
          SetGadgetText(FilenameStr_EBO, filenames(0))
          multipleFilesSelected = 0
        ElseIf count > 1
          multipleFilesSelected = count
          SetGadgetText(FilenameStr_EBO, "***** " + count + " Files Selected *****")
        EndIf
        
        SetActiveGadget(DetectButton_EBO)
      EndIf                            ;}
      
    Case #PB_Event_Gadget
      Select EventGadget()        
          
        Case ExitButton_EBO        ;{- Exit"
          ProcedureReturn #False      ;}        
          
        Case ClearButton_EBO       ;{- Clear Data Entry Fields
          ClearWindowDataEntryFields(Window_EBO) ;} 
          
        Case FilenameStr_EBO       ;{- Handle differences between multiple files and a single file
          Select EventType()
            Case #PB_EventType_Focus
              PreviousContentsOfFilenameStr$ = GetGadgetText(FilenameStr_EBO)
            Case #PB_EventType_Change
              If multipleFilesSelected
                files$ = GetGadgetText(FilenameStr_EBO)
                i = 1
                j = 1
                While Left(files$, i) = Left(PreviousContentsOfFilenameStr$, i)
                  i + 1
                Wend
                
                While Right(files$, j) = Right(PreviousContentsOfFilenameStr$, j)
                  j + 1
                Wend
                
                SetGadgetText(FilenameStr_EBO, Mid(files$, i, Len(files$) - j - i + 2))
                SetCursorPosition(FilenameStr_EBO, j - i + 2) ;set cursor after text
              EndIf
              
              PreviousContentsOfFilenameStr$ = GetGadgetText(FilenameStr_EBO)
              
              multipleFilesSelected = 0
              Dim filenames.s(0)    ;zero contents
              Dim encodingResults(0)
          EndSelect                   ;} 
          
        Case BrowseButton_EBO      ;{- Browse for file to test.
          results = BrowseProcedure(filenames())
          If results = 1
            multipleFilesSelected = 0
            SetGadgetText(FilenameStr_EBO, filenames(0))
          ElseIf results > 1
            multipleFilesSelected = results
            SetGadgetText(FilenameStr_EBO, "***** " + results + " Files Selected *****")
          EndIf                       ;}  
          
        Case ClipBoardButton_EBO   ;{- Copy the results to the clipboard.
          SetClipboardText(GetGadgetText(ResultsEditor_EBO)) ;} 
          
        Case DetectButton_EBO      ;{- Detect Encoding and Byte Order
          flags = Bool(GetGadgetState(DisallowNullsCheckBox_EBO) = #PB_Checkbox_Checked) * dte::#disallowNulls +
                  Bool(GetGadgetState(RestrictSurrogatesCheckBox_EBO) = #PB_Checkbox_Checked) * dte::#restrictSurrogateCodesToUTF16 +
                  Bool(GetGadgetState(EnforceSurrogatesCheckBox_EBO) = #PB_Checkbox_Checked) * dte::#enforceProperSurrogateCodes
          
          If multipleFilesSelected
            dte::detectTextEncodingOfFiles(filenames(), encodingResults(), -1, flags)
            For i = 0 To multipleFilesSelected - 1
              AddGadgetItem(ResultsEditor_EBO, -1, filenames(i))
              AddGadgetItem(ResultsEditor_EBO, -1, "Encoding detection results are: '" + dte::textForDetectedStringFormat(encodingResults(i)) + "'.")
            Next
          Else
            AddGadgetItem(ResultsEditor_EBO, -1, GetGadgetText(FilenameStr_EBO))
            results = dte::detectTextEncodingInFile(GetGadgetText(FilenameStr_EBO), -1, flags)
            AddGadgetItem(ResultsEditor_EBO, -1, "Encoding detection results are: '" + dte::textForDetectedStringFormat(results) + "'.")
          EndIf                         ;}
          
      EndSelect
  EndSelect
  
  ProcedureReturn #True
EndProcedure

Procedure.i ClearWindowDataEntryFields(Window_EBO)
  ;{- Procedure explanation
  ; To blank the screen data entry fields.
  ;}
  SetGadgetText(FilenameStr_EBO  , #Null$)  
  SetGadgetText(ResultsEditor_EBO, #Null$)  
  SetActiveGadget(FilenameStr_EBO)
  
  ProcedureReturn #True
EndProcedure

Procedure.i BrowseProcedure(Array filenames.s(1))
  ;{- Procedure explanation
  ; To ease the file selection task.
  ;}
  Protected StandardFile$, Filename$, Pattern$, Pattern, i
  NewList file.s()
  
  StandardFile$ = "C:\"   ; set initial file+path to display
                          ; With next string we will set the search patterns ("|" as separator) for file displaying:
                          ;  1st: "Text (*.txt)" as name, ".txt" and ".bat" as allowed extension
                          ;  2nd: "PureBasic (*.pb)" as name, ".pb" and ".pbi" and ".pbf" as allowed extension
                          ;  3rd: "All files (*.*) as name, "*.*" as allowed extension, valid for all files
  Pattern$ = "Text (*.txt)|*.txt;*.bat|PureBasic (*.pb)|*.pb;*.pbi;*.pbf|All files (*.*)|*.*"
  Pattern  = 0    ; use the first of the three possible patterns as standard
  Filename$    = OpenFileRequester("Please choose file to test", StandardFile$, Pattern$, Pattern, #PB_Requester_MultiSelection)
  
  If Filename$ <> ""
    While Filename$
      AddElement(file())
      file() = Filename$
      Filename$ = NextSelectedFileName() 
    Wend 
    
    Dim filenames(ListSize(file()))
    ForEach file()
      filenames(i) = file()
      i + 1
    Next
  EndIf
    
  ProcedureReturn i ;number of files selected
EndProcedure

Procedure SetCursorPosition(StringGadgetID, CursorPosition) ;sets position of cursor in a StringGadget
  SetActiveGadget(StringGadgetID)

  CompilerSelect #PB_Compiler_OS
    CompilerCase #PB_OS_Windows
      SendMessage_(GadgetID(StringGadgetID), #EM_SETSEL, CursorPosition, CursorPosition)
    CompilerCase #PB_OS_Linux
      gtk_editable_set_position_(GadgetID(StringGadgetID), CursorPosition)
    CompilerCase #PB_OS_MacOS
      Protected Range.NSRange
      Protected TextView.I

      Range\location = CursorPosition
      Range\length = 0
      TextView = CocoaMessage(0, GadgetID(StringGadgetID), "currentEditor")

      If TextView
        CocoaMessage(0, TextView, "setSelectedRange:@", @Range)
      EndIf
  CompilerEndSelect
EndProcedure

Procedure OpenWindow_EBO(x = 0, y = 0, width = 600, height = 440)
  Window_EBO = OpenWindow(#PB_Any, x, y, width, height, "Encoding/Byte Order Detector of text files",
                          #PB_Window_SystemMenu | #PB_Window_MinimizeGadget | #PB_Window_ScreenCentered)
  
  CreateStatusBar(0, WindowID(Window_EBO))
  AddStatusBarField(150)
  StatusBarText(0, 0, "(c) 2015 Demivec")
  
  Frame_0 = FrameGadget(#PB_Any, 20, 20, 560, 50, "Filename") 
  FilenameStr_EBO   = StringGadget(#PB_Any, 30, 40 , 470, 20, #Null$)
  SetGadgetFont(FilenameStr_EBO  , FontID(#Font_Window_EBO_0)) 
  BrowseButton_EBO    = ButtonGadget(#PB_Any, 510,  35,  60, 30, "Browse")
  EnableGadgetDrop(FilenameStr_EBO, #PB_Drop_Files, #PB_Drag_Copy)
  GadgetToolTip(FilenameStr_EBO, "Enter, Select, or Drop files here to attempt encoding detection.")
  
  Frame_1 = FrameGadget(#PB_Any, 20, 80, 560, 90, "Detection Settings") 
  DisallowNullsCheckBox_EBO = CheckBoxGadget(#PB_Any, 100, 100, 200, 20, "Disallow Nulls")
  RestrictSurrogatesCheckBox_EBO = CheckBoxGadget(#PB_Any, 100, 120, 200, 20, "Restrict Surrogate Codes to UTF-16")
  EnforceSurrogatesCheckBox_EBO = CheckBoxGadget(#PB_Any, 100, 140, 200, 20, "Enforce Proper Surrogate Codes")
  DetectButton_EBO    = ButtonGadget(#PB_Any, 510, 110, 60, 30, "Detect")  
  
  Frame_2 = FrameGadget(#PB_Any, 20, 180, 560, 220, "Results" )  
  ResultsEditor_EBO = EditorGadget(#PB_Any, 30, 200, 470, 190)
  SetGadgetFont(ResultsEditor_EBO, FontID(#Font_Window_EBO_0))
  ClipBoardButton_EBO = ButtonGadget(#PB_Any, 510, 200,  60, 30, "Copy To Clipboard", #PB_Button_MultiLine)   
  ClearButton_EBO     = ButtonGadget(#PB_Any, 510, 240, 60, 30, "Clear" ) 
  ExitButton_EBO      = ButtonGadget(#PB_Any, 510, 360, 60 , 30, "Exit"  )
EndProcedure

I ran a hardcore test by scanning 15231 files totaling 231 Mbytes and it took about 4 minutes and used up 30% of my CPU the entire time. As a result I think the code could use some additional improvements such as perhaps creating threads if acting on a group of files (and maybe a progress bar :) ).
Last edited by Demivec on Sat Dec 26, 2015 12:22 pm, edited 1 time in total.
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Detecting Text File Encoding without BOM

Post by Demivec »

This space reserved.
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Detecting Text File Encoding without BOM

Post by Demivec »

This space reserved.
RASHAD
PureBasic Expert
PureBasic Expert
Posts: 4636
Joined: Sun Apr 12, 2009 6:27 am

Re: Detecting Text File Encoding without BOM

Post by RASHAD »

Hi Demivec and happy new year

It looks promising
1- Good interface design
2- Tested a little bit so far so good

But if the file size = 0 must be taking care of

Thanks for sharing
Egypt my love
Fred
Administrator
Administrator
Posts: 16618
Joined: Fri May 17, 2002 4:39 pm
Location: France
Contact:

Re: Detecting Text File Encoding without BOM

Post by Fred »

Nice work
User avatar
Keya
Addict
Addict
Posts: 1891
Joined: Thu Jun 04, 2015 7:10 am

Re: Detecting Text File Encoding without BOM

Post by Keya »

clever work Demi! :)
btw just to hijack your thread i have a related function to complement yours for detecting the BOM type when there is one - http://purebasic.fr/english/viewtopic.php?f=12&t=63341
have a great Christmas!
User avatar
Kwai chang caine
Always Here
Always Here
Posts: 5342
Joined: Sun Nov 05, 2006 11:42 pm
Location: Lyon - France

Re: Detecting Text File Encoding without BOM

Post by Kwai chang caine »

Cool and very usefull
Thanks for sharing 8)
ImageThe happiness is a road...
Not a destination
Little John
Addict
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Detecting Text File Encoding without BOM

Post by Little John »

Thank you so much for sharing, Demivec!
That's basic (no pun intended) and great. Image
I always wanted to have something like that.

In the module dte, it reads:

Code: Select all

Procedure.i detectTextEncodingInBuffer(*buffer, length, flags = 0)
  CallDebugger
Is this intended?
I got your demo code running flawlessly only after deactivating CallDebugger in the module.
User avatar
Demivec
Addict
Addict
Posts: 4086
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Detecting Text File Encoding without BOM

Post by Demivec »

Little John wrote:In the module dte, it reads:

Code:
Procedure.i detectTextEncodingInBuffer(*buffer, length, flags = 0)
  CallDebugger

Is this intended?
Well, no. I need to remove that. :oops:
User avatar
Andre
PureBasic Team
PureBasic Team
Posts: 2056
Joined: Fri Apr 25, 2003 6:14 pm
Location: Germany (Saxony, Deutscheinsiedel)
Contact:

Re: Detecting Text File Encoding without BOM

Post by Andre »

Even if I couldn't do further testing yet, it looks like a very useful code to me... :D
Bye,
...André
(PureBasicTeam::Docs & Support - PureArea.net | Order:: PureBasic | PureVisionXP)
User avatar
Dreamland Fantasy
Enthusiast
Enthusiast
Posts: 335
Joined: Fri Jun 11, 2004 9:35 pm
Location: Glasgow, UK
Contact:

Re: Detecting Text File Encoding without BOM

Post by Dreamland Fantasy »

Nice work Demivec, and exactly just what I was looking for! :D

Kind regards,

Francis
walbus
Addict
Addict
Posts: 929
Joined: Sat Mar 02, 2013 9:17 am

Re: Detecting Text File Encoding without BOM

Post by walbus »

A very helpfull code !
Post Reply