Detecting Text File Encoding without BOM
Detecting Text File Encoding without BOM
Here is a detection routine for detecting the encoding (unicode types or ASCII) and byte order for text files.
It may need some improvements. If there are any suggestions or comments I'll see what can be done to reasonably incorporate them.
The code does not detect codepages for extended ASCII character codes.
It will attempt to determine the same encoding types as those returned by PureBasic's ReadStringFormat() command which looks for a BOM at the beginning of a file.
I will post the code in the next two messages and leave a blank message for any additional things that may be needed to be posted at the beginning of the thread in the future.
Please test the code and let me know if certain things are useful or useless. It's good to know either way.
It may need some improvements. If there are any suggestions or comments I'll see what can be done to reasonably incorporate them.
The code does not detect codepages for extended ASCII character codes.
It will attempt to determine the same encoding types as those returned by PureBasic's ReadStringFormat() command which looks for a BOM at the beginning of a file.
I will post the code in the next two messages and leave a blank message for any additional things that may be needed to be posted at the beginning of the thread in the future.
Please test the code and let me know if certain things are useful or useless. It's good to know either way.
Re: Detecting Text File Encoding without BOM
First a module include called AutoDetectTextEncoding.pbi:
@Edit: Remove some unneeded comments at the end of the file.
@Edit: Updated with modification to check for empty files and return a specific error result. Thanks to RASHAD for reminding me. I noticed this in my tests but hadn't yet addressed it. A related issue would be for files that are too short or an incorrect length. This would pertain to all encodings that require more than one byte per codepoint. I'll see what can be done.
Code: Select all
;{- Program header
;==Code Header Comment==============================
; Name/title: AutoDetectTextEncoding.pbi
; Version: 1.0
; Author: Demivec
; Create date: 24/Dec/2015
; Operating system: Windows [X]GUI
; Compiler version: PureBasic 5.41 (x64)
; License: Free to use/abuse/modify.
; // THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
; // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
; // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
; // ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
; // FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
; // DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
; // OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
; // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
; // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
; // OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
; // SUCH DAMAGE.
; Forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=64180&sid=ec0e244cfabf06876bfd82d3f709cc1c
; http://www.purebasic.fr/english/viewtopic.php?f=12&t=64385
; Tested platforms: Windows
; Explanation: To detect the file encoding by examining the file content.
; Includes detection of Unicode (UTF-8, UTF-16, UTF-32) big and little endian encodings.
; UTF-16 also includes UCS-2. Does not include detection of code page for ASCII.
; ==================================================
;}
;module DetectTextEncoding
DeclareModule dte
EnableExplicit
;flags for use with detect encoding functions
EnumerationBinary
#disallowNulls ;don't allow Nulls to be in ASCII or a unicode encoding (UTF-8, UTF-16, and UTF-32)
#enforceProperSurrogateCodes ;surrogate codes need to be paired in a proper lead/tail fashion
#restrictSurrogateCodesToUTF16 ;allow surrogate codes only in UTF-16 encodings
EndEnumeration
;constants for custom decoding results, usually errors
#err_detectionProcessFailedToStart = -1 ;error given for memmory or filename errors in detection process
#err_emptyFile = -2 ;error given for an empty file
#encodingUndetermined = 0
#maxLength = $10000 ;64k is the maximum to read from a file for detection purposes
Declare.i isNativePureBasicFormat(type) ;Returns #True if 'type' is a name PureBasic format
Declare.s textForDetectedStringFormat(type) ;translates a detection result into text
;Each of the detection functions returns a constant for the type if successful, #encodingUndetermined if
;not successful, or #err_detectionProcessFailedToStart if an error was encountered.
Declare.i detectTextEncodingInBuffer(*buffer, length, flags = 0) ;returns type
Declare.i detectTextEncodingInFile(filename.s, length = -1, flags = 0) ;returns type
Declare detectTextEncodingOfBuffers(Array *buffers(1), Array results(1), length, flags = 0) ;returns results in an array
Declare detectTextEncodingOfFiles(Array filenames.s(1), Array results(1), length = -1, flags = 0) ;returns results in an array
EndDeclareModule
Module dte
Enumeration
#bigEndian = 0
#littleEndian = 1
EndEnumeration
Procedure isNativePureBasicFormat(type) ;Returns #True if 'type' is a name PureBasic format
Protected result
Select type
Case #PB_Ascii, #PB_Unicode, #PB_UTF8
result = #True
Default ;includes #PB_UTF16BE, #PB_UTF32, #PB_UTF32BE
result = #False
EndSelect
ProcedureReturn result
EndProcedure
Procedure.s textForDetectedStringFormat(type) ;translates a detection result into text
Protected result.s
Select type
Case #PB_Ascii
result = "ASCII"
Case #PB_Unicode
result = "Unicode"
Case #PB_UTF8
result = "UTF-8"
Case #PB_UTF16BE
result = "UTF-16BE"
Case #PB_UTF32
result = "UTF-32"
Case #PB_UTF32BE
result = "UTF-32BE"
Case #encodingUndetermined
result = "Undetermined"
Case #err_emptyFile
result = "File is empty"
Case #err_detectionProcessFailedToStart
result = "Detection process failed to start"
EndSelect
ProcedureReturn result
EndProcedure
Procedure countHigherValueIndex(Array s(1), Array n(1))
;increment the index of s() that corresponds to the element of n() that is the highest
If n(0) > n(1)
s(0) + 1
ElseIf n(1) > n(0)
s(1) + 1
EndIf
EndProcedure
Procedure.i detectTextEncodingInBuffer(*buffer, length, flags = 0)
;The return value will be one of the BOM for string format types listed
;in PureBasic's ReadStringFormat() function or it will return 0
;for an undetermined format.
Protected *bufferPtr.Ascii = *buffer, *endPtr = *buffer + length - 1
Protected isNotASCII, isNotUTF8, isNotUTF16, isNotUTF32
Protected remainingUTF8_ByteCount, nonASCII_UTF8_Count, workingByte
Protected byteParity_2, UTF8_surrogateCheckStatus
Protected byteParity_4, UTF16_commonCharacterCount, i, result
Protected disallowNulls = flags & #disallowNulls
Protected enforceSurrogates = flags & #enforceProperSurrogateCodes
Protected restrictSurrogates = flags & #restrictSurrogateCodesToUTF16
Dim UTF16_nullCount(1) ;count of nulls in high bytes for each endianness
Dim isUTF16_EndiannessEliminated(1) ;boolean values
Dim UTF16_surrogateEndianness(1) ;count of surrogates for UTF-16 endianness detection
Dim UTF16_leadSurrogateAddress(1) ;buffer address at which leadSurrogate was found for each endianness
Dim UTF16_codePoint.u(1) ;complete code point interpretation for each endianness
Dim UTF16_commonCharacterCount(1) ;count of common code points (i.e. for Space, Tab, CR, LF) for each endiannness
Dim UTF16_EndiannessStatisticsEval(2) ;sum of endianness probability, indexes {0,1} relate to each endianness
Dim UTF32_codePoint.l(1) ;complete code point interpretation for each endianness
Dim isUTF32_EndiannessEliminated(1) ;boolean values
Dim UTF32_surrogateEndianness(1) ;count of surrogates for UTF-32 endianness detection
Dim UTF32_leadSurrogateAddress(1) ;buffer address at which leadSurrogate was found for each endianness
If Not *buffer
ProcedureReturn #err_detectionProcessFailedToStart
EndIf
While *bufferPtr <= *endPtr
If isNotASCII = #False
If *bufferPtr\a = 0 And disallowNulls
isNotASCII = #True
EndIf
EndIf
If isNotUTF8 = #False
;This encoding is ruled out if a mismatch of surrogate pairs is found or a Null is found
;when it is not allowed.
;Invalid code points (according to Wikipedia)
;According to the UTF-8 definition (RFC 3629) the high and low surrogate
;halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode
;values, and their UTF-8 encoding should be treated as an invalid byte
;sequence.
;
;Whether an actual application should do this is debatable, as it makes it
;impossible to store invalid UTF-16 (that is, UTF-16 with unpaired
;surrogate halves) in a UTF-8 string. This is necessary to store unchecked
;UTF-16 such as Windows filenames as UTF-8. It is also incompatible with
;CESU-8 encoding (described below).
;
;Many programs added UTF-8 conversions for UCS-2 data and did not alter
;this UTF-8 conversion when UCS-2 was replaced with the surrogate-pair
;using UTF-16. In such programs each half of a UTF-16 surrogate pair is
;encoded as its own 3-byte UTF-8 encoding, resulting in 6-byte sequences
;rather than 4 bytes for characters outside the Basic Multilingual Plane.
;Oracle and MySQL databases use this, as well as Java and Tcl as described
;below, and probably many Windows programs where the programmers were
;unaware of the complexities of UTF-16. Although this non-optimal encoding
;is generally not deliberate, a supposed benefit is that it preserves
;UTF-16 binary sorting order when CESU-8 is binary sorted.
Select *bufferPtr\a & %11000000
Case %00000000 ;ASCII byte
If remainingUTF8_ByteCount > 0 ;in the middle of a multi-byte code point
isNotUTF8 = #True
remainingUTF8_ByteCount = 0 ;start over
EndIf
If *bufferPtr\a = 0 And disallowNulls
isNotUTF8 = #True
EndIf
Case %10000000 ;continuation byte of a multi-byte code point
If remainingUTF8_ByteCount = 0
isNotUTF8 = #True ;error, not looking for any more bytes of a multi-byte code point
Else
remainingUTF8_ByteCount - 1
If enforceSurrogates
If remainingUTF8_ByteCount = 1 And UTF8_surrogateCheckStatus <> 0
;compare second byte of multi-byte code point to check lead/tail ordering of surrogates
Select *bufferPtr\a & %10110000
Case %10100000 ;lead surrogate
If UTF8_surrogateCheckStatus <> 1
isNotUTF8 = #True ;error, tail surrogate should come after previous lead surrogate
Else
UTF8_surrogateCheckStatus = 2
EndIf
Case %10110000 ;tail surrogate
If UTF8_surrogateCheckStatus <> 3
isNotUTF8 = #True ;error, tail surrogate came before lead surrogate
Else
UTF8_surrogateCheckStatus = 0 ;proper pairing, reset
EndIf
EndSelect
EndIf
EndIf
If remainingUTF8_ByteCount = 0
nonASCII_UTF8_Count + 1
EndIf
EndIf
Case %11000000 ;start byte of multi-byte code point (1 - 3 more bytes)
If remainingUTF8_ByteCount
isNotUTF8 = #True ;error, still looking for more bytes of a multi-byte code point
Else
;calculate number of remaining bytes
workingByte = *bufferPtr\a << 1
remainingUTF8_ByteCount = 0
While workingByte & %10000000 > 0
remainingUTF8_ByteCount + 1
workingByte << 1
Wend
;reject UTF-8 sequences (>4 bytes) and also ones that carry no payload in the first byte
If remainingUTF8_ByteCount > 3 Or workingByte = 0
isNotUTF8 = #True ;error for an overlong sequence
EndIf
If *bufferPtr\a = $ED ;byte starts a sequence for a surrogate encoding
If restrictSurrogates
isNotUTF8 = #True ;error, invalid
EndIf
UTF8_surrogateCheckStatus + 1 ;advance status to check next byte for lead/tail code
EndIf
EndIf
EndSelect
EndIf
If isNotUTF16 = #False
byteParity_2 = (*bufferPtr - *buffer) % 2 ;byte parity of the offset will equal (1) on a completed code point boundary
If *bufferPtr\a = 0
UTF16_nullCount(byteParity_2) + 1
EndIf
;build complete code points to test
UTF16_codePoint(#bigEndian) = UTF16_codePoint(#bigEndian) << 8 + *bufferPtr\a
UTF16_codePoint(#littleEndian) = UTF16_codePoint(#littleEndian) >> 8 + *bufferPtr\a << 8
If byteParity_2 = 1 ;we're on a code point boundary (byte parity of the offset is 'odd')
For i = #bigEndian To #littleEndian
If isUTF16_EndiannessEliminated(i) = #False
Select UTF16_codePoint(i)
Case $0020, $000A, $000D, $0009 ;space, LF, CR, Tab
UTF16_commonCharacterCount(i) + 1
Case $0000 ;#Null
If disallowNulls
isNotUTF16 = #True
EndIf
Case $D800 To $DBFF ;lead surrogate
If enforceSurrogates
If UTF16_leadSurrogateAddress(i)
isUTF16_EndiannessEliminated(i) = #True
If isUTF16_EndiannessEliminated(i ! 1)
isNotUTF16 = #True
EndIf
EndIf
EndIf
Case $DC00 To $DFFF ;tail surrogate
If enforceSurrogates
If UTF16_leadSurrogateAddress(i) = *bufferPtr - SizeOf(Unicode)
UTF16_surrogateEndianness(i) + 1
UTF16_leadSurrogateAddress(i) = 0 ;reset value
Else
isUTF16_EndiannessEliminated(i) = #True
If isUTF16_EndiannessEliminated(i ! 1)
isNotUTF16 = #True
EndIf
EndIf
EndIf
EndSelect
EndIf
Next
EndIf
EndIf
If isNotUTF32 = #False
;Because surrogate code points are not included in the set of Unicode scalar values,
;UTF-32 code units in the range $0000D800..$0000DFFF are ill-formed.
;Any UTF-32 code unit greater than $0010FFFF is ill-formed.
;
;UTF-32 is forbidden from storing the non-character code points that are illegal for
;interchange, such as 0xFFFF, 0xFFFE, and the all the surrogates.
;UTF is a transport encoding, not an internal one.
;
;According to stackoverflow.com:
;But UTF-32 is easy to detect even without a BOM. This is because the
;Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units
;always have the pattern 00 {0x|10} xx xx (for BE) or xx xx {0x|10} 00 (for
;LE). If the data has a length that's a multiple of 4, and follows one of
;these patterns, you can safely assume it's UTF-32. False positives are
;nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
byteParity_4 = (*bufferPtr - *buffer) % 4 ;byte parity of the offset will be 3 on a completed code point boundary
Select byteParity_4
Case 0
If *bufferPtr\a <> 0 And isUTF32_EndiannessEliminated(#bigEndian) = #False
isUTF32_EndiannessEliminated(#bigEndian) = #True
If isUTF32_EndiannessEliminated(#littleEndian)
isNotUTF32 = #True
EndIf
EndIf
Case 1
If *bufferPtr\a > $10 And isUTF32_EndiannessEliminated(#bigEndian) = #False
isUTF32_EndiannessEliminated(#bigEndian) = #True
If isUTF32_EndiannessEliminated(#littleEndian)
isNotUTF32 = #True
EndIf
EndIf
Case 2
If *bufferPtr\a > $10 And isUTF32_EndiannessEliminated(#littleEndian) = #False
isUTF32_EndiannessEliminated(#littleEndian) = #True
If isUTF32_EndiannessEliminated(#bigEndian)
isNotUTF32 = #True
EndIf
EndIf
Case 3
If *bufferPtr\a <> 0 And isUTF32_EndiannessEliminated(#littleEndian) = #False
isUTF32_EndiannessEliminated(#littleEndian) = #True
If isUTF32_EndiannessEliminated(#bigEndian)
isNotUTF32 = #True
EndIf
EndIf
EndSelect
;build complete code points to test
UTF32_codePoint(#bigEndian) = UTF32_codePoint(#bigEndian) << 8 + *bufferPtr\a
UTF32_codePoint(#littleEndian) = (UTF32_codePoint(#littleEndian) >> 8) & $00FFFFFF
UTF32_codePoint(#littleEndian) + *bufferPtr\a << 24
If byteParity_4 = 3 ;we're on a code point boundary
For i = #bigEndian To #littleEndian
If isUTF32_EndiannessEliminated(i) = #False
Select UTF32_codePoint(i)
Case $00000000 ;#Null
If disallowNulls
isNotUTF32 = #True
EndIf
Case $0000D800 To $0000DFFF ;lead surrogate and tail surrogates for UTF16
If restrictSurrogates
isUTF32_EndiannessEliminated(i) = #True ;error, found a surrogate code
If isUTF32_EndiannessEliminated(i ! 1)
isNotUTF32 = #True
EndIf
ElseIf enforceSurrogates
;need to test for lead/tail ordering and unmatched pairs
If UTF32_codePoint(i) <$0000DC00 ;lead surrogate
If UTF32_leadSurrogateAddress(i)
isUTF32_EndiannessEliminated(i) = #True
If isUTF32_EndiannessEliminated(i ! 1)
isNotUTF32 = #True
EndIf
EndIf
Else ;tail surrogate
If UTF32_leadSurrogateAddress(i) = *bufferPtr - SizeOf(Unicode)
UTF32_surrogateEndianness(i) + 1
UTF32_leadSurrogateAddress(i) = 0 ;reset value
Else
isUTF32_EndiannessEliminated(i) = #True
If isUTF32_EndiannessEliminated(i ! 1)
isNotUTF32 = #True
EndIf
EndIf
EndIf
EndIf
Case $0000FFFE To $0000FFFF ;forbidden non-character code points
isUTF32_EndiannessEliminated(i) = #True
If isUTF32_EndiannessEliminated(i ! 1)
isNotUTF32 = #True
EndIf
EndSelect
EndIf
Next
EndIf
EndIf
*bufferPtr + SizeOf(Ascii)
Wend
;Examine statics and determine most likely encoding type.
;Detection order is UTF-8, UTF-32 (both endians), UTF-16 (both endians), ASCII, else undetermined (a.k.a binary).
countHigherValueIndex(UTF16_EndiannessStatisticsEval(), UTF16_nullCount())
countHigherValueIndex(UTF16_EndiannessStatisticsEval(), UTF16_commonCharacterCount())
countHigherValueIndex(UTF16_EndiannessStatisticsEval(), UTF16_surrogateEndianness())
If isNotUTF8 = #False And nonASCII_UTF8_Count > 0
result = #PB_UTF8
ElseIf isNotUTF32 = #False
If isUTF32_EndiannessEliminated(#bigEndian) = #False
result = #PB_UTF32BE ;default
Else
result = #PB_UTF32 ;little endian
EndIf
ElseIf isNotUTF16 = #False And (UTF16_EndiannessStatisticsEval(#bigEndian) + UTF16_EndiannessStatisticsEval(#littleEndian)) > 0; And (UTF16_commonCharacterCount / length) > 0.05 ;percentage is arbitrary
If isUTF16_EndiannessEliminated(#bigEndian) = #True
result = #PB_Unicode
ElseIf isUTF16_EndiannessEliminated(#littleEndian) = #True
result = #PB_UTF16BE
Else
;still undecided, examine statistics to settle it
If UTF16_EndiannessStatisticsEval(#bigEndian) < UTF16_EndiannessStatisticsEval(#littleEndian)
result = #PB_Unicode
Else
result = #PB_UTF16BE ;default
EndIf
EndIf
ElseIf isNotASCII = #False
result = #PB_Ascii
Else
result = #encodingUndetermined ;is either binary type (because of detected nulls) or is non-determined
EndIf
ProcedureReturn result
EndProcedure
Procedure.i detectTextEncodingInFile(filename.s, length = -1, flags = 0)
;If detection process was not able to start successfully,
;the value -1 will be returned. If detection was attempted
;the return value will be the same as those for detectTextEncodingInBuffer().
Protected *buffer, result, fileID, lof
result = #err_detectionProcessFailedToStart ;this will be changed if successful
fileID = ReadFile(#PB_Any, filename)
If fileID
lof = Lof(fileID)
If lof <> 0
If lof < length Or length = -1
length = lof
EndIf
*buffer = AllocateMemory(length)
If *buffer
If ReadData(fileID, *buffer, length)
result = detectTextEncodingInBuffer(*buffer, length, flags)
EndIf
EndIf
FreeMemory(*buffer)
Else
result = #err_emptyFile
EndIf
CloseFile(fileID)
EndIf
ProcedureReturn result
EndProcedure
Procedure detectTextEncodingOfBuffers(Array *buffers(1), Array results(1), length, flags = 0)
Protected bufferCount = ArraySize(*buffers()), i
Dim results(bufferCount)
For i = 0 To bufferCount
If *buffers(i)
results(i) = detectTextEncodingInBuffer(*buffers(i), length, flags)
EndIf
Next
EndProcedure
Procedure detectTextEncodingOfFiles(Array filenames.s(1), Array results(1), length = -1, flags = 0)
Protected fileCount = ArraySize(filenames()), i
Dim results(fileCount)
For i = 0 To fileCount
If filenames(i) <> ""
results(i) = detectTextEncodingInFile(filenames(i), length, flags)
EndIf
Next
EndProcedure
EndModule
@Edit: Remove some unneeded comments at the end of the file.
@Edit: Updated with modification to check for empty files and return a specific error result. Thanks to RASHAD for reminding me. I noticed this in my tests but hadn't yet addressed it. A related issue would be for files that are too short or an incorrect length. This would pertain to all encodings that require more than one byte per codepoint. I'll see what can be done.
Last edited by Demivec on Mon Dec 28, 2015 9:12 pm, edited 4 times in total.
Re: Detecting Text File Encoding without BOM
Here is some demonstration code to test out the previous module include. I had some help from Amilcar Matos Pérez who helped layout the format of the code.
File EBODetector.pb:
I ran a hardcore test by scanning 15231 files totaling 231 Mbytes and it took about 4 minutes and used up 30% of my CPU the entire time. As a result I think the code could use some additional improvements such as perhaps creating threads if acting on a group of files (and maybe a progress bar ).
File EBODetector.pb:
Code: Select all
;{- Program header
;==Code Header Comment==============================
; Name/title: EBODetector.pb
; Executable name: EBODetector.exe
; Version: 1.01
; Author: Demivec
; Collaborators: Amílcar Matos Pérez
; Translation by:
; Create date: 24/Dec/2015
; Previous releases:
;Most recent update:
; Release date/hour:
; Operating system: Windows [X]GUI
; Compiler version: PureBasic 5.41 (x64)
; Copyright: (C)2015 AMP All rights reserved.
; License: Free to use/abuse/modify.
; // THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
; // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
; // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
; // ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
; // FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
; // DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
; // OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
; // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
; // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
; // OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
; // SUCH DAMAGE.
; Libraries:
; Forum: http://www.purebasic.fr/english/viewtopic.php?f=13&t=64180&sid=ec0e244cfabf06876bfd82d3f709cc1c
; http://www.purebasic.fr/english/viewtopic.php?f=12&t=64385
; Tested platforms: Windows
; Explanation: Demonstration of how to use dte module to detect the text file encoding
; by examining the file content.
; ==================================================
; v1.1 Added detection of zero-length files and set them to generate an error result.
;.......10........20........30........40........50........60........70........80
;}
EnableExplicit
XIncludeFile "AutoDetectTextEncoding.pbi"
;{ Declare procedures
Declare.i SetCursorPosition(StringGadgetID, CursorPosition) ;credit to Shardik for this multi-platform code
Declare.i BrowseProcedure(Array filenames.s(1))
Declare.i ClearWindowDataEntryFields (Window_EBO)
Declare OpenWindow_EBO (x = 0, y = 0, width = 600, height = 440)
Declare.i Window_Events (Event_EBO)
;}
;{ Variable exposure stmts
Global Window_EBO
Global Frame_0, FilenameStr_EBO, BrowseButton_EBO
Global Frame_1, DetectButton_EBO, EnforceSurrogatesCheckBox_EBO
Global RestrictSurrogatesCheckBox_EBO, DisallowNullsCheckBox_EBO
Global Frame_2, ResultsEditor_EBO,ClipBoardButton_EBO, ClearButton_EBO
Global ExitButton_EBO
;}
Enumeration FormFont
#Font_Window_EBO_0
EndEnumeration
LoadFont(#Font_Window_EBO_0,"Consolas", 10)
OpenWindow_EBO()
ClearWindowDataEntryFields(Window_EBO)
Define Event_EBO, Quit_EBO
Repeat
Event_EBO = WaitWindowEvent()
Quit_EBO = Window_Events(Event_EBO)
Until Quit_EBO = 0
End
Procedure.i Window_Events(Event_EBO)
Static multipleFilesSelected = 0 ;contains number of selected files if >1 or zero
Static PreviousContentsOfFilenameStr$ = ""
Static Dim filenames.s(0)
Protected results, flags, i, j, count, files$
Dim encodingResults(0)
Select Event_EBO
Case #PB_Event_CloseWindow
ProcedureReturn #False
Case #PB_Event_GadgetDrop ;{- Accept file list via a GadgetDrop
If EventGadget() = FilenameStr_EBO
files$ = EventDropFiles()
count = CountString(files$, Chr(10)) + 1
Dim filenames.s(count - 1)
For i = 1 To count
filenames(i - 1) = StringField(files$, i, Chr(10))
Next
If count = 1
SetGadgetText(FilenameStr_EBO, filenames(0))
multipleFilesSelected = 0
ElseIf count > 1
multipleFilesSelected = count
SetGadgetText(FilenameStr_EBO, "***** " + count + " Files Selected *****")
EndIf
SetActiveGadget(DetectButton_EBO)
EndIf ;}
Case #PB_Event_Gadget
Select EventGadget()
Case ExitButton_EBO ;{- Exit"
ProcedureReturn #False ;}
Case ClearButton_EBO ;{- Clear Data Entry Fields
ClearWindowDataEntryFields(Window_EBO) ;}
Case FilenameStr_EBO ;{- Handle differences between multiple files and a single file
Select EventType()
Case #PB_EventType_Focus
PreviousContentsOfFilenameStr$ = GetGadgetText(FilenameStr_EBO)
Case #PB_EventType_Change
If multipleFilesSelected
files$ = GetGadgetText(FilenameStr_EBO)
i = 1
j = 1
While Left(files$, i) = Left(PreviousContentsOfFilenameStr$, i)
i + 1
Wend
While Right(files$, j) = Right(PreviousContentsOfFilenameStr$, j)
j + 1
Wend
SetGadgetText(FilenameStr_EBO, Mid(files$, i, Len(files$) - j - i + 2))
SetCursorPosition(FilenameStr_EBO, j - i + 2) ;set cursor after text
EndIf
PreviousContentsOfFilenameStr$ = GetGadgetText(FilenameStr_EBO)
multipleFilesSelected = 0
Dim filenames.s(0) ;zero contents
Dim encodingResults(0)
EndSelect ;}
Case BrowseButton_EBO ;{- Browse for file to test.
results = BrowseProcedure(filenames())
If results = 1
multipleFilesSelected = 0
SetGadgetText(FilenameStr_EBO, filenames(0))
ElseIf results > 1
multipleFilesSelected = results
SetGadgetText(FilenameStr_EBO, "***** " + results + " Files Selected *****")
EndIf ;}
Case ClipBoardButton_EBO ;{- Copy the results to the clipboard.
SetClipboardText(GetGadgetText(ResultsEditor_EBO)) ;}
Case DetectButton_EBO ;{- Detect Encoding and Byte Order
flags = Bool(GetGadgetState(DisallowNullsCheckBox_EBO) = #PB_Checkbox_Checked) * dte::#disallowNulls +
Bool(GetGadgetState(RestrictSurrogatesCheckBox_EBO) = #PB_Checkbox_Checked) * dte::#restrictSurrogateCodesToUTF16 +
Bool(GetGadgetState(EnforceSurrogatesCheckBox_EBO) = #PB_Checkbox_Checked) * dte::#enforceProperSurrogateCodes
If multipleFilesSelected
dte::detectTextEncodingOfFiles(filenames(), encodingResults(), -1, flags)
For i = 0 To multipleFilesSelected - 1
AddGadgetItem(ResultsEditor_EBO, -1, filenames(i))
AddGadgetItem(ResultsEditor_EBO, -1, "Encoding detection results are: '" + dte::textForDetectedStringFormat(encodingResults(i)) + "'.")
Next
Else
AddGadgetItem(ResultsEditor_EBO, -1, GetGadgetText(FilenameStr_EBO))
results = dte::detectTextEncodingInFile(GetGadgetText(FilenameStr_EBO), -1, flags)
AddGadgetItem(ResultsEditor_EBO, -1, "Encoding detection results are: '" + dte::textForDetectedStringFormat(results) + "'.")
EndIf ;}
EndSelect
EndSelect
ProcedureReturn #True
EndProcedure
Procedure.i ClearWindowDataEntryFields(Window_EBO)
;{- Procedure explanation
; To blank the screen data entry fields.
;}
SetGadgetText(FilenameStr_EBO , #Null$)
SetGadgetText(ResultsEditor_EBO, #Null$)
SetActiveGadget(FilenameStr_EBO)
ProcedureReturn #True
EndProcedure
Procedure.i BrowseProcedure(Array filenames.s(1))
;{- Procedure explanation
; To ease the file selection task.
;}
Protected StandardFile$, Filename$, Pattern$, Pattern, i
NewList file.s()
StandardFile$ = "C:\" ; set initial file+path to display
; With next string we will set the search patterns ("|" as separator) for file displaying:
; 1st: "Text (*.txt)" as name, ".txt" and ".bat" as allowed extension
; 2nd: "PureBasic (*.pb)" as name, ".pb" and ".pbi" and ".pbf" as allowed extension
; 3rd: "All files (*.*) as name, "*.*" as allowed extension, valid for all files
Pattern$ = "Text (*.txt)|*.txt;*.bat|PureBasic (*.pb)|*.pb;*.pbi;*.pbf|All files (*.*)|*.*"
Pattern = 0 ; use the first of the three possible patterns as standard
Filename$ = OpenFileRequester("Please choose file to test", StandardFile$, Pattern$, Pattern, #PB_Requester_MultiSelection)
If Filename$ <> ""
While Filename$
AddElement(file())
file() = Filename$
Filename$ = NextSelectedFileName()
Wend
Dim filenames(ListSize(file()))
ForEach file()
filenames(i) = file()
i + 1
Next
EndIf
ProcedureReturn i ;number of files selected
EndProcedure
Procedure SetCursorPosition(StringGadgetID, CursorPosition) ;sets position of cursor in a StringGadget
SetActiveGadget(StringGadgetID)
CompilerSelect #PB_Compiler_OS
CompilerCase #PB_OS_Windows
SendMessage_(GadgetID(StringGadgetID), #EM_SETSEL, CursorPosition, CursorPosition)
CompilerCase #PB_OS_Linux
gtk_editable_set_position_(GadgetID(StringGadgetID), CursorPosition)
CompilerCase #PB_OS_MacOS
Protected Range.NSRange
Protected TextView.I
Range\location = CursorPosition
Range\length = 0
TextView = CocoaMessage(0, GadgetID(StringGadgetID), "currentEditor")
If TextView
CocoaMessage(0, TextView, "setSelectedRange:@", @Range)
EndIf
CompilerEndSelect
EndProcedure
Procedure OpenWindow_EBO(x = 0, y = 0, width = 600, height = 440)
Window_EBO = OpenWindow(#PB_Any, x, y, width, height, "Encoding/Byte Order Detector of text files",
#PB_Window_SystemMenu | #PB_Window_MinimizeGadget | #PB_Window_ScreenCentered)
CreateStatusBar(0, WindowID(Window_EBO))
AddStatusBarField(150)
StatusBarText(0, 0, "(c) 2015 Demivec")
Frame_0 = FrameGadget(#PB_Any, 20, 20, 560, 50, "Filename")
FilenameStr_EBO = StringGadget(#PB_Any, 30, 40 , 470, 20, #Null$)
SetGadgetFont(FilenameStr_EBO , FontID(#Font_Window_EBO_0))
BrowseButton_EBO = ButtonGadget(#PB_Any, 510, 35, 60, 30, "Browse")
EnableGadgetDrop(FilenameStr_EBO, #PB_Drop_Files, #PB_Drag_Copy)
GadgetToolTip(FilenameStr_EBO, "Enter, Select, or Drop files here to attempt encoding detection.")
Frame_1 = FrameGadget(#PB_Any, 20, 80, 560, 90, "Detection Settings")
DisallowNullsCheckBox_EBO = CheckBoxGadget(#PB_Any, 100, 100, 200, 20, "Disallow Nulls")
RestrictSurrogatesCheckBox_EBO = CheckBoxGadget(#PB_Any, 100, 120, 200, 20, "Restrict Surrogate Codes to UTF-16")
EnforceSurrogatesCheckBox_EBO = CheckBoxGadget(#PB_Any, 100, 140, 200, 20, "Enforce Proper Surrogate Codes")
DetectButton_EBO = ButtonGadget(#PB_Any, 510, 110, 60, 30, "Detect")
Frame_2 = FrameGadget(#PB_Any, 20, 180, 560, 220, "Results" )
ResultsEditor_EBO = EditorGadget(#PB_Any, 30, 200, 470, 190)
SetGadgetFont(ResultsEditor_EBO, FontID(#Font_Window_EBO_0))
ClipBoardButton_EBO = ButtonGadget(#PB_Any, 510, 200, 60, 30, "Copy To Clipboard", #PB_Button_MultiLine)
ClearButton_EBO = ButtonGadget(#PB_Any, 510, 240, 60, 30, "Clear" )
ExitButton_EBO = ButtonGadget(#PB_Any, 510, 360, 60 , 30, "Exit" )
EndProcedure
I ran a hardcore test by scanning 15231 files totaling 231 Mbytes and it took about 4 minutes and used up 30% of my CPU the entire time. As a result I think the code could use some additional improvements such as perhaps creating threads if acting on a group of files (and maybe a progress bar ).
Last edited by Demivec on Sat Dec 26, 2015 12:22 pm, edited 1 time in total.
Re: Detecting Text File Encoding without BOM
This space reserved.
Re: Detecting Text File Encoding without BOM
This space reserved.
Re: Detecting Text File Encoding without BOM
Hi Demivec and happy new year
It looks promising
1- Good interface design
2- Tested a little bit so far so good
But if the file size = 0 must be taking care of
Thanks for sharing
It looks promising
1- Good interface design
2- Tested a little bit so far so good
But if the file size = 0 must be taking care of
Thanks for sharing
Egypt my love
Re: Detecting Text File Encoding without BOM
clever work Demi!
btw just to hijack your thread i have a related function to complement yours for detecting the BOM type when there is one - http://purebasic.fr/english/viewtopic.php?f=12&t=63341
have a great Christmas!
btw just to hijack your thread i have a related function to complement yours for detecting the BOM type when there is one - http://purebasic.fr/english/viewtopic.php?f=12&t=63341
have a great Christmas!
- Kwai chang caine
- Always Here
- Posts: 5342
- Joined: Sun Nov 05, 2006 11:42 pm
- Location: Lyon - France
Re: Detecting Text File Encoding without BOM
Cool and very usefull
Thanks for sharing
Thanks for sharing
The happiness is a road...
Not a destination
Not a destination
-
- Addict
- Posts: 4519
- Joined: Thu Jun 07, 2007 3:25 pm
- Location: Berlin, Germany
Re: Detecting Text File Encoding without BOM
Thank you so much for sharing, Demivec!
That's basic (no pun intended) and great.
I always wanted to have something like that.
In the module dte, it reads:
Is this intended?
I got your demo code running flawlessly only after deactivating CallDebugger in the module.
That's basic (no pun intended) and great.
I always wanted to have something like that.
In the module dte, it reads:
Code: Select all
Procedure.i detectTextEncodingInBuffer(*buffer, length, flags = 0)
CallDebugger
I got your demo code running flawlessly only after deactivating CallDebugger in the module.
Re: Detecting Text File Encoding without BOM
Well, no. I need to remove that.Little John wrote:In the module dte, it reads:
Code:
Procedure.i detectTextEncodingInBuffer(*buffer, length, flags = 0)
CallDebugger
Is this intended?
- Andre
- PureBasic Team
- Posts: 2056
- Joined: Fri Apr 25, 2003 6:14 pm
- Location: Germany (Saxony, Deutscheinsiedel)
- Contact:
Re: Detecting Text File Encoding without BOM
Even if I couldn't do further testing yet, it looks like a very useful code to me...
- Dreamland Fantasy
- Enthusiast
- Posts: 335
- Joined: Fri Jun 11, 2004 9:35 pm
- Location: Glasgow, UK
- Contact:
Re: Detecting Text File Encoding without BOM
Nice work Demivec, and exactly just what I was looking for!
Kind regards,
Francis
Kind regards,
Francis