[Tip] Testing if a file is text or binary

PB
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Post by PB »

Here's my tip for testing if a file is text or binary. See comments in code.

BTW, I know about the ReadStringFormat command and the #PB_Ascii
result, but as the docs say: that isn't 100% foolproof or a file standard.

Also, don't forget that a file called test.txt with a 0 byte size is not really
a text file, so there is no bug with this procedure on such files. ;)

[Edit] See my enhanced version further down which doesn't use
AllocateMemory, PeekS and FileSeek. Thanks to AND51 for that!

Code:

; IsTextFile by PB. Free for any use without credit needed. :)

; Reads the first 128 bytes of file$ to see if it's text or binary.
; If it contains Chr(0) then it's considered binary, as text files
; don't have it. Works fine with files less than 128 bytes in size.
; Probably not 100% foolproof but haven't had a false positive yet.
; Example below is for Windows, but the procedure is cross-platform.

; Return values: -1 if can't get a result, 0 for binary, 1 for text.

Procedure IsTextFile(file$)
  v=-1
  If ReadFile(0,file$)
    m=AllocateMemory(129) ; 128 bytes + 1 zero byte, so PeekS always finds a terminator
    If m
      v=0 : ReadData(0,m,128) : m$=PeekS(m) : FreeMemory(m)
      FileSeek(0,0) : If m$<>ReadString(0) : v=1 : EndIf
    EndIf
    CloseFile(0)
  EndIf
  ProcedureReturn v
EndProcedure

windir$=Space(999) : GetWindowsDirectory_(windir$,999)
If Right(windir$,1)<>"\" : windir$+"\" : EndIf

If ExamineDirectory(0,windir$,"*.*")
  While NextDirectoryEntry(0)
    type=DirectoryEntryType(0)
    name$=DirectoryEntryName(0)
    If type=#PB_DirectoryEntry_File
      Debug name$+" = "+Str(IsTextFile(windir$+name$))
    EndIf
  Wend
  FinishDirectory(0)
EndIf
Last edited by PB on Fri Jan 18, 2008 1:05 pm, edited 2 times in total.
I compile using 5.31 (x86) on Win 7 Ultimate (64-bit).
"PureBasic won't be object oriented, period" - Fred.
AND51
Addict
Posts: 1040
Joined: Sun Oct 15, 2006 8:56 pm
Location: Germany
Contact:

Post by AND51 »

First, get used to #PB_Any for your file IDs...
Your strict use of 0 as the file ID may cause another open file (one that was previously assigned 0 as its file ID) to be closed automatically.
Code example:

Code:

CreateFile(1, "test.txt")
   WriteString(1, "Test")
ReadFile(1, "test.txt") ; no need to close the created file. This is being done automatically, because of the same ID
   Debug ReadString(1)

; Program Ends - file is closed automatically
Reusing an ID to automatically free an old resource and create a new one (this applies to files, images, sprites...) may save one line of code, but it can also disrupt program execution.


Second, please make your code EnableExplicit-compatible, so protect your variables, etc. :wink:
PB 4.30

Code: Select all

onErrorGoto(?Fred)

Post by AND51 »

Much faster and more comfortable:

Code:

Procedure myIsTextFile(file$, bytesToCheck=128)
	Protected file=ReadFile(#PB_Any, file$), result
	If file
		If Lof(file) > 0 ; empty files stay at 0, like binary files
			result=1
		EndIf
		While Not Eof(file) And Loc(file) < bytesToCheck
			If Not ReadCharacter(file) ; a zero character marks the file as binary
				result=0
				Break
			EndIf
		Wend
		CloseFile(file)
		ProcedureReturn result
	EndIf
	ProcedureReturn -1 ; file could not be opened
EndProcedure

windir$=Space(999) : GetWindowsDirectory_(windir$,999) 
If Right(windir$,1)<>"\" : windir$+"\" : EndIf 

If ExamineDirectory(0,windir$,"*.*") 
  While NextDirectoryEntry(0) 
    type=DirectoryEntryType(0) 
    name$=DirectoryEntryName(0) 
    If type=#PB_DirectoryEntry_File 
      Debug name$+" = "+Str(myIsTextFile(windir$+name$)) 
    EndIf 
  Wend 
  FinishDirectory(0) 
EndIf 
  • Return values:
    • -1 = file cannot be read
    • 0 = file is binary or empty
    • 1 = file is a text file
  • You can define how many bytes should be checked. I don't know what a good range is, so I set it to 128 (please change this if you want).
Last edited by AND51 on Fri Jan 18, 2008 1:15 pm, edited 1 time in total.

Post by PB »

The snippets I post are just standalone examples that do not conflict with
anything, to show how it's done. It's up to the user to add any protections
later if they so desire to use them in their apps. :)

As for EnableExplicit, you didn't use it in your example either. :P

I don't know what sort of range is good either, but in my tests I found that
binary files always seem to have Chr(0) somewhere in the first 128 bytes.
This is probably not always true, but I haven't found a case yet where it
isn't, so it seems to be a fairly safe benchmark to use.

Also, if a file size is 0 then it can't be classed as a text file, because it has
no content. So your example is slightly flawed there. It tells me that a file
in my Windows folder called "ModemLog.txt" is a text file, when it isn't, as
it is 0 bytes in size. It's not a text file until it has actual text in it.

But your code has inspired me to shorten your example even further. :)
I've changed my approach slightly so that it returns -1 if it can't be read,
0 if not a text file, and >0 if it is. I've also removed the Loc() check, which
should make it slightly faster again (i.e. one less file operation).

So here goes:

Code:

; IsTextFile by PB. Free for any use without credit needed. :)
; Thanks to AND51 for his help in optimizing my original code.

; Reads the first 128 bytes of file$ to see if it's text or binary.
; If it contains Chr(0) then it's considered binary, as text files
; don't have it. Works fine with files less than 128 bytes in size.
; Probably not 100% foolproof but haven't had a false positive yet.
; Example below is for Windows, but the procedure is cross-platform.

; Return values: -1 if can't read, 0 if not text, >0 if it is text.

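Procedure IsTextFile(file$)
  v=-1
  If ReadFile(0,file$)
    v=0 ; opened OK; an empty file stays at 0 (not text)
    While Not Eof(0) And p<128
      ; ReadByte() is signed, so test it against 0 instead of returning its value
      p+1 : v=1 : If ReadByte(0)=0 : v=0 : Break : EndIf
    Wend
    CloseFile(0)
  EndIf
  ProcedureReturn v
EndProcedure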

windir$=Space(999) : GetWindowsDirectory_(windir$,999)
If Right(windir$,1)<>"\" : windir$+"\" : EndIf

If ExamineDirectory(0,windir$,"*.*")
  While NextDirectoryEntry(0)
    type=DirectoryEntryType(0)
    name$=DirectoryEntryName(0)
    If type=#PB_DirectoryEntry_File
      Debug name$+" = "+Str(IsTextFile(windir$+name$))
    EndIf
  Wend
  FinishDirectory(0)
EndIf

Post by AND51 »

> It's up to the user to add any protections later if they so desire.
This sub-forum is called "tips'n'tricks" and not "add your own protection".
Code in this forum should not only be an example of how you achieve something, but also of how you code.
If people are to understand your code, then you should write it more clearly. I recommend avoiding 2-in-1 lines with : altogether.
Remember, not only "advanced" or "expert" coders copy code from this board, but also newbies. Your experience should remind you that newbies tend to produce 'strange' errors, which might be caused by unprotected variables or by assigning IDs in an 'illogical' way.

> As for EnableExplicit, you didn't use it in your example either.
I didn't say, "Use 'EnableExplicit'!" I just said, "Make your code compatible with 'EnableExplicit'!"
Put EnableExplicit in front of my code and you'll see it works!

> Also, if a file size is 0 then it can't be classed as a text file, because it has no content. So your example is slightly flawed there. It tells me that a file in my Windows folder called "ModemLog.txt" is a text file, when it isn't, as it is 0 bytes in size. It's not a text file until it has actual text in it.
I wasn't sure... Your definition says:
Chr(0) = Binary.
From this I conclude:
If a file does not contain Chr(0), then it's not a binary file. And 0-byte files usually don't contain any Chr(0)s.

But I'll correct this.

Post by AND51 »

> Don't forget to go now and reply to all the other Tip posts and tell them
> what you just told me, okay? Because I forgot these are your forums.
It is not my intention to "attack" you (only)!
Of course I dislike the coding style of many others too, no question! But today I said it to you because I happened to be interested in improving your code.
It has nothing to do with you personally!
But you're a good coder who has released a lot of code. I assume your code is an example for many others.
> Your code inspired me
You see?
I hope that not only the choice of commands and the performance convinced you, but also that the coding style inspired you. :wink:
So let's stop arguing and be friends again... :o


// Edit:
Well, where is your post??
Kale
PureBasic Expert
Posts: 3000
Joined: Fri Apr 25, 2003 6:03 pm
Location: Lincoln, UK
Contact:

Post by Kale »

--Kale


Post by PB »

> where is your post

I deleted it after posting, as I was starting to get sarcastic, but didn't want
to be. I'm trying to exercise more self-control. But you obviously saw it first.
Yes, let's remain friends as I didn't want this post to go the way it went.

Post by PB »

> Here's somemore

Hi Kale, long time no see. :) Glad you're back.

The problem with the other ones in the link you gave, is that they scan the
entire files. I know that's probably the only way to be 100% sure, but when
you're scanning a folder full of files that can be 10 MB each, it's quite a long
time to finish. That's why I chose to check only the first 128 bytes of each,
and only look for Chr(0). This has worked 100% perfectly for me, but as
pointed out by others, it'll fail with Unicode text files and/or Asian files.

There has to be some quick and 100% perfect alternative, then. :(
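
One quick way to cover the Unicode case might be to look for a BOM before doing the Chr(0) scan. The sketch below is only an illustration: IsTextFileBOM is a hypothetical name, and treating any file with a UTF-16 or UTF-32 BOM as text is an assumption (a binary file that happens to start with those bytes would be misjudged, and Unicode files without a BOM are still reported as binary). It uses ReadStringFormat() for the BOM and ReadAsciiCharacter() for an unsigned byte read.

```purebasic
; Sketch only: IsTextFileBOM is a hypothetical name, not code from this thread.
EnableExplicit

Procedure IsTextFileBOM(file$, bytesToCheck=128)
  Protected file, format, count, result=-1
  file=ReadFile(#PB_Any, file$)
  If file
    result=0                         ; opened OK; an empty file stays at 0
    format=ReadStringFormat(file)    ; moves past the BOM if there is one
    If format=#PB_Unicode Or format=#PB_UTF16BE Or format=#PB_UTF32 Or format=#PB_UTF32BE
      ; Assumption: a UTF-16/UTF-32 BOM means text, so zero bytes are expected
      If Eof(file)=0
        result=1
      EndIf
    Else
      While Eof(file)=0 And count<bytesToCheck
        count+1
        result=1
        If ReadAsciiCharacter(file)=0 ; unsigned byte; 0 marks the file as binary
          result=0
          Break
        EndIf
      Wend
    EndIf
    CloseFile(file)
  EndIf
  ProcedureReturn result
EndProcedure
```

The return values follow the same convention as above: -1 if the file can't be read, 0 if not text, 1 if it looks like text. It still only touches the first few bytes of each file, so it should stay fast on large files.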
pdwyer
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

why do you need to differentiate?

Are you specifically trying to find one type or the other for some reason?
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
PB
PureBasic Expert
PureBasic Expert
Posts: 7581
Joined: Fri Apr 25, 2003 5:24 pm

Post by PB »

> why do you need to differentiate?

I'm parsing Firefox's cache, where the files have no extension, and I need to
find all HTML and TXT pages that have been cached. They are mixed in with
JPGs, WMVs, FLVs, etc, so scanning each and every file in its entirety is a
major hassle, and very time-consuming after a long browsing session.

Post by pdwyer »

I was having a think about this in the bath (as one does over here ;) )

I think I see where you are going with this and I thought of a couple of other problems you might hit and some ideas for them. This might not be useful but I'll give it a go.

Since you don't want to slow things down by reading the whole file, can I suggest reading either one disk block or 4096 bytes (a common guess for NTFS)? The HDD reads in this increment anyway, so you may as well take the whole chunk, since the time is the same. You can then probably perform quite a few tests on this block in memory with no significant performance degradation, since disk I/O is your bottleneck.

What tests to do could depend on what you are after. I'm going to guess readable text for some sort of search or something. You might hit files that are all numbers, or MIME / Base64 / UU encoded, or something else which is effectively binary. On the other hand, you might hit UTF-16, which could be English but full of Chr(0).

Some tests could be (remember, just on the 4k):

1) Character average. Readable text will be about 100ish, I guess, whereas binary will likely be closer to 128. English UTF-16 will be lower still; English UTF-8 is ASCII.
2) Space distribution. If there's less than 1 space per 20 chars (less if you want to drop Unicode) in a block over 1k, is it going to be readable? In UTF-8 and UTF-16 the space will have a Chr(32) in there.
3) Existence of a BOM.

There's probably more.

Lastly, I did a peek on the net just now, and under Linux you would apparently use the "file" command. Doing a search on "man file" to see how that app works, I got this:

http://linux.die.net/man/1/file

Some interesting reading there on the topic that might help
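
A rough sketch of the block idea, in case it helps: read one 4096-byte chunk, gather the numbers for the tests above, and decide from those. TextStats, GetTextStats and LooksLikeText are hypothetical names, the thresholds are the guesses from this post rather than tested values, and the BOM test is left out here. The character average is collected but not used for the decision, since I have no solid threshold for it.

```purebasic
; Sketch only: all names and thresholds here are assumptions, not tested values.
EnableExplicit

Structure TextStats
  bytesRead.i   ; bytes actually read from the block
  zeros.i       ; number of Chr(0) bytes
  spaces.i      ; number of Chr(32) bytes
  average.d     ; mean byte value of the block
EndStructure

Procedure GetTextStats(file$, *stats.TextStats, blockSize=4096)
  Protected file, *buffer, i, value, sum.d, ok=#False
  ClearStructure(*stats, TextStats)
  file=ReadFile(#PB_Any, file$)
  If file
    *buffer=AllocateMemory(blockSize)
    If *buffer
      *stats\bytesRead=ReadData(file, *buffer, blockSize)
      For i=0 To *stats\bytesRead-1
        value=PeekA(*buffer+i)        ; unsigned byte, 0..255
        sum+value
        If value=0
          *stats\zeros+1
        ElseIf value=32
          *stats\spaces+1
        EndIf
      Next
      If *stats\bytesRead>0
        *stats\average=sum / *stats\bytesRead
      EndIf
      FreeMemory(*buffer)
      ok=#True
    EndIf
    CloseFile(file)
  EndIf
  ProcedureReturn ok
EndProcedure

; Returns -1 if the file can't be read, 0 if probably not text, 1 if probably text.
Procedure LooksLikeText(file$)
  Protected stats.TextStats
  If GetTextStats(file$, @stats)=#False
    ProcedureReturn -1
  EndIf
  If stats\bytesRead=0 Or stats\zeros>0
    ProcedureReturn 0
  EndIf
  ; Test 2: fewer than 1 space per 20 bytes in a block over 1k
  If stats\bytesRead>1024 And stats\spaces*20<stats\bytesRead
    ProcedureReturn 0
  EndIf
  ProcedureReturn 1
EndProcedure
```

Because everything after the single ReadData() call works in memory, adding more tests to GetTextStats should cost very little next to the disk read itself.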

Post by AND51 »

@ you two:
I hope I understand this correctly (I'm tired), but when it comes to Unicode, wouldn't it be better to use ReadCharacter() instead of ReadByte()?
That's what I did. So I assume my code will also work with UTF-8 files. Do you agree, or did I misunderstand anything?
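
To make the question concrete, here is a small sketch that counts the zero bytes a string produces in different encodings. CountZeroBytes is a hypothetical helper, not part of any code above. As far as I know, UTF-8 only produces a zero byte for the NUL character itself, so a byte-wise Chr(0) test should be fine for UTF-8, while UTF-16 puts a zero byte after every plain English letter.

```purebasic
; Sketch only: CountZeroBytes is a hypothetical helper for illustration.
EnableExplicit

Procedure CountZeroBytes(text$, format)
  Protected size=StringByteLength(text$, format) ; bytes without the terminator
  Protected *buffer, i, zeros
  If size=0
    ProcedureReturn 0
  EndIf
  *buffer=AllocateMemory(size+4)       ; extra room for the terminating null
  If *buffer
    PokeS(*buffer, text$, -1, format)  ; encode the string into the buffer
    For i=0 To size-1
      If PeekA(*buffer+i)=0
        zeros+1
      EndIf
    Next
    FreeMemory(*buffer)
  EndIf
  ProcedureReturn zeros
EndProcedure

Debug CountZeroBytes("Hello", #PB_UTF8)    ; 0 zero bytes
Debug CountZeroBytes("Hello", #PB_Unicode) ; 5 zero bytes (one per letter)
```

So the encoding of the file matters more than the choice of read command: whichever command is used, UTF-16 files need a BOM check or some other test before the Chr(0) rule can be applied.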
Thalius
Enthusiast
Posts: 711
Joined: Thu Jul 17, 2003 4:15 pm
Contact:

Post by Thalius »

EDIT: bah =P

More about file:
http://linux.about.com/library/cmd/blcmdl1_file.htm

It looks like this (-i prints the MIME type):

Code:

me@tuxbox:~> file mmorpg_docu.html -i
mmorpg_docu.html: text/html
me@tuxbox:~> file -i Irr3D.prefs 
Irr3D.prefs: text/plain; charset=us-ascii
me@tuxbox:~> file -i Mantel_Umhang.jpg 
Mantel_Umhang.jpg: image/jpeg
Cheers,
Thalius
"In 3D there is never enough Time to do Things right,
but there's always enough Time to make them *look* right."
"psssst! i steal signatures... don't tell anyone! ;)"