[Tip] Testing if a file is text or binary

Posted: Fri Jan 18, 2008 9:55 am
by PB
Here's my tip for testing if a file is text or binary. See comments in code.

BTW, I know about the ReadStringFormat command and the #PB_Ascii
result, but as the docs say: that isn't 100% foolproof or a file standard.

Also, don't forget that a file called test.txt with a 0 byte size is not really
a text file, so there is no bug with this procedure on such files. ;)

[Edit] See my enhanced version further down which doesn't use
AllocateMemory, PeekS and FileSeek. Thanks to AND51 for that!

Code:

; IsTextFile by PB. Free for any use without credit needed. :)

; Reads the first 128 bytes of file$ to see if it's text or binary.
; If it contains Chr(0) then it's considered binary, as text files
; don't have it. Works fine with files less than 128 bytes in size.
; Probably not 100% foolproof but haven't had a false positive yet.
; Example below is for Windows, but the procedure is cross-platform.

; Return values: -1 if can't get a result, 0 for binary, 1 for text.

Procedure IsTextFile(file$)
  v=-1
  If ReadFile(0,file$)
    m=AllocateMemory(128)
    If m
      ; Limit PeekS to the 128 bytes that were read so it cannot run past the buffer.
      v=0 : ReadData(0,m,128) : m$=PeekS(m,128,#PB_Ascii) : FreeMemory(m)
      FileSeek(0,0) : If m$<>ReadString(0) : v=1 : EndIf
    EndIf
    CloseFile(0)
  EndIf
  ProcedureReturn v
EndProcedure

windir$=Space(999) : GetWindowsDirectory_(windir$,999)
If Right(windir$,1)<>"\" : windir$+"\" : EndIf

If ExamineDirectory(0,windir$,"*.*")
  While NextDirectoryEntry(0)
    type=DirectoryEntryType(0)
    name$=DirectoryEntryName(0)
    If type=#PB_DirectoryEntry_File
      Debug name$+" = "+Str(IsTextFile(windir$+name$))
    EndIf
  Wend
  FinishDirectory(0)
EndIf

Posted: Fri Jan 18, 2008 11:03 am
by AND51
First, get used to #PB_Any for your file IDs...
Hard-coding 0 as the file ID can automatically close another open file that was previously assigned 0 as its ID.
Code example:

Code:

CreateFile(1, "test.txt")
   WriteString(1, "Test")
ReadFile(1, "test.txt") ; no need to close the created file. This is being done automatically, because of the same ID
   Debug ReadString(1)

; Program Ends - file is closed automatically
Reusing the same ID to free an old resource automatically and create a new one (whether for files, images, sprites...) may save one line of code, but it can also disrupt program execution.


Second, please make your code EnableExplicit-compatible, i.e. declare (protect) your variables, etc. :wink:
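
Both points can be sketched in a few lines. This is only a minimal illustration, assuming a reasonably current PureBasic compiler; the procedure name CountLines and the file name "test.txt" are made up for the example:

```purebasic
EnableExplicit

; Hypothetical helper: counts the lines of a file without using a fixed file ID.
Procedure.i CountLines(file$)
  Protected file, lines
  file = ReadFile(#PB_Any, file$) ; PureBasic picks a free ID and returns it
  If file = 0
    ProcedureReturn -1 ; the file could not be opened
  EndIf
  While Eof(file) = 0
    ReadString(file) ; the line content is not needed here, only the count
    lines + 1
  Wend
  CloseFile(file)
  ProcedureReturn lines
EndProcedure

Debug CountLines("test.txt")
```

Because the ID comes from #PB_Any, calling this procedure cannot close a file that the rest of the program opened with ID 0, and every variable is declared, so the code compiles under EnableExplicit.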

Posted: Fri Jan 18, 2008 11:24 am
by AND51
Much faster and more comfortable:

Code:

Procedure myIsTextFile(file$, bytesToCheck=128)
	Protected file=ReadFile(#PB_Any, file$), result
	If file
		result=Lof(file) ; query the size only after the file opened; 0 for an empty file
		While Not Eof(file) And  Loc(file) < bytesToCheck
			If Not ReadCharacter(file)
				result=0
				Break
			EndIf
		Wend
		CloseFile(file)
		ProcedureReturn result
	EndIf
	ProcedureReturn -1
EndProcedure





windir$=Space(999) : GetWindowsDirectory_(windir$,999) 
If Right(windir$,1)<>"\" : windir$+"\" : EndIf 

If ExamineDirectory(0,windir$,"*.*") 
  While NextDirectoryEntry(0) 
    type=DirectoryEntryType(0) 
    name$=DirectoryEntryName(0) 
    If type=#PB_DirectoryEntry_File 
      Debug name$+" = "+Str(myIsTextFile(windir$+name$)) 
    EndIf 
  Wend 
  FinishDirectory(0) 
EndIf 
  • Return values:
    • -1 = file cannot be read
    • 0 = file is binary or empty
    • >0 = file is a text file (the value is the file size from Lof())
  • You can define how many bytes should be checked. I don't know what a good range is, so I set it to 128 (please change it if you want).

Posted: Fri Jan 18, 2008 12:20 pm
by PB
The snippets I post are just standalone examples that do not conflict with
anything, to show how it's done. It's up to the user to add any protections
later if they so desire to use them in their apps. :)

As for EnableExplicit, you didn't use it in your example either. :P

I don't know what sort of range is good either, but in my tests I found that
binary files always seem to have Chr(0) somewhere in the first 128 bytes.
This is probably not always true, but I haven't found a case yet where it
isn't, so it seems to be a fairly safe benchmark to use.

Also, if a file size is 0 then it can't be classed as a text file, because it has
no content. So your example is slightly flawed there. It tells me that a file
in my Windows folder called "ModemLog.txt" is a text file, when it isn't, as
it is 0 bytes in size. It's not a text file until it has actual text in it.

But your code has inspired me to shorten your example even further. :)
I've changed my approach slightly so that it returns -1 if it can't be read,
0 if not a text file, and >0 if it is. I've also removed the Loc() check which
should make it slightly faster again (ie. one less file operation).

So here goes:

Code:

; IsTextFile by PB. Free for any use without credit needed. :)
; Thanks to AND51 for his help in optimizing my original code.

; Reads the first 128 bytes of file$ to see if it's text or binary.
; If it contains Chr(0) then it's considered binary, as text files
; don't have it. Works fine with files less than 128 bytes in size.
; Probably not 100% foolproof but haven't had a false positive yet.
; Example below is for Windows, but the procedure is cross-platform.

; Return values: -1 if can't read, 0 if not text, >0 if it is text.

Procedure IsTextFile(file$)
  v=-1
  If ReadFile(0,file$)
    While Not Eof(0) And p<128
      p+1 : v=ReadByte(0) & $FF : If v=0 : Break : EndIf ; & $FF keeps bytes above 127 positive
    Wend
    CloseFile(0)
  EndIf
  ProcedureReturn v
EndProcedure

windir$=Space(999) : GetWindowsDirectory_(windir$,999)
If Right(windir$,1)<>"\" : windir$+"\" : EndIf

If ExamineDirectory(0,windir$,"*.*")
  While NextDirectoryEntry(0)
    type=DirectoryEntryType(0)
    name$=DirectoryEntryName(0)
    If type=#PB_DirectoryEntry_File
      Debug name$+" = "+Str(IsTextFile(windir$+name$))
    EndIf
  Wend
  FinishDirectory(0)
EndIf

Posted: Fri Jan 18, 2008 1:05 pm
by AND51
> It's up to the user to add any protections later if they so desire.
This sub-forum is called "tips'n'tricks" and not "add your own protection".
Code in this forum should be an example not only of how you achieve something, but also of how you code.
If people are to understand your code, you should write it more clearly. I recommend avoiding two-in-one lines with : altogether.
Remember, not only "advanced" or "expert" coders copy code from this board, but also newbies. Your experience should remind you that newbies tend to produce 'strange' errors, which might be caused by unprotected variables or by assigning IDs in an 'illogical' way.

> As for EnableExplicit, you didn't use it in your example either.
I didn't say, "Use 'EnableExplicit'!" I just said, "Make your code compatible with 'EnableExplicit'!"
Put EnableExplicit in front of my code and you'll see it works!

> Also, if a file size is 0 then it can't be classed as a text file, because it has no content. So your example is slightly flawed there. It tells me that a file in my Windows folder called "ModemLog.txt" is a text file, when it isn't, as it is 0 bytes in size. It's not a text file until it has actual text in it.
I wasn't sure... Your definition says:
Chr(0) = Binary.
From this I conclude:
If a file does not contain Chr(0), then it's not a binary file. And 0-byte files usually don't contain any Chr(0).

But I'll correct this.

Posted: Fri Jan 18, 2008 1:26 pm
by AND51
> Don't forget to go now and reply to all the other Tip posts and tell them
> what you just told me, okay? Because I forgot these are your forums.
It is not my intention to "attack" you (only)!
Of course I dislike the coding style of many others, that goes without saying! But today I said it to you, because I happened to be interested in improving your code.
It has nothing to do with you personally!
But you're a good coder who has released a lot of code. I assume your code is an example for many others.
> Your code inspired me
You see?
I hope that not only the choice of commands and the performance convinced you, but also the coding style inspired you. :wink:
So let's stop arguing and be friends again... :o


// Edit:
Well, where is your post??

Posted: Fri Jan 18, 2008 2:05 pm
by Kale

Posted: Fri Jan 18, 2008 2:08 pm
by PB
> where is your post

I deleted it after posting, as I was starting to get sarcastic, but didn't want
to be. I'm trying to exercise more self-control. But you obviously saw it first.
Yes, let's remain friends as I didn't want this post to go the way it went.

Posted: Fri Jan 18, 2008 2:13 pm
by PB
> Here's somemore

Hi Kale, long time no see. :) Glad you're back.

The problem with the other ones in the link you gave, is that they scan the
entire files. I know that's probably the only way to be 100% sure, but when
you're scanning a folder full of files that can be 10 MB each, it's quite a long
time to finish. That's why I chose to check only the first 128 bytes of each,
and only look for Chr(0). This has worked 100% perfectly for me, but as
pointed out by others, it'll fail with Unicode text files and/or Asian files.

There has to be some quick and 100% perfect alternative, then. :(
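
One partial answer to the Unicode case, as a rough sketch only: check for a byte order mark with ReadStringFormat() before scanning for Chr(0). The procedure name HasUnicodeBOM is made up for illustration, and a UTF-16 file saved without a BOM would still be misjudged by the Chr(0) test.

```purebasic
; Hypothetical helper: returns #True if the file starts with a Unicode byte order mark.
Procedure.i HasUnicodeBOM(file$)
  Protected file, result = #False
  file = ReadFile(#PB_Any, file$)
  If file
    ; ReadStringFormat() reports #PB_Ascii when it finds no BOM.
    If ReadStringFormat(file) <> #PB_Ascii
      result = #True
    EndIf
    CloseFile(file)
  EndIf
  ProcedureReturn result
EndProcedure
```

A caller could treat a file as text when this returns #True and fall back to the Chr(0) scan otherwise. The check reads only the first few bytes, so it costs next to nothing per file.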

Posted: Fri Jan 18, 2008 2:30 pm
by pdwyer
why do you need to differentiate?

Are you specifically trying to find one type or the other for some reason?

Posted: Fri Jan 18, 2008 2:56 pm
by PB
> why do you need to differentiate?

I'm parsing Firefox's cache, where the files have no extension, and I need to
find all HTML and TXT pages that have been cached. They are mixed in with
JPGs, WMVs, FLVs, etc, so scanning each and every file in its entirety is a
major hassle, and very time-consuming after a long browsing session.

Posted: Fri Jan 18, 2008 3:56 pm
by pdwyer
I was having a think about this in the bath (as one does over here ;) )

I think I see where you are going with this and I thought of a couple of other problems you might hit and some ideas for them. This might not be useful but I'll give it a go.

Since you don't want to slow it down by reading the whole file, can I suggest reading either one disk block or 4096 bytes (a common guess on NTFS)? The HDD reads in this increment anyway, so you may as well take the whole chunk, since the time is the same. You can then probably perform quite a few tests on this block in memory with no significant performance degradation, since disk I/O is your bottleneck.

What tests to do could depend on what you are after. I'm gonna guess readable text for some sort of search or something. You might hit files that are all numbers, or MIME / Base64 / UU encoded, or something else which is effectively binary. On the other hand you might hit UTF-16, which could be English but full of Chr(0).

Some tests could be: (remember just on the 4k)

1) Char average. Readable text will be about 100ish I guess, whereas binary will likely be closer to 128. English UTF-16 will be lower still; English UTF-8 is the same as ASCII.
2) Space distribution. If there's less than 1 space per 20 chars (less if you want to drop Unicode) in a block over 1k, is it going to be readable? In UTF-8 and UTF-16 the space will still show up as a Chr(32).
3) Existence of a BOM.

There's probably more.
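
The first two tests above can be sketched roughly like this. It is an illustration under assumptions, not a finished detector: the name DescribeFirstBlock and the constant #BlockSize are made up, and the sketch only prints the raw numbers, leaving the choice of thresholds open.

```purebasic
#BlockSize = 4096 ; assumed block size, following the 4096-byte guess above

; Hypothetical helper: prints simple statistics for the first block of a file.
Procedure.i DescribeFirstBlock(file$)
  Protected file, *buffer, bytesRead, i, value, sum, spaces, nulls
  file = ReadFile(#PB_Any, file$)
  If file = 0
    ProcedureReturn #False ; the file could not be opened
  EndIf
  *buffer = AllocateMemory(#BlockSize)
  If *buffer
    bytesRead = ReadData(file, *buffer, #BlockSize) ; one read, then work in memory
    For i = 0 To bytesRead - 1
      value = PeekA(*buffer + i) ; unsigned byte, 0 to 255
      sum + value
      If value = 0
        nulls + 1
      EndIf
      If value = 32
        spaces + 1
      EndIf
    Next
    If bytesRead > 0
      Debug "Average byte value: " + Str(sum / bytesRead)
      Debug "Spaces: " + Str(spaces) + " of " + Str(bytesRead) + " bytes"
      Debug "Chr(0) bytes: " + Str(nulls)
    EndIf
    FreeMemory(*buffer)
  EndIf
  CloseFile(file)
  ProcedureReturn #True
EndProcedure
```

Running it over a handful of known text and binary files should show whether the averages and space counts separate cleanly enough to pick cut-off values.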

Lastly, I did a peek on the net just now, and under Linux you would apparently use the "file" command. Doing a search on "man file" to see how that app works, I got this:

http://linux.die.net/man/1/file

Some interesting reading there on the topic that might help

Posted: Sat Jan 19, 2008 12:27 am
by AND51
@ you two:
I hope I understand it correctly (I'm tired), but when it comes to Unicode, wouldn't it be better to use ReadCharacter() instead of ReadByte()?
I did so. So I assume my code will also work with UTF-8 files. Do you agree, or did I misunderstand anything?

Posted: Sat Jan 19, 2008 12:58 am
by Thalius
EDIT: bah =P

More about file:
http://linux.about.com/library/cmd/blcmdl1_file.htm

It looks like this (-i prints the MIME type instead of the usual description):

Code:

me@tuxbox:~> file mmorpg_docu.html -i
mmorpg_docu.html: text/html
me@tuxbox:~> file -i Irr3D.prefs 
Irr3D.prefs: text/plain; charset=us-ascii
me@tuxbox:~> file -i Mantel_Umhang.jpg 
Mantel_Umhang.jpg: image/jpeg
Cheers,
Thalius