File Format Identification

Posted: Fri Aug 20, 2010 8:05 am
by peterb
Hi, I am developing intelligent file format identification software and am looking for any code that reads file structures, to make the identification smarter. It works on several levels:

1. identify the file type by a known signature (like MZ for executables, BM for bitmaps, etc.)

2. for known formats like executables and bitmaps, load their structures and check that they are correct

3. for text files, search for defined strings and identify their encoding and content (like XHTML 1.0 by the text DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")

4. for archives, read the list of files stored in the archive (so far I have code for zip and rar, and I am searching for or programming more types)

5. ....
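As a rough illustration of level 1, here is a minimal Python sketch of a signature lookup. The table, the labels, and the function name `candidates_by_signature` are made up for this example and are not taken from any existing tool; the magic bytes themselves are the commonly documented ones. A match is only a hint and would still need confirming by the later levels.

```python
# Hypothetical signature table: (offset, magic bytes, label).
# Only a handful of well-known signatures are listed here.
SIGNATURES = [
    (0, b"MZ", "DOS/Windows executable"),
    (0, b"BM", "Windows bitmap"),
    (0, b"PK\x03\x04", "ZIP archive"),
    (0, b"Rar!\x1a\x07", "RAR archive"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
]


def candidates_by_signature(data):
    """Return the labels whose magic bytes match the start of `data`.

    A match is only a candidate: a text file that happens to start
    with "MZ" is reported as an executable here as well.
    """
    return [
        label
        for offset, magic, label in SIGNATURES
        if data[offset:offset + len(magic)] == magic
    ]
```

Calling `candidates_by_signature(open(path, "rb").read(16))` on the first few bytes of a file is enough for this table, since none of the listed signatures is longer than 8 bytes.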

If you have some useful code, then please post it here.

Re: File Format Identification

Posted: Fri Aug 20, 2010 9:20 am
by Thorium
file format information: http://www.wotsit.org/

Checking the structure of a file for errors is a huge project in itself for some file formats, like Portable Executable (.exe).

Re: File Format Identification

Posted: Fri Aug 20, 2010 10:03 am
by peterb
Hi,

I know wotsit.org. My software is not meant for checking files for errors. I load the structures to check the content of files. If you use only signature checking, the detection can be fooled. If you detect an executable file by the two characters MZ, then a text file that starts with MZ can be identified as an executable (it's only an example). But if you load the IMAGE_DOS_HEADER structure and check its values, then you can confirm that the file is an executable.
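To make the MZ example concrete, here is a small Python sketch of that kind of check. It assumes the standard layout of IMAGE_DOS_HEADER (64 bytes, with the `e_lfanew` field at offset 0x3C) and then looks for the "PE\0\0" signature that `e_lfanew` points to. The function name `looks_like_pe` is invented for this example, and the check is deliberately shallow: it does not validate anything beyond those two signatures.

```python
import struct

DOS_HEADER_SIZE = 64      # size of IMAGE_DOS_HEADER in bytes
E_LFANEW_OFFSET = 0x3C    # offset of the e_lfanew field inside it


def looks_like_pe(data):
    """Return True if `data` starts with a DOS header that points to a PE signature.

    Plain DOS executables also start with "MZ" but have no PE signature,
    so they are reported as False by this sketch.
    """
    if len(data) < DOS_HEADER_SIZE or data[:2] != b"MZ":
        return False
    # e_lfanew holds the file offset of the PE header (little-endian, 32 bit).
    (e_lfanew,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if e_lfanew + 4 > len(data):
        return False
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"
```

A text file that merely begins with "MZ" fails here because the bytes at offset 0x3C do not lead to a "PE\0\0" signature, which is the kind of confirmation described above.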

Peter

Re: File Format Identification

Posted: Fri Aug 20, 2010 3:00 pm
by Thorium
peterb wrote: I know wotsit.org. My software is not meant for checking files for errors. I load the structures to check the content of files. If you use only signature checking, the detection can be fooled. If you detect an executable file by the two characters MZ, then a text file that starts with MZ can be identified as an executable (it's only an example). But if you load the IMAGE_DOS_HEADER structure and check its values, then you can confirm that the file is an executable.
Ah ok, I misunderstood you.

Checking the values might be problematic. You need a deep understanding of the file format and its content if you check for possible values. I see a big potential for false negatives there.

As for identifying a text file: that's just impossible. A plain text file isn't a file format. It's just a way of viewing a file. Every file can be viewed as a text file.

Re: File Format Identification

Posted: Fri Aug 20, 2010 3:56 pm
by cas
Thorium wrote: I see a big potential for false negatives there.
And false positives, too. :)

Re: File Format Identification

Posted: Fri Aug 20, 2010 4:55 pm
by peterb
svg, html, xml, ... are text files with a defined format, and you can determine which format it is.

I didn't write about PLAIN text :-)
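A rough Python sketch of what that could look like for a few markup formats. The names `decode_head` and `sniff_text_format`, the 2048-byte limit, and the order of the checks are all assumptions made for this example. The encoding step only looks at a byte order mark and otherwise assumes UTF-8, so it is far from complete encoding detection (a UTF-32 BOM, for instance, is not handled).

```python
import codecs
import re

# Byte order marks checked by this sketch, with the codec to decode the rest.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_head(data, limit=2048):
    """Decode the first `limit` bytes, using a BOM if present, else UTF-8."""
    head = data[:limit]
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return head[len(bom):].decode(encoding, errors="replace")
    return head.decode("utf-8", errors="replace")


def sniff_text_format(data):
    """Guess the markup format from defined strings near the start of the file."""
    lowered = decode_head(data).lstrip().lower()
    if "-//w3c//dtd xhtml 1.0" in lowered:
        return "XHTML 1.0"
    if lowered.startswith("<!doctype html") or re.search(r"<html[\s>]", lowered):
        return "HTML"
    if re.search(r"<svg[\s>]", lowered):
        return "SVG"
    if lowered.startswith("<?xml"):
        return "XML"
    return "unknown"
```

HTML is tested before SVG so that a page with an inline `<svg>` element is still reported as HTML. Anything that matches none of the strings comes back as "unknown", which is where plain text would end up.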

Re: File Format Identification

Posted: Sat Aug 21, 2010 1:07 pm
by Hi-Toro
Hi peterb,

Here are some examples of determining some graphics file formats (in Blitz, easily portable to PB of course)...

http://www.blitzbasic.com/codearcs/code ... ?code=2477

(Search for "File format tests" on that page to find the right part -- GotBMP, GotGIF, GotJPEG, etc.)

Here's some code that reads the Gimp's .xcf file format:

http://www.blitzbasic.com/codearcs/code ... ?code=2585

(Search for "Main function code" to find the read section.)

Finally, here's some Win32 executable reading code, though it sounds like you have this covered...

http://www.blitzbasic.com/codearcs/code ... ?code=2628

(I had to do some hacking with the structures as Blitz unfortunately doesn't have the option to use plain structs like PB, but it'll be easy to see how to use it from PB. As you'll see in the comments, I was helped by code from 'thefool' on this forum originally!)