File Format Identification

peterb
User
Posts: 60
Joined: Sun Oct 02, 2005 8:55 am
Location: Czech Republic

File Format Identification

Post by peterb »

Hi, I am developing intelligent file format identification software and am looking for any code that reads file structures, to make the identification smarter. It works on several levels:

1. Identify the file type by a known signature (like MZ for executables, BM for bitmaps, etc.)

2. For known formats such as executables and bitmaps, load their structures and check that the values are correct

3. For text files, search for defined strings and identify the encoding and content (for example, XHTML 1.0 by the text DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")

4. For archives, read the list of files stored in the archive (so far I have code for zip and rar, and I am searching for or programming code for more types)

5. ....

If you have some useful code, please post it here.
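
For reference, level 1 (signature lookup) could be sketched roughly like this. It is written in Python rather than PureBasic so it can be run as-is. The names `SIGNATURES` and `identify_by_signature` are hypothetical and do not come from anyone's code in this thread, and the table only covers the formats mentioned in the post.

```python
from typing import Optional

# Hypothetical signature table: (magic bytes at offset 0, label).
# A real tool would carry many more entries and non-zero offsets.
SIGNATURES = [
    (b"Rar!\x1a\x07", "rar archive"),
    (b"PK\x03\x04", "zip archive"),
    (b"MZ", "executable"),
    (b"BM", "bitmap"),
]


def identify_by_signature(data: bytes) -> Optional[str]:
    """Return the label of the first signature that prefixes data, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None
```

A signature match on its own is only a hint: any file that happens to start with the same bytes will match, so the result is best treated as a candidate to be confirmed by a deeper check.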
Thorium
Addict
Posts: 1305
Joined: Sat Aug 15, 2009 6:59 pm

Re: File Format Identification

Post by Thorium »

file format information: http://www.wotsit.org/

Checking the structure of a file for errors is a huge project in itself for some file formats, such as Portable Executable (.exe).

Re: File Format Identification

Post by peterb »

Hi,

I know wotsit.org. My software is not meant for checking files for errors. I load the structures in order to check the content of files. If you use only signature checking, the detection can be fooled. If you detect an executable by the two characters MZ, then a text file that starts with MZ can be identified as an executable (it's only an example). But if you load the IMAGE_DOS_HEADER structure and check its values, you can confirm that the file is an executable.

Peter
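
A rough sketch of that idea, in Python rather than PureBasic so it runs as-is. It relies on two facts about the format: IMAGE_DOS_HEADER is 64 bytes long, and its last field, e_lfanew at offset 0x3C, holds the offset of the "PE\0\0" signature. The function name `looks_like_pe` is hypothetical, and the check only covers Windows PE files; a pure DOS executable has no PE header and would need different tests.

```python
import struct

DOS_HEADER_SIZE = 64   # size of IMAGE_DOS_HEADER in bytes
E_LFANEW_OFFSET = 0x3C # offset of the e_lfanew field inside the header


def looks_like_pe(data: bytes) -> bool:
    """Confirm an MZ signature by following e_lfanew to the PE signature."""
    if len(data) < DOS_HEADER_SIZE or data[:2] != b"MZ":
        return False
    # e_lfanew is a 32-bit little-endian file offset to the PE header.
    (e_lfanew,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if e_lfanew + 4 > len(data):
        return False  # offset points outside the file, so not a PE image
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"
```

With this check, a text file that merely begins with the letters MZ is rejected, because the bytes at offset 0x3C decode to an offset that does not lead to a PE signature.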

Re: File Format Identification

Post by Thorium »

peterb wrote: I know wotsit.org. My software is not meant for checking files for errors. I load the structures in order to check the content of files. If you use only signature checking, the detection can be fooled. If you detect an executable by the two characters MZ, then a text file that starts with MZ can be identified as an executable (it's only an example). But if you load the IMAGE_DOS_HEADER structure and check its values, you can confirm that the file is an executable.
Ah OK, I misunderstood you.

Checking the values might be problematic. You need a deep understanding of the file format and its content if you check for possible values. I see a big potential for false negatives there.

As for identifying a text file: that's just impossible. A plain text file isn't a file format. It's just a way of viewing a file. Every file can be viewed as a text file.
cas
Enthusiast
Posts: 597
Joined: Mon Nov 03, 2008 9:56 pm

Re: File Format Identification

Post by cas »

Thorium wrote: I see a big potential for false negatives there.
And false positives, too. :)

Re: File Format Identification

Post by peterb »

SVG, HTML, XML, .... are text files with a defined format, and you can determine which format a file is.

I didn't write about PLAIN text :-)
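
A minimal sketch of how such text-based formats might be told apart, in Python rather than PureBasic so it runs as-is. The names `BOMS`, `MARKERS` and `sniff_text_format` are hypothetical. The marker list is an illustrative assumption, not a complete rule set, and files without a byte order mark are simply assumed to be UTF-8.

```python
import codecs
import re

# Byte order marks checked at the start of the data (UTF-32 is left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Illustrative markers, most specific first.
MARKERS = [
    (re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD XHTML 1\.0', re.I), "XHTML 1.0"),
    (re.compile(r"<!DOCTYPE\s+html|<html[\s>]", re.I), "HTML"),
    (re.compile(r"<svg[\s>]", re.I), "SVG"),
    (re.compile(r"^\s*<\?xml\s"), "XML"),
]


def sniff_text_format(data: bytes, limit: int = 4096):
    """Return (format label or None, encoding) based on the first bytes."""
    head = data[:limit]
    encoding = "utf-8"  # assumption when no BOM is present
    for bom, name in BOMS:
        if head.startswith(bom):
            encoding = name
            head = head[len(bom):]
            break
    text = head.decode(encoding, errors="replace")
    for pattern, label in MARKERS:
        if pattern.search(text):
            return label, encoding
    return None, encoding
```

Only the first few kilobytes are inspected, since the declarations that distinguish these formats normally sit at the top of the file.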
Hi-Toro
Enthusiast
Posts: 270
Joined: Sat Apr 26, 2003 3:23 pm

Re: File Format Identification

Post by Hi-Toro »

Hi peterb,

Here are some examples of determining some graphics file formats (in Blitz, easily portable to PB of course)...

http://www.blitzbasic.com/codearcs/code ... ?code=2477

(Search for "File format tests" on that page to find the right part -- GotBMP, GotGIF, GotJPEG, etc.)
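
For readers who cannot open the link, here is an independent sketch of the same kind of test. It is not the Blitz code from that page, and it is written in Python so it runs as-is. The function name `sniff_image` and the set of accepted BMP header sizes are assumptions made for illustration.

```python
import struct
from typing import Optional

# DIB header sizes accepted for BMP (an assumption: common variants only).
KNOWN_DIB_HEADER_SIZES = {12, 40, 52, 56, 64, 108, 124}


def sniff_image(data: bytes) -> Optional[str]:
    """Guess GIF, JPEG, PNG or BMP from the leading bytes, else None."""
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    if data[:3] == b"\xff\xd8\xff":
        return "JPEG"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "PNG"
    if data[:2] == b"BM" and len(data) >= 18:
        # The 14-byte file header is followed by the DIB header,
        # whose first DWORD is its own size.
        (dib_size,) = struct.unpack_from("<I", data, 14)
        if dib_size in KNOWN_DIB_HEADER_SIZES:
            return "BMP"
    return None
```

The BMP branch goes one step beyond the two-letter signature, so text that merely starts with BM is not reported as a bitmap. The price is that a bitmap with an unusual header size would be missed.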

Here's some code that reads the Gimp's .xcf file format:

http://www.blitzbasic.com/codearcs/code ... ?code=2585

(Search for "Main function code" to find the read section.)

Finally, here's some Win32 executable reading code, though it sounds like you have this covered...

http://www.blitzbasic.com/codearcs/code ... ?code=2628

(I had to do some hacking with the structures as Blitz unfortunately doesn't have the option to use plain structs like PB, but it'll be easy to see how to use it from PB. As you'll see in the comments, I was helped by code from 'thefool' on this forum originally!)
James Boyd
http://www.hi-toro.com/
Death to the Pixies!