Database implementation/organization

Trond · Post by **Trond** » Wed Oct 08, 2008 6:28 pm

I'm complete database newbie, and I have a question of how to organize a database of a certain kind.

The functionality required is as follows:
A huge list of filenames with text attributes. This list should be searchable. Searching and inserting and deleting items should be fast. The list items never needs to be modified apart from deletion.

How would I best do this with tables and so on? I have absolutely no clue what makes a database fast.

Edit: each file can have any number of text attributes, a common case will be 4, rare case 10. However, many files can (and will) have the same attributes. I estimate the number of files for a given attribute string will range from 1 to 200 with 10 a common case.

srod · Post by **srod** » Wed Oct 08, 2008 7:00 pm

Since you are in for a variable number of attributes (assuming each attribute is a text field) then I would suggest two tables linked by a common 'key' field which can be a simple integer identifier. With SQL queries you can easily cross link the tables and pull out all entries related to a particular file.

The first table could simply hold the filenames together with a key field. The second table would hold the attributes with one record for each attribute linked to the appropriate filename.

E.g. suppose filename "test.exe" has 3 attributes; "h", "a" and "r".

Then the entry in the main table would consist of (001, 'test.exe') where 001 is the alloted key field. In the attributes table we then have 3 entries : (001, 'h') and (001, 'a') and (001, 'r') etc. A simple SQL statement of the form (depending on which database engine you opt for) :

Code: Select all

SELECT FILES.filename, ATTRIBUTES.attribute FROM FILES, ATTRIBUTES WHERE FILES.Id = ATTRIBUTES.ID

would return a recordset which includes the following records :

Code: Select all

('test.exe', 'h')
('test.exe', 'a')
('test.exe', 'r')
etc.

Course you will need to bone up on SQL first, and also decide if a database is your best option in this case!

Trond · Post by **Trond** » Wed Oct 08, 2008 7:18 pm

I think a database is a good idea, because there will potentially be a lot of data.

I have a question about your database layout (I hope I understood it correctly):
With this scheme I'd have to store each attribute string again for all files. So if 100 files has the attributes Red Big Squirrel Sensational, then I'd use a whopping 2,5 kb for storing the attributes, while 25 bytes would suffice, right? That's what I want to avoid. Is that not possible?

Frankly, I've never understood how databases could operate even remotely efficiently, and the more I learn about them the more stupid they seem. It could be that the engines uses some really clever tricks under the hood, but if not, there just seems to be either data duplication and/or severe query inefficiency every time.

Please tell me how the database can efficiently search ie. the attribute table. It's not sorted by search criteria! How can it be searched quickly???

milan1612 · Post by **milan1612** » Wed Oct 08, 2008 7:25 pm

Just a suggestion - but what about this:

Code: Select all

#r = 1
#a = 2
#h = 4
#w = 8
#s = 16
#t = 32
;usw...

Flag = #r | #a | #t | #h

Then you'd have to store only the filename and one single flag containing all the attributes.
You can then retrieve the information like so:

Code: Select all

If Flag & #a
  Debug "Contains attribute 'a'"
EndIf

Dunno if it suits your needs, I just remembered that I have used this before

srod · Post by **srod** » Wed Oct 08, 2008 7:34 pm

In which case you can use 3 tables; one for the filenames, one containing the unique attribute strings and another one to link them together.

Table : FILES

file_ID, filename
1, 'test.exe'
2, 'myDoc.txt'
etc.

Table : ATTRIBUTES

attribute_ID, attribute
'rdo', 'read only'
'arc', 'archive'
'hid', 'hidden'
etc.

Table : LINK

file_ID, attribute_ID
1, 'rdo'
1, 'arc'
2, 'arc'
etc.

Then the following SQL statement will retrieve all attributes for the 'test.exe' file:

Code: Select all

SELECT ATTRIBUTES.attribute FROM ATTRIBUTES, LINK WHERE LINK.file_Id = 1 AND  ATTRIBUTES.attribute_ID = LINK.attribute_ID

The above statement will return the following recordset :

Code: Select all

'read only'
'archive'

etc.

Foz · Post by **Foz** » Wed Oct 08, 2008 7:43 pm

There are a few methods that you can implement, but I think the way you want to go is with a "Tag Cloud".
As you are creating the files you give them tags which have a Tag table, such as:

Code: Select all

[Tag ID] int, primary key, auto increment
Description varchar(50) indexed

You have a similar Files table:

Code: Select all

[File ID] int, primary key, auto increment
Name varchar(255)

And then you have a File Tag Lookup:

Code: Select all

[File ID] int
[Tag ID] int,  indexed

Next you have to create the database entries, and you have two choices, split up all the words or allow phrases. I'll include both. So add a file:

Code: Select all

insert into Files ([Name]) values ('filea.txt')

(I'm assuming that you get the auto increment value here - you'll need to store it in memory)

Code: Select all

insert into Tags ([Description]) values ('Red')
insert into Tags ([Description]) values ('Big')
insert into Tags ([Description]) values ('Squirrel')
insert into Tags ([Description]) values ('Sensational')
insert into Tags ([Description]) values ('Red Big Squirrel Sensational')

(again I'm assuming that you get the auto increment value here for each insert - you'll need to store it in memory)

and finally we can fill in the lookup table:

Code: Select all

insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 1)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 2)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 3)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 4)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 5)
/* comments: this is adding if it was a new empty table:
filea.txt, Red
filea.txt, Big
filea.txt, Squirrel
filea.txt, Sensational
filea.txt, Red Big Squirrel Sensational
*/

and then to do a search you can do for a phrase:

Code: Select all

select [Name]
from Files
Where [File ID] in (
    select [File ID]
    from [File Tag Lookup]
    where [Tag ID] in (
        select [Tag ID]
        from [Tags]
        where Description = 'Red Big Squirrel Sensational'
    )
)

or to separate the words out:

Code: Select all

select [Name]
from Files
Where [File ID] in (
    select [File ID]
    from [File Tag Lookup]
    where [Tag ID] in (
        select [Tag ID]
        from [Tags]
        where Description in ('Red', 'Big', 'Sensational')
    )
)

If you need an example, I'll put together an SQLite example for you.

*** Edit: dammit, beaten by srod!

ts-soft · Post by **ts-soft** » Wed Oct 08, 2008 7:44 pm

srod wrote: Table : LINK
file_ID, attribute_ID
1, 'rdo'
1, 'arc'
2, 'arc'
etc.

Duplicated Data, not the best, better a table with combined attributes

srod · Post by **srod** » Wed Oct 08, 2008 7:46 pm

Thomas this is a standard trick when using many-many relationships. The important thing is that the LINK table does not have a key-field, or at the very least the two link fields shown are not defined as being key fields or even unique fields etc.

Rook Zimbabwe · Post by **Rook Zimbabwe** » Wed Oct 08, 2008 7:47 pm

You can download this:

http://www.bluemesapc.com/Downloads/filelist1.zip

+/- 8k zipped

Has 1 table: FileTable1

Columns are:

ID (autonumber for indexing)
filename
filedescription
ATT0 (the following are YES/NO)
ATT1
ATT2
ATT3
ATT4
ATT5
ATT6
ATT7
ATT8
ATT9
ATT10 (I was going to use this for ACTIVE or NOT ACTIVE)

If you need help pn a QUERY from the data... let me know

Trond · Post by **Trond** » Wed Oct 08, 2008 8:05 pm

milan1612 wrote:Just a suggestion - but what about this:
Code: Select all
#r = 1
#a = 2
#h = 4
#w = 8
#s = 16
#t = 32
;usw...

Flag = #r | #a | #t | #h
Then you'd have to store only the filename and one single flag containing all the attributes.
You can then retrieve the information like so:
Code: Select all
If Flag & #a
  Debug "Contains attribute 'a'"
EndIf
Dunno if it suits your needs, I just remembered that I have used this before

No, the attributes are text strings for a reason. There could be several thousand different ones, using masks is just not practical.

I will read the replies carefully, but I'd like to point out right now that:
- There should not be a limit of how many attributes can be associated with a file
- Prepare for a HUGE number of different attributes
- Prepare for attribute "lumping" (ie. some attributes are occur only once, some occur more than 100 times, possible more than 1000 times)

Maybe a database is not the right choice ... but then what is?

Trond · Post by **Trond** » Wed Oct 08, 2008 8:09 pm

Foz wrote: [Tag ID] int, primary key, auto increment
Description varchar(50) indexed

Indexed, now that sounds like some good stuff. Does it make the varchar more easily searchable, and how? (Ie. only for complete string match, not for submatches, etc...)

srod · Post by **srod** » Wed Oct 08, 2008 8:17 pm

Creating indexes will speed up queries using the fields which have been indexed, but could slow down inserts / updates etc. because the database engine will have to update the indexes as well etc.

Foz · Post by **Foz** » Wed Oct 08, 2008 8:17 pm

Indexing on servers means that it stores the rows with a sorted list with a quick search.

Depending on the SQL implementation, you would want it as a Unique Clustered Index - the table is sorted on insert, and only allows one instance of that entry in the table. This results in making searching extremely quick.

Trond · Post by **Trond** » Wed Oct 08, 2008 8:45 pm

Foz, your solutions looks great, except that I forgot to state the requirements for phrase search.

Maybe (or surely) I should have told you this from the beginning, but here is some sample data:
Seoan feat. Ekaterina - Under the Moon (full Moon Mix)
Starland Vocal Band - Afternoon Delight
Gazebo - I Like Chopin, Album: Musiques de Soirée Dance & Années 80

Basically I want a searchable index of this, and since like 90% of all songs are about love I figured I'd save some space by not saving the "love" tag so many times. Not to mention artist names and album names will be saved tons of times.

I want to search both the filename as raw text and extra attributes, also as raw text. I thought I could get around the raw text requirement by splitting both the extra attributes and the filename into words and treating them all like short attributes, but then phrase search would break (didn't think of that).

Maybe the best solution is just to use full-text search (and not splitting up the artist and song names into words), but won't that be horribly slow?

Foz · Post by **Foz** » Wed Oct 08, 2008 8:54 pm

This is what I would do:

- build a dictionary of "bad words": I, the, a, of, love
- strip out the punctuation
- split the string on the spaces
- remove the bad words
- insert each new tag - do a check for existing tags, you only need them once

do the same for the file name (yes, insert all the filename bits in the same routine)

so: "Seoan feat. Ekaterina - Under the Moon (full Moon Mix)"
becomes a tag list of
"Seoan", "feat" "Ekaterina", "Under", "Moon", "full", "Mix"

if the filename is exactly the same but with a .mp3 at the end then it would add the tag of "mp3"