Database implementation/organization

Everything else that doesn't fall into one of the other PB categories.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Database implementation/organization

Post by Trond »

I'm complete database newbie, and I have a question of how to organize a database of a certain kind.

The functionality required is as follows:
A huge list of filenames with text attributes. This list should be searchable. Searching and inserting and deleting items should be fast. The list items never needs to be modified apart from deletion.

How would I best do this with tables and so on? I have absolutely no clue what makes a database fast.

Edit: each file can have any number of text attributes, a common case will be 4, rare case 10. However, many files can (and will) have the same attributes. I estimate the number of files for a given attribute string will range from 1 to 200 with 10 a common case.
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Since you are in for a variable number of attributes (assuming each attribute is a text field) then I would suggest two tables linked by a common 'key' field which can be a simple integer identifier. With SQL queries you can easily cross link the tables and pull out all entries related to a particular file.

The first table could simply hold the filenames together with a key field. The second table would hold the attributes with one record for each attribute linked to the appropriate filename.

E.g. suppose filename "test.exe" has 3 attributes; "h", "a" and "r".

Then the entry in the main table would consist of (001, 'test.exe') where 001 is the alloted key field. In the attributes table we then have 3 entries : (001, 'h') and (001, 'a') and (001, 'r') etc. A simple SQL statement of the form (depending on which database engine you opt for) :

Code: Select all

SELECT FILES.filename, ATTRIBUTES.attribute FROM FILES, ATTRIBUTES WHERE FILES.Id = ATTRIBUTES.ID
would return a recordset which includes the following records :

Code: Select all

('test.exe', 'h')
('test.exe', 'a')
('test.exe', 'r')
etc.
Course you will need to bone up on SQL first, and also decide if a database is your best option in this case! :wink:
I may look like a mule, but I'm not a complete ass.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

I think a database is a good idea, because there will potentially be a lot of data.


I have a question about your database layout (I hope I understood it correctly):
With this scheme I'd have to store each attribute string again for all files. So if 100 files has the attributes Red Big Squirrel Sensational, then I'd use a whopping 2,5 kb for storing the attributes, while 25 bytes would suffice, right? That's what I want to avoid. Is that not possible?

Frankly, I've never understood how databases could operate even remotely efficiently, and the more I learn about them the more stupid they seem. It could be that the engines uses some really clever tricks under the hood, but if not, there just seems to be either data duplication and/or severe query inefficiency every time.

Please tell me how the database can efficiently search ie. the attribute table. It's not sorted by search criteria! How can it be searched quickly???
milan1612
Addict
Addict
Posts: 894
Joined: Thu Apr 05, 2007 12:15 am
Location: Nuremberg, Germany
Contact:

Post by milan1612 »

Just a suggestion - but what about this:

Code: Select all

#r = 1
#a = 2
#h = 4
#w = 8
#s = 16
#t = 32
;usw...

Flag = #r | #a | #t | #h
Then you'd have to store only the filename and one single flag containing all the attributes.
You can then retrieve the information like so:

Code: Select all

If Flag & #a
  Debug "Contains attribute 'a'"
EndIf
Dunno if it suits your needs, I just remembered that I have used this before :P
Windows 7 & PureBasic 4.4
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

In which case you can use 3 tables; one for the filenames, one containing the unique attribute strings and another one to link them together.

Table : FILES
file_ID, filename
1, 'test.exe'
2, 'myDoc.txt'
etc.
Table : ATTRIBUTES
attribute_ID, attribute
'rdo', 'read only'
'arc', 'archive'
'hid', 'hidden'
etc.
Table : LINK
file_ID, attribute_ID
1, 'rdo'
1, 'arc'
2, 'arc'
etc.
Then the following SQL statement will retrieve all attributes for the 'test.exe' file:

Code: Select all

SELECT ATTRIBUTES.attribute FROM ATTRIBUTES, LINK WHERE LINK.file_Id = 1 AND  ATTRIBUTES.attribute_ID = LINK.attribute_ID
The above statement will return the following recordset :

Code: Select all

'read only'
'archive'
etc.
I may look like a mule, but I'm not a complete ass.
Foz
Addict
Addict
Posts: 1359
Joined: Tue Nov 13, 2007 12:42 pm
Location: Manchester, UK

Post by Foz »

There are a few methods that you can implement, but I think the way you want to go is with a "Tag Cloud".
As you are creating the files you give them tags which have a Tag table, such as:

Code: Select all

[Tag ID] int, primary key, auto increment
Description varchar(50) indexed
You have a similar Files table:

Code: Select all

[File ID] int, primary key, auto increment
Name varchar(255)
And then you have a File Tag Lookup:

Code: Select all

[File ID] int
[Tag ID] int,  indexed
Next you have to create the database entries, and you have two choices, split up all the words or allow phrases. I'll include both. So add a file:

Code: Select all

insert into Files ([Name]) values ('filea.txt')
(I'm assuming that you get the auto increment value here - you'll need to store it in memory)

Code: Select all

insert into Tags ([Description]) values ('Red')
insert into Tags ([Description]) values ('Big')
insert into Tags ([Description]) values ('Squirrel')
insert into Tags ([Description]) values ('Sensational')
insert into Tags ([Description]) values ('Red Big Squirrel Sensational')
(again I'm assuming that you get the auto increment value here for each insert - you'll need to store it in memory)

and finally we can fill in the lookup table:

Code: Select all

insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 1)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 2)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 3)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 4)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 5)
/* comments: this is adding if it was a new empty table:
filea.txt, Red
filea.txt, Big
filea.txt, Squirrel
filea.txt, Sensational
filea.txt, Red Big Squirrel Sensational
*/
and then to do a search you can do for a phrase:

Code: Select all

select [Name]
from Files
Where [File ID] in (
    select [File ID]
    from [File Tag Lookup]
    where [Tag ID] in (
        select [Tag ID]
        from [Tags]
        where Description = 'Red Big Squirrel Sensational'
    )
)
or to separate the words out:

Code: Select all

select [Name]
from Files
Where [File ID] in (
    select [File ID]
    from [File Tag Lookup]
    where [Tag ID] in (
        select [Tag ID]
        from [Tags]
        where Description in ('Red', 'Big', 'Sensational')
    )
)
If you need an example, I'll put together an SQLite example for you.

*** Edit: dammit, beaten by srod!
Last edited by Foz on Wed Oct 08, 2008 7:45 pm, edited 1 time in total.
User avatar
ts-soft
Always Here
Always Here
Posts: 5756
Joined: Thu Jun 24, 2004 2:44 pm
Location: Berlin - Germany

Post by ts-soft »

srod wrote: Table : LINK
file_ID, attribute_ID
1, 'rdo'
1, 'arc'
2, 'arc'
etc.
Duplicated Data, not the best, better a table with combined attributes :wink:
PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.
Image
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Thomas this is a standard trick when using many-many relationships. The important thing is that the LINK table does not have a key-field, or at the very least the two link fields shown are not defined as being key fields or even unique fields etc. :)
I may look like a mule, but I'm not a complete ass.
User avatar
Rook Zimbabwe
Addict
Addict
Posts: 4322
Joined: Tue Jan 02, 2007 8:16 pm
Location: Cypress TX
Contact:

Post by Rook Zimbabwe »

You can download this:

http://www.bluemesapc.com/Downloads/filelist1.zip

+/- 8k zipped

Has 1 table: FileTable1

Columns are:

ID (autonumber for indexing)
filename
filedescription
ATT0 (the following are YES/NO)
ATT1
ATT2
ATT3
ATT4
ATT5
ATT6
ATT7
ATT8
ATT9
ATT10
(I was going to use this for ACTIVE or NOT ACTIVE)

If you need help pn a QUERY from the data... let me know
Binarily speaking... it takes 10 to Tango!!!

Image
http://www.bluemesapc.com/
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

milan1612 wrote:Just a suggestion - but what about this:

Code: Select all

#r = 1
#a = 2
#h = 4
#w = 8
#s = 16
#t = 32
;usw...

Flag = #r | #a | #t | #h
Then you'd have to store only the filename and one single flag containing all the attributes.
You can then retrieve the information like so:

Code: Select all

If Flag & #a
  Debug "Contains attribute 'a'"
EndIf
Dunno if it suits your needs, I just remembered that I have used this before :P
No, the attributes are text strings for a reason. There could be several thousand different ones, using masks is just not practical.

I will read the replies carefully, but I'd like to point out right now that:
- There should not be a limit of how many attributes can be associated with a file
- Prepare for a HUGE number of different attributes
- Prepare for attribute "lumping" (ie. some attributes are occur only once, some occur more than 100 times, possible more than 1000 times)

Maybe a database is not the right choice ... but then what is?
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

Foz wrote: [Tag ID] int, primary key, auto increment
Description varchar(50) indexed
Indexed, now that sounds like some good stuff. Does it make the varchar more easily searchable, and how? (Ie. only for complete string match, not for submatches, etc...)
srod
PureBasic Expert
PureBasic Expert
Posts: 10589
Joined: Wed Oct 29, 2003 4:35 pm
Location: Beyond the pale...

Post by srod »

Creating indexes will speed up queries using the fields which have been indexed, but could slow down inserts / updates etc. because the database engine will have to update the indexes as well etc.
I may look like a mule, but I'm not a complete ass.
Foz
Addict
Addict
Posts: 1359
Joined: Tue Nov 13, 2007 12:42 pm
Location: Manchester, UK

Post by Foz »

Indexing on servers means that it stores the rows with a sorted list with a quick search.

Depending on the SQL implementation, you would want it as a Unique Clustered Index - the table is sorted on insert, and only allows one instance of that entry in the table. This results in making searching extremely quick.
Trond
Always Here
Always Here
Posts: 7446
Joined: Mon Sep 22, 2003 6:45 pm
Location: Norway

Post by Trond »

Foz, your solutions looks great, except that I forgot to state the requirements for phrase search.

Maybe (or surely) I should have told you this from the beginning, but here is some sample data:
Seoan feat. Ekaterina - Under the Moon (full Moon Mix)
Starland Vocal Band - Afternoon Delight
Gazebo - I Like Chopin, Album: Musiques de Soirée Dance & Années 80

Basically I want a searchable index of this, and since like 90% of all songs are about love I figured I'd save some space by not saving the "love" tag so many times. Not to mention artist names and album names will be saved tons of times.

I want to search both the filename as raw text and extra attributes, also as raw text. I thought I could get around the raw text requirement by splitting both the extra attributes and the filename into words and treating them all like short attributes, but then phrase search would break (didn't think of that).

Maybe the best solution is just to use full-text search (and not splitting up the artist and song names into words), but won't that be horribly slow?
Foz
Addict
Addict
Posts: 1359
Joined: Tue Nov 13, 2007 12:42 pm
Location: Manchester, UK

Post by Foz »

This is what I would do:

- build a dictionary of "bad words": I, the, a, of, love
- strip out the punctuation
- split the string on the spaces
- remove the bad words
- insert each new tag - do a check for existing tags, you only need them once

do the same for the file name (yes, insert all the filename bits in the same routine)

so: "Seoan feat. Ekaterina - Under the Moon (full Moon Mix)"
becomes a tag list of
"Seoan", "feat" "Ekaterina", "Under", "Moon", "full", "Mix"

if the filename is exactly the same but with a .mp3 at the end then it would add the tag of "mp3"
Post Reply