Database implementation/organization
Database implementation/organization
I'm complete database newbie, and I have a question of how to organize a database of a certain kind.
The functionality required is as follows:
A huge list of filenames with text attributes. This list should be searchable. Searching and inserting and deleting items should be fast. The list items never needs to be modified apart from deletion.
How would I best do this with tables and so on? I have absolutely no clue what makes a database fast.
Edit: each file can have any number of text attributes, a common case will be 4, rare case 10. However, many files can (and will) have the same attributes. I estimate the number of files for a given attribute string will range from 1 to 200 with 10 a common case.
The functionality required is as follows:
A huge list of filenames with text attributes. This list should be searchable. Searching and inserting and deleting items should be fast. The list items never needs to be modified apart from deletion.
How would I best do this with tables and so on? I have absolutely no clue what makes a database fast.
Edit: each file can have any number of text attributes, a common case will be 4, rare case 10. However, many files can (and will) have the same attributes. I estimate the number of files for a given attribute string will range from 1 to 200 with 10 a common case.
Since you are in for a variable number of attributes (assuming each attribute is a text field) then I would suggest two tables linked by a common 'key' field which can be a simple integer identifier. With SQL queries you can easily cross link the tables and pull out all entries related to a particular file.
The first table could simply hold the filenames together with a key field. The second table would hold the attributes with one record for each attribute linked to the appropriate filename.
E.g. suppose filename "test.exe" has 3 attributes; "h", "a" and "r".
Then the entry in the main table would consist of (001, 'test.exe') where 001 is the alloted key field. In the attributes table we then have 3 entries : (001, 'h') and (001, 'a') and (001, 'r') etc. A simple SQL statement of the form (depending on which database engine you opt for) :
would return a recordset which includes the following records :
Course you will need to bone up on SQL first, and also decide if a database is your best option in this case! 
The first table could simply hold the filenames together with a key field. The second table would hold the attributes with one record for each attribute linked to the appropriate filename.
E.g. suppose filename "test.exe" has 3 attributes; "h", "a" and "r".
Then the entry in the main table would consist of (001, 'test.exe') where 001 is the alloted key field. In the attributes table we then have 3 entries : (001, 'h') and (001, 'a') and (001, 'r') etc. A simple SQL statement of the form (depending on which database engine you opt for) :
Code: Select all
SELECT FILES.filename, ATTRIBUTES.attribute FROM FILES, ATTRIBUTES WHERE FILES.Id = ATTRIBUTES.ID
Code: Select all
('test.exe', 'h')
('test.exe', 'a')
('test.exe', 'r')
etc.

I may look like a mule, but I'm not a complete ass.
I think a database is a good idea, because there will potentially be a lot of data.
I have a question about your database layout (I hope I understood it correctly):
With this scheme I'd have to store each attribute string again for all files. So if 100 files has the attributes Red Big Squirrel Sensational, then I'd use a whopping 2,5 kb for storing the attributes, while 25 bytes would suffice, right? That's what I want to avoid. Is that not possible?
Frankly, I've never understood how databases could operate even remotely efficiently, and the more I learn about them the more stupid they seem. It could be that the engines uses some really clever tricks under the hood, but if not, there just seems to be either data duplication and/or severe query inefficiency every time.
Please tell me how the database can efficiently search ie. the attribute table. It's not sorted by search criteria! How can it be searched quickly???
I have a question about your database layout (I hope I understood it correctly):
With this scheme I'd have to store each attribute string again for all files. So if 100 files has the attributes Red Big Squirrel Sensational, then I'd use a whopping 2,5 kb for storing the attributes, while 25 bytes would suffice, right? That's what I want to avoid. Is that not possible?
Frankly, I've never understood how databases could operate even remotely efficiently, and the more I learn about them the more stupid they seem. It could be that the engines uses some really clever tricks under the hood, but if not, there just seems to be either data duplication and/or severe query inefficiency every time.
Please tell me how the database can efficiently search ie. the attribute table. It's not sorted by search criteria! How can it be searched quickly???
Just a suggestion - but what about this:
Then you'd have to store only the filename and one single flag containing all the attributes.
You can then retrieve the information like so:
Dunno if it suits your needs, I just remembered that I have used this before 
Code: Select all
#r = 1
#a = 2
#h = 4
#w = 8
#s = 16
#t = 32
;usw...
Flag = #r | #a | #t | #h
You can then retrieve the information like so:
Code: Select all
If Flag & #a
Debug "Contains attribute 'a'"
EndIf

Windows 7 & PureBasic 4.4
In which case you can use 3 tables; one for the filenames, one containing the unique attribute strings and another one to link them together.
Table : FILES
The above statement will return the following recordset :
etc.
Table : FILES
Table : ATTRIBUTESfile_ID, filename
1, 'test.exe'
2, 'myDoc.txt'
etc.
Table : LINKattribute_ID, attribute
'rdo', 'read only'
'arc', 'archive'
'hid', 'hidden'
etc.
Then the following SQL statement will retrieve all attributes for the 'test.exe' file:file_ID, attribute_ID
1, 'rdo'
1, 'arc'
2, 'arc'
etc.
Code: Select all
SELECT ATTRIBUTES.attribute FROM ATTRIBUTES, LINK WHERE LINK.file_Id = 1 AND ATTRIBUTES.attribute_ID = LINK.attribute_ID
Code: Select all
'read only'
'archive'
I may look like a mule, but I'm not a complete ass.
There are a few methods that you can implement, but I think the way you want to go is with a "Tag Cloud".
As you are creating the files you give them tags which have a Tag table, such as:
You have a similar Files table:
And then you have a File Tag Lookup:
Next you have to create the database entries, and you have two choices, split up all the words or allow phrases. I'll include both. So add a file:
(I'm assuming that you get the auto increment value here - you'll need to store it in memory)
(again I'm assuming that you get the auto increment value here for each insert - you'll need to store it in memory)
and finally we can fill in the lookup table:
and then to do a search you can do for a phrase:
or to separate the words out:
If you need an example, I'll put together an SQLite example for you.
*** Edit: dammit, beaten by srod!
As you are creating the files you give them tags which have a Tag table, such as:
Code: Select all
[Tag ID] int, primary key, auto increment
Description varchar(50) indexed
Code: Select all
[File ID] int, primary key, auto increment
Name varchar(255)
Code: Select all
[File ID] int
[Tag ID] int, indexed
Code: Select all
insert into Files ([Name]) values ('filea.txt')
Code: Select all
insert into Tags ([Description]) values ('Red')
insert into Tags ([Description]) values ('Big')
insert into Tags ([Description]) values ('Squirrel')
insert into Tags ([Description]) values ('Sensational')
insert into Tags ([Description]) values ('Red Big Squirrel Sensational')
and finally we can fill in the lookup table:
Code: Select all
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 1)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 2)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 3)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 4)
insert into [File Tag Lookup] ([File ID], [Tag ID]) values (1, 5)
/* comments: this is adding if it was a new empty table:
filea.txt, Red
filea.txt, Big
filea.txt, Squirrel
filea.txt, Sensational
filea.txt, Red Big Squirrel Sensational
*/
Code: Select all
select [Name]
from Files
Where [File ID] in (
select [File ID]
from [File Tag Lookup]
where [Tag ID] in (
select [Tag ID]
from [Tags]
where Description = 'Red Big Squirrel Sensational'
)
)
Code: Select all
select [Name]
from Files
Where [File ID] in (
select [File ID]
from [File Tag Lookup]
where [Tag ID] in (
select [Tag ID]
from [Tags]
where Description in ('Red', 'Big', 'Sensational')
)
)
*** Edit: dammit, beaten by srod!
Last edited by Foz on Wed Oct 08, 2008 7:45 pm, edited 1 time in total.
Duplicated Data, not the best, better a table with combined attributessrod wrote: Table : LINKfile_ID, attribute_ID
1, 'rdo'
1, 'arc'
2, 'arc'
etc.

PureBasic 5.73 | SpiderBasic 2.30 | Windows 10 Pro (x64) | Linux Mint 20.1 (x64)
Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.

Old bugs good, new bugs bad! Updates are evil: might fix old bugs and introduce no new ones.

- Rook Zimbabwe
- Addict
- Posts: 4322
- Joined: Tue Jan 02, 2007 8:16 pm
- Location: Cypress TX
- Contact:
You can download this:
http://www.bluemesapc.com/Downloads/filelist1.zip
+/- 8k zipped
Has 1 table: FileTable1
Columns are:
ID (autonumber for indexing)
filename
filedescription
ATT0 (the following are YES/NO)
ATT1
ATT2
ATT3
ATT4
ATT5
ATT6
ATT7
ATT8
ATT9
ATT10 (I was going to use this for ACTIVE or NOT ACTIVE)
If you need help pn a QUERY from the data... let me know
http://www.bluemesapc.com/Downloads/filelist1.zip
+/- 8k zipped
Has 1 table: FileTable1
Columns are:
ID (autonumber for indexing)
filename
filedescription
ATT0 (the following are YES/NO)
ATT1
ATT2
ATT3
ATT4
ATT5
ATT6
ATT7
ATT8
ATT9
ATT10 (I was going to use this for ACTIVE or NOT ACTIVE)
If you need help pn a QUERY from the data... let me know
No, the attributes are text strings for a reason. There could be several thousand different ones, using masks is just not practical.milan1612 wrote:Just a suggestion - but what about this:Then you'd have to store only the filename and one single flag containing all the attributes.Code: Select all
#r = 1 #a = 2 #h = 4 #w = 8 #s = 16 #t = 32 ;usw... Flag = #r | #a | #t | #h
You can then retrieve the information like so:Dunno if it suits your needs, I just remembered that I have used this beforeCode: Select all
If Flag & #a Debug "Contains attribute 'a'" EndIf
I will read the replies carefully, but I'd like to point out right now that:
- There should not be a limit of how many attributes can be associated with a file
- Prepare for a HUGE number of different attributes
- Prepare for attribute "lumping" (ie. some attributes are occur only once, some occur more than 100 times, possible more than 1000 times)
Maybe a database is not the right choice ... but then what is?
Indexing on servers means that it stores the rows with a sorted list with a quick search.
Depending on the SQL implementation, you would want it as a Unique Clustered Index - the table is sorted on insert, and only allows one instance of that entry in the table. This results in making searching extremely quick.
Depending on the SQL implementation, you would want it as a Unique Clustered Index - the table is sorted on insert, and only allows one instance of that entry in the table. This results in making searching extremely quick.
Foz, your solutions looks great, except that I forgot to state the requirements for phrase search.
Maybe (or surely) I should have told you this from the beginning, but here is some sample data:
Seoan feat. Ekaterina - Under the Moon (full Moon Mix)
Starland Vocal Band - Afternoon Delight
Gazebo - I Like Chopin, Album: Musiques de Soirée Dance & Années 80
Basically I want a searchable index of this, and since like 90% of all songs are about love I figured I'd save some space by not saving the "love" tag so many times. Not to mention artist names and album names will be saved tons of times.
I want to search both the filename as raw text and extra attributes, also as raw text. I thought I could get around the raw text requirement by splitting both the extra attributes and the filename into words and treating them all like short attributes, but then phrase search would break (didn't think of that).
Maybe the best solution is just to use full-text search (and not splitting up the artist and song names into words), but won't that be horribly slow?
Maybe (or surely) I should have told you this from the beginning, but here is some sample data:
Seoan feat. Ekaterina - Under the Moon (full Moon Mix)
Starland Vocal Band - Afternoon Delight
Gazebo - I Like Chopin, Album: Musiques de Soirée Dance & Années 80
Basically I want a searchable index of this, and since like 90% of all songs are about love I figured I'd save some space by not saving the "love" tag so many times. Not to mention artist names and album names will be saved tons of times.
I want to search both the filename as raw text and extra attributes, also as raw text. I thought I could get around the raw text requirement by splitting both the extra attributes and the filename into words and treating them all like short attributes, but then phrase search would break (didn't think of that).
Maybe the best solution is just to use full-text search (and not splitting up the artist and song names into words), but won't that be horribly slow?
This is what I would do:
- build a dictionary of "bad words": I, the, a, of, love
- strip out the punctuation
- split the string on the spaces
- remove the bad words
- insert each new tag - do a check for existing tags, you only need them once
do the same for the file name (yes, insert all the filename bits in the same routine)
so: "Seoan feat. Ekaterina - Under the Moon (full Moon Mix)"
becomes a tag list of
"Seoan", "feat" "Ekaterina", "Under", "Moon", "full", "Mix"
if the filename is exactly the same but with a .mp3 at the end then it would add the tag of "mp3"
- build a dictionary of "bad words": I, the, a, of, love
- strip out the punctuation
- split the string on the spaces
- remove the bad words
- insert each new tag - do a check for existing tags, you only need them once
do the same for the file name (yes, insert all the filename bits in the same routine)
so: "Seoan feat. Ekaterina - Under the Moon (full Moon Mix)"
becomes a tag list of
"Seoan", "feat" "Ekaterina", "Under", "Moon", "full", "Mix"
if the filename is exactly the same but with a .mp3 at the end then it would add the tag of "mp3"