Page 1 of 1
im not sure how to structure this! Array of arrays or Lists
Posted: Mon Feb 15, 2016 12:59 pm
by Keya
hello, here is my problem! i hope i explain it ok
my program will read lots of different strings (single words, all uppercase) and there's going to be millions of them.
I only add a string if it doesn't exist, so before adding a new string I scan all the existing strings, but that's going to get slow with a long list. So to reduce the problem I want to store in 256 lists instead of 1, where the 1st character of each string determines which list it's stored in. This way when I add a word starting with B i only have to check the other B-words.
However, there could be milliiions of words starting with one letter and very few of another, so using a traditional 2D array isn't an option as more bytes would actually be wasted than used ... the array for X-words should be small and not the same size as the large E-words array for example.
And while using ArrayA(), ArrayB() etc would work, it would mean having 256 manually created arrays, so i wouldnt be able to reference all of them as a single array, so id basically have to write 256 duplicates of each code or at least 256 Case statements.
So it's a simple problem but i'm not sure there's a simple solution!? and speed is critical!
thankyou for any suggestions!!!!!

Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 1:27 pm
by s0ula55a551n
Looking into this I cannot think of good way of doing this with purebasic. I would say sorted list but as there is no binary search or other serach functionality in purebasic can only search using for next then not sure how fast it would be searching through a list
have you tried adding several million numbers to a list and then searching through it to see how long it takes ?
Is the searching time critical ? Are you trying to create a spell checker ?
Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 1:29 pm
by DontTalkToMe
Depends on what you have to do with the data once stored, but arrays and lists don't look particularly attractive.
If you only need to store unique strings, why don't use a map ?
You can check quickly if a string is already present, and if not you just add it to the map.
If you need to have the string sorted after the insertion, you can sort them once at the end or you can have them sorted in real time by using a balanced tree. Again fast lookup, moderately fast insertion, and sorted as you go.
PB does not have a tree so you have to do it by yourself.
Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 1:39 pm
by spikey
Code: Select all
Structure WORDBUCKET
Map WordList.I()
EndStructure
NewMap TheBucket.WORDBUCKET()
TheBucket("A")\WordList("APPLE")
TheBucket("A")\WordList("ANTEATER")
TheBucket("A")\WordList("ANNAPOLIS")
TheBucket("B")\WordList("BANANA")
TheBucket("B")\WordList("BADGER")
TheBucket("B")\WordList("BALTIMORE")
TheBucket("C")\Wordlist("CARROT")
TheBucket("C")\Wordlist("CAMEL")
TheBucket("C")\Wordlist("CHICAGO")
ForEach TheBucket()
Debug MapKey(TheBucket())
ForEach TheBucket()\WordList()
Debug MapKey(TheBucket()\WordList())
Next TheBucket()\WordList()
Next TheBucket()
This way you can iterate the whole list with just two ForEach loops. You could do the same thing with Lists too but your unique test on adding would be a bit more complex. It depends whether or not preserving the word order is important. If sublist size still gets too complex/slow you could then further subdivide big groups EA, EB, EC etc if you wanted...
Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 1:46 pm
by PureLust
Maybe you could store a checksum for each string as well (just a .Word or .Long) and compare it with the checksum of the new entry.
Then String-compare the found entries again (just in case, you will end up with the same checksum for different Strings).
Comparing just the checksums will be way faster than comparing millions of strings.
[Further:]
If your List is sorted, you could store all Strings in ONE large List.
Then create a second 'KeyPoint'-List with significant points in your main-list. (e.g. a key for the first Strings beginning with "aa..:", "ab...", "ac...",...), which keeps the ListPosition of that KeyPoint-Entry in your Main-List.
If you have a new entry, find the right (previous) KeyPoint for this new entry, set the ListPosition of your Main-List to that position and start checksum-comparsion from that point to the next marked KeyPoint-Position.
If you do not find the checksum between these two index-points, it's a new entry, otherwise ... see above.

Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 1:58 pm
by infratec
When it comes to such an amount of data why not a database
That's exactly for what they are.
And if you set the word to unique you can not insert it twice.
Bernd
Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 2:10 pm
by kenmo
Agreed, Map or Database may be the best approach.
You can use 256 arrays or lists...
Code: Select all
Structure KSubtable
List K.s()
EndStructure
Global Dim KTable.KSubtable(255)
Procedure AddToTable(Keyword.s)
i.a = PeekC(@Keyword)
ForEach KTable(i)\K()
If KTable(i)\K() = Keyword
ProcedureReturn
EndIf
Next
AddElement(KTable(i)\K())
KTable(i)\K() = Keyword
EndProcedure
; add elements
AddToTable("ARTICHOKE")
AddToTable("APPLE")
AddToTable("BASIC")
AddToTable("BASIC")
AddToTable("BANANA")
AddToTable("ZEBRA")
AddToTable("ZEBRA")
AddToTable("BASIC")
; sort and display
For i = 0 To 255
SortList(KTable(i)\K(), #PB_Sort_Ascending)
ForEach KTable(i)\K()
Debug KTable(i)\K()
Next
Next i
But when you are dealing with millions of entries, scanning one "big" list vs. scanning one "less big" list may make little difference, O(n/256) ~ O(n)
Re: im not sure how to structure this! Array of arrays
Posted: Mon Feb 15, 2016 11:49 pm
by Keya
thanks everyone for suggestions! lots to ponder
Purelust,
yes i will almost certainly be using a checksum as well... one of my ideas is using crc16, in which case i'll have 65536 arrays, so the actual amount of string comparisons would get quite small. I'm not sure if i'll also include a 32bit checksum with each string.
infratec,
Just in regards to using a database... because SPEED is absolutely critical and the data is relatively simple (lots of data but simple) I can't help but think a general-purpose DB would be very slow compared to a custom native solution? even if using something like SQLightening. I need to run this for many days, so every tiny speed improvement is critical!
kenmo,
That array-list system looks
possibly what i need, or at least quite close, thanks!
ok all, imma get back to reading your suggestions over and over to get the brain flowing!

Re: im not sure how to structure this! Array of arrays
Posted: Tue Feb 16, 2016 2:43 am
by Keya
kenmo yes as it turns out i'll be able to create a solution based on your demo, thankyou again very much for that!

Ive tried to simplify the problem by using Quad instead of String, and im using 65535 Lists, using the first Word of the Quad as the List array index.
Here's a SINGLE-list version...
Code: Select all
Structure Sublist
List Test.q()
EndStructure
Global TestList.Sublist
Procedure AddToTable(Newval.q)
ForEach TestList\Test()
If TestList\Test() = Newval
ProcedureReturn
EndIf
Next
AddElement(TestList\Test())
TestList\Test() = Newval
EndProcedure
Time1 = ElapsedMilliseconds()
Define irnd.q
For i = 1 To 200000
irnd.q = 9223372036854775807 - i
AddToTable(irnd)
Next i
Time2 = ElapsedMilliseconds(): MessageRequester("OK","Time="+Str(Time2-Time1))
That takes
17 SECONDS just to add the 200,000 Quads to the single list! And of course it gets exponentially worse the more elements in each list it has to scan through. That's basically the original problem i was facing and why i wanted to reduce it to arrays.
So with a subtle modification to use array List(65535) based on kenmo's demo it now takes
7 MILLISECONDS! due to not having to scan as many elements. A little bit more memory required of course but it doesnt seem too much. Problem solved!
Code: Select all
Structure Sublist
List Test.q()
EndStructure
Global Dim TestList.Sublist(65535)
Procedure AddToTable(Newval.q)
i.u = PeekU(@Newval)
ForEach TestList(i)\Test()
If TestList(i)\Test() = Newval
ProcedureReturn
EndIf
Next
AddElement(TestList(i)\Test())
TestList(i)\Test() = Newval
EndProcedure
Time1 = ElapsedMilliseconds()
Define irnd.q
For i = 1 To 200000
irnd.q = 9223372036854775807 - i
AddToTable(irnd)
Next i
Time2 = ElapsedMilliseconds(): MessageRequester("OK","Time="+Str(Time2-Time1))
Re: im not sure how to structure this! Array of arrays or Li
Posted: Tue Feb 16, 2016 3:49 am
by idle
Keya wrote:hello, here is my problem! i hope i explain it ok
my program will read lots of different strings (single words, all uppercase) and there's going to be millions of them.
what you're wanting is a prefix trie, it's what they're made for. The more words entered the more space efficient the storage becomes
assuming the node size matches the alphabet size. Insertion and lookup are O(k) order.
A rudimentary one is here
http://www.purebasic.fr/english/viewtop ... 12&t=43652
Otherwise just use a map and size it appropriately so it doesn't have to rehash itself.
Re: im not sure how to structure this! Array of arrays or Li
Posted: Tue Feb 16, 2016 6:26 am
by Lunasole
Keya wrote:
my program will read lots of different strings (single words, all uppercase) and there's going to be millions of them.
I only add a string if it doesn't exist, so before adding a new string I scan all the existing strings, but that's going to get slow with a long list. So to reduce the problem I want to store in 256 lists instead of 1, where the 1st character of each string determines which list it's stored in. This way when I add a word starting with B i only have to check the other B-words.
You don't need a lot of lists as it will only lead to that what I call bureaucracy:) not giving enough speed enhancement.
In some other topic I suggested you a binary tree. The simplest realization of it can add over 2 millions of words (to be exact, very long words: 32 chars I used to test) only wasting ~4 seconds for this, checking for already added strings is performed as well.
Implementation I'm using is one of simplest and shortest, it is present as singe array of simple structures (2 integer fields + 1 string with your data).
However it lacks functions to delete/sort items so it is usable only to add/find/edit/findduplicates and store/load, as it can be easily iterated and dumped to file.
Also other minus is that at 5+ millions it performance drops dramatically and it takes 20+ seconds on my PC to add such amount of items. Thats because simplest tree has some depth limit after reaching which it looses advantages and becomes so slow as regular list. To deal quickly with really big number of items need something more complicated, like Red-Black tree.
Well I'm talking too much

just try the tree if it match your requirements, I think it should be fast enough for 1-2 millions of items at least