im not sure how to structure this! Array of arrays or Lists

Just starting out? Need help? Post your questions and find answers here.
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

im not sure how to structure this! Array of arrays or Lists

Post by Keya »

hello, here is my problem! i hope i explain it ok

my program will read lots of different strings (single words, all uppercase) and there's going to be millions of them.

I only add a string if it doesn't exist, so before adding a new string I scan all the existing strings, but that's going to get slow with a long list. So to reduce the problem I want to store in 256 lists instead of 1, where the 1st character of each string determines which list it's stored in. This way when I add a word starting with B i only have to check the other B-words.

However, there could be milliiions of words starting with one letter and very few of another, so using a traditional 2D array isn't an option as more bytes would actually be wasted than used ... the array for X-words should be small and not the same size as the large E-words array for example.

And while using ArrayA(), ArrayB() etc would work, it would mean having 256 manually created arrays, so i wouldnt be able to reference all of them as a single array, so id basically have to write 256 duplicates of each code or at least 256 Case statements.

So it's a simple problem but i'm not sure there's a simple solution!? and speed is critical!
thankyou for any suggestions!!!!! :)
Last edited by Keya on Tue Feb 16, 2016 3:05 am, edited 1 time in total.
s0ula55a551n
User
User
Posts: 25
Joined: Fri Jan 01, 2016 5:55 pm

Re: im not sure how to structure this! Array of arrays

Post by s0ula55a551n »

Looking into this I cannot think of good way of doing this with purebasic. I would say sorted list but as there is no binary search or other serach functionality in purebasic can only search using for next then not sure how fast it would be searching through a list

have you tried adding several million numbers to a list and then searching through it to see how long it takes ?

Is the searching time critical ? Are you trying to create a spell checker ?
DontTalkToMe
Enthusiast
Enthusiast
Posts: 334
Joined: Mon Feb 04, 2013 5:28 pm

Re: im not sure how to structure this! Array of arrays

Post by DontTalkToMe »

Depends on what you have to do with the data once stored, but arrays and lists don't look particularly attractive.

If you only need to store unique strings, why don't use a map ?
You can check quickly if a string is already present, and if not you just add it to the map.

If you need to have the string sorted after the insertion, you can sort them once at the end or you can have them sorted in real time by using a balanced tree. Again fast lookup, moderately fast insertion, and sorted as you go.

PB does not have a tree so you have to do it by yourself.
User avatar
spikey
Enthusiast
Enthusiast
Posts: 778
Joined: Wed Sep 22, 2010 1:17 pm
Location: United Kingdom

Re: im not sure how to structure this! Array of arrays

Post by spikey »

Code: Select all

Structure WORDBUCKET
  Map WordList.I()
EndStructure

NewMap  TheBucket.WORDBUCKET()

TheBucket("A")\WordList("APPLE")
TheBucket("A")\WordList("ANTEATER")
TheBucket("A")\WordList("ANNAPOLIS")

TheBucket("B")\WordList("BANANA")
TheBucket("B")\WordList("BADGER")
TheBucket("B")\WordList("BALTIMORE")

TheBucket("C")\Wordlist("CARROT")
TheBucket("C")\Wordlist("CAMEL")
TheBucket("C")\Wordlist("CHICAGO")

ForEach TheBucket()
  Debug MapKey(TheBucket())
  ForEach TheBucket()\WordList()
    Debug MapKey(TheBucket()\WordList())
  Next TheBucket()\WordList()
Next TheBucket()
This way you can iterate the whole list with just two ForEach loops. You could do the same thing with Lists too but your unique test on adding would be a bit more complex. It depends whether or not preserving the word order is important. If sublist size still gets too complex/slow you could then further subdivide big groups EA, EB, EC etc if you wanted...
PureLust
Enthusiast
Enthusiast
Posts: 478
Joined: Mon Apr 16, 2007 3:57 am
Location: Germany, NRW

Re: im not sure how to structure this! Array of arrays

Post by PureLust »

Maybe you could store a checksum for each string as well (just a .Word or .Long) and compare it with the checksum of the new entry.

Then String-compare the found entries again (just in case, you will end up with the same checksum for different Strings).

Comparing just the checksums will be way faster than comparing millions of strings.

[Further:]
If your List is sorted, you could store all Strings in ONE large List.
Then create a second 'KeyPoint'-List with significant points in your main-list. (e.g. a key for the first Strings beginning with "aa..:", "ab...", "ac...",...), which keeps the ListPosition of that KeyPoint-Entry in your Main-List.

If you have a new entry, find the right (previous) KeyPoint for this new entry, set the ListPosition of your Main-List to that position and start checksum-comparsion from that point to the next marked KeyPoint-Position.
If you do not find the checksum between these two index-points, it's a new entry, otherwise ... see above. ;)
Last edited by PureLust on Mon Feb 15, 2016 2:04 pm, edited 1 time in total.
[Dynamic-Dialogs] - create complex GUIs the easy way
[DeFlicker] - easily deflicker your resizeable Windows
[WinFX] - Window Effects (incl. 'click-through' Window)
infratec
Always Here
Always Here
Posts: 7659
Joined: Sun Sep 07, 2008 12:45 pm
Location: Germany

Re: im not sure how to structure this! Array of arrays

Post by infratec »

When it comes to such an amount of data why not a database :?:

That's exactly for what they are.
And if you set the word to unique you can not insert it twice.

Bernd
User avatar
kenmo
Addict
Addict
Posts: 2049
Joined: Tue Dec 23, 2003 3:54 am

Re: im not sure how to structure this! Array of arrays

Post by kenmo »

Agreed, Map or Database may be the best approach.

You can use 256 arrays or lists...

Code: Select all

Structure KSubtable
  List K.s()
EndStructure

Global Dim KTable.KSubtable(255)

Procedure AddToTable(Keyword.s)
  i.a = PeekC(@Keyword)
  ForEach KTable(i)\K()
    If KTable(i)\K() = Keyword
      ProcedureReturn
    EndIf
  Next
  AddElement(KTable(i)\K())
  KTable(i)\K() = Keyword
EndProcedure

; add elements
AddToTable("ARTICHOKE")
AddToTable("APPLE")
AddToTable("BASIC")
AddToTable("BASIC")
AddToTable("BANANA")
AddToTable("ZEBRA")
AddToTable("ZEBRA")
AddToTable("BASIC")

; sort and display
For i = 0 To 255
  SortList(KTable(i)\K(), #PB_Sort_Ascending)
  ForEach KTable(i)\K()
    Debug KTable(i)\K()
  Next
Next i
But when you are dealing with millions of entries, scanning one "big" list vs. scanning one "less big" list may make little difference, O(n/256) ~ O(n)
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: im not sure how to structure this! Array of arrays

Post by Keya »

thanks everyone for suggestions! lots to ponder :) :)

Purelust,
yes i will almost certainly be using a checksum as well... one of my ideas is using crc16, in which case i'll have 65536 arrays, so the actual amount of string comparisons would get quite small. I'm not sure if i'll also include a 32bit checksum with each string.

infratec,
Just in regards to using a database... because SPEED is absolutely critical and the data is relatively simple (lots of data but simple) I can't help but think a general-purpose DB would be very slow compared to a custom native solution? even if using something like SQLightening. I need to run this for many days, so every tiny speed improvement is critical!

kenmo,
That array-list system looks possibly what i need, or at least quite close, thanks!

ok all, imma get back to reading your suggestions over and over to get the brain flowing! :)
User avatar
Keya
Addict
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: im not sure how to structure this! Array of arrays

Post by Keya »

kenmo yes as it turns out i'll be able to create a solution based on your demo, thankyou again very much for that! :)
Ive tried to simplify the problem by using Quad instead of String, and im using 65535 Lists, using the first Word of the Quad as the List array index.

Here's a SINGLE-list version...

Code: Select all

Structure Sublist
  List Test.q()
EndStructure

Global TestList.Sublist

Procedure AddToTable(Newval.q)
  ForEach TestList\Test()
    If TestList\Test() = Newval
      ProcedureReturn
    EndIf
  Next
  AddElement(TestList\Test())
  TestList\Test() = Newval
EndProcedure

Time1 = ElapsedMilliseconds()
Define irnd.q
For i = 1 To 200000
  irnd.q = 9223372036854775807 - i    
  AddToTable(irnd)
Next i
Time2 = ElapsedMilliseconds(): MessageRequester("OK","Time="+Str(Time2-Time1))
That takes 17 SECONDS just to add the 200,000 Quads to the single list! And of course it gets exponentially worse the more elements in each list it has to scan through. That's basically the original problem i was facing and why i wanted to reduce it to arrays.

So with a subtle modification to use array List(65535) based on kenmo's demo it now takes 7 MILLISECONDS! due to not having to scan as many elements. A little bit more memory required of course but it doesnt seem too much. Problem solved! :)

Code: Select all

Structure Sublist
  List Test.q()
EndStructure

Global Dim TestList.Sublist(65535)

Procedure AddToTable(Newval.q)
  i.u = PeekU(@Newval)
  ForEach TestList(i)\Test()
    If TestList(i)\Test() = Newval
      ProcedureReturn
    EndIf
  Next
  AddElement(TestList(i)\Test())
  TestList(i)\Test() = Newval
EndProcedure

Time1 = ElapsedMilliseconds()
Define irnd.q
For i = 1 To 200000
  irnd.q = 9223372036854775807 - i    
  AddToTable(irnd)
Next i
Time2 = ElapsedMilliseconds(): MessageRequester("OK","Time="+Str(Time2-Time1))
User avatar
idle
Always Here
Always Here
Posts: 6018
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: im not sure how to structure this! Array of arrays or Li

Post by idle »

Keya wrote:hello, here is my problem! i hope i explain it ok

my program will read lots of different strings (single words, all uppercase) and there's going to be millions of them.
what you're wanting is a prefix trie, it's what they're made for. The more words entered the more space efficient the storage becomes
assuming the node size matches the alphabet size. Insertion and lookup are O(k) order.
A rudimentary one is here
http://www.purebasic.fr/english/viewtop ... 12&t=43652

Otherwise just use a map and size it appropriately so it doesn't have to rehash itself.
Windows 11, Manjaro, Raspberry Pi OS
Image
User avatar
Lunasole
Addict
Addict
Posts: 1091
Joined: Mon Oct 26, 2015 2:55 am
Location: UA
Contact:

Re: im not sure how to structure this! Array of arrays or Li

Post by Lunasole »

Keya wrote:
my program will read lots of different strings (single words, all uppercase) and there's going to be millions of them.

I only add a string if it doesn't exist, so before adding a new string I scan all the existing strings, but that's going to get slow with a long list. So to reduce the problem I want to store in 256 lists instead of 1, where the 1st character of each string determines which list it's stored in. This way when I add a word starting with B i only have to check the other B-words.
You don't need a lot of lists as it will only lead to that what I call bureaucracy:) not giving enough speed enhancement.

In some other topic I suggested you a binary tree. The simplest realization of it can add over 2 millions of words (to be exact, very long words: 32 chars I used to test) only wasting ~4 seconds for this, checking for already added strings is performed as well.

Implementation I'm using is one of simplest and shortest, it is present as singe array of simple structures (2 integer fields + 1 string with your data).
However it lacks functions to delete/sort items so it is usable only to add/find/edit/findduplicates and store/load, as it can be easily iterated and dumped to file.

Also other minus is that at 5+ millions it performance drops dramatically and it takes 20+ seconds on my PC to add such amount of items. Thats because simplest tree has some depth limit after reaching which it looses advantages and becomes so slow as regular list. To deal quickly with really big number of items need something more complicated, like Red-Black tree.
Well I'm talking too much 8) just try the tree if it match your requirements, I think it should be fast enough for 1-2 millions of items at least
"W̷i̷s̷h̷i̷n̷g o̷n a s̷t̷a̷r"
Post Reply