A small procedure asm

Sooraa · Post by **Sooraa** » Mon May 17, 2021 8:34 pm

@Wilbert

Thanks for the link. In fact, I jumped into SSE4.2 through Helle's first 64bit-Ascii-SSE4.2 search function and started back then.

I created several versions with PCMP*STR* and became familiar with this instructions.
But as said, my current try to transform the Equal-ordered Index-version into a bitmask output becomes too complicated, because
it also produces a bitmask of sub-substrings.... Then the speed advantage vanishes.

No, I hope that you can trigger something new....

Sooraa · Post by **Sooraa** » Wed May 19, 2021 8:48 am

@Idle, @Wilbert

I thinks it's time for a break in our efforts to tame the big-data-beast "K-mering" of genomes.

I thank you both for your engagement so far. You also helped to detect the limits.

What is learned so far?
- I will fall back to the approach "change memory consumtion into process time"
- PB is a propriate environment for these jobs, even large files can be read by "ReadData" and grabbed by "PEEKS / CopyMemory" in
variable chunks in high speed. This helps to keep the concept "Keep data in memory once only".
- The behaviour of PB-maps can be kept linear in memory-consumtion and process time and they stay reliable and have interesting features
- My long-time concern "PB internally processes double-sized Unicode only" boils down to a 10% performance degration only. So
we are not stuck with the last Ascii-capable PB5.40. This let's enjoy little goodies like "FormatNumber" for the convenient
display of big resulting numbers and the like in the future.

@To all:
- I will wrap the things together and post the core routine in the forum in the next days
- Since the portion of ASM is reduced quite a lot: Would'nt it be worth to start a new thread under "Coding Questions" and get it out of
the ASM-niche? Even if there come up some extensions with ASM it's much more conventional PB-code now and a general topic of
interest IMO. (There are tons of expanding demands and wishes in this Bio-Informatics field). This is especially the case, because it would be
favorable to visualize the results through PB-graphics, even by letting the user interact manually.

Any ideas or recommendations?

Post by **idle** » Wed May 19, 2021 9:30 am

Post in general discussions perhaps? It's a difficult task but can be easily tamed, though it will still take some time. Trading processing time for memory is more or less a necessity given that the problem has quadratic growth but then the additional time is insignificance to the time it would take to page in memory when memory is scarce.
While a bloom filter uses a lot of cycles it can easily be done in parallel and the additional time to set a key far out weighs the effects of paging memory. If you use a back to back bloom you can get the error rate down to 100 False positives for 100,000,000 items and that can be done in ~200mb. you still need to do the counting in a Map or Trie but a Trie has a couple of advantages over a map in this task.

Sooraa · Post by **Sooraa** » Wed May 19, 2021 4:00 pm

@Idle

Yeah, any time also general.

Perhaps you misunderstood the intend to my last post:
Since it became more and more quiet in the topic, I made a proposal to the state of affairs. If you still see chances for a new approach
in this matter, you are more than welcome.

Your statements to solve the demands of this topic will be right. But when I see your long engagement and experience in blooms etc. in this
forum it is not a good idea that I try to evaluate and test this way on my own. There's simply lack the experience here. In any way, I'm "on your command" to support your ideas.

May be that I misunderstood your comments i.r. to memory. Up to now, we discussed only genome-sizes of 5 mio bps. That's a long way to come
close to the management of human-genome size of 3 Gbps. (luckliy it's under 4 GB) but on the other hand it's just a factor of 600....
It's not my goal to cope with lierature-described procedures using real "iron hardware" of > 512 GB main memory on x64 basis.

As mentioned, the envolvement in "n-grams project" of Google-books showed that the concept time against memory comes to results.

I'm eager to hear any progress in this matter....

wilbert · Post by **wilbert** » Wed May 19, 2021 5:04 pm

I'm also still thinking about the k-mer counting subject.
My main problem is deciding on what approach to use and what output should be generated.
The human-genome size may be 3 Gbps but if you want every count from k=1 to 31, there's not enough system memory to hold all that information.
It would help if the output could be filtered.

Sooraa · Post by **Sooraa** » Wed May 19, 2021 6:43 pm

@Idle, @Wilbert,

Code: Select all

it's good to hear that you stick with the problem.

Wilbert, to clarify it perhaps a bit more and to show, where I am (pseudo code) :

Structure KmerCombo ; it's 2*32 = 64 bits
  StructureUnion
    KmerQuad.q
    DNAindex.l
    K_Count.l
  EndStructureUnion
EndStructure

NewMap KmerMap.Kstruc(15000000 * 2) 

For i = 1 To 31
  For KmerLen = i To i
    For DNAindex = 0 To BytesRead - KmerLen ; (in Case of ecoli = 4.7 mio)
      DNAsearchSeq = PeekS(*Buff + DNAindex, KmerLen, #PB_Ascii) ; PB-strings internally are in Unicode (since 5.50 7/2016), Ascii ----> Unicode
      AddMapElement(DNAmap(), DNAsearchSeq) ;, #PB_Map_NoElementCheck)
      DNAMAP()\K_Count + 1
      If DNAMAP()\K_Count < 1
        DNAMAP()\DNAindex = DNAindex
      EndIf 
    Next DNAindex
  Next KmerLen
  
  ForEach KmerMap
    ; append this single K-mer to output-file...
    ; ... WriteString, WriteData with generating DNAsearchSeq of the first and single DNAindex ....
  Next
  
  ; depending on experiments sort via array before ....
  ClearMap(KmerMap())
Next i

- grouping and sorting in the output is achieved this way
- memory size is all about a single KmerLen, then cleared
- DNAMAP()\DNAindex is accessed only once per DNAsearchSeq
- DNAsearchSeq with a mini-alphabet (ACGT) is implicitly the hash of the KmerMap(), the element is just an q.-integer
- We only grab the unique's from the input file via the DNAindex....
- ALL(!) DNAindexes are over-adressed and so nice to have. If needed, it could be generated easily by a dedicated Findstring kind of thing,
or by "#PB_Map_NoElementCheck" (but this let's explode memory)

Sooraa · Post by **Sooraa** » Wed May 19, 2021 6:47 pm

Gent's, you are smart enough to detect my fault:
DNAMAP()\K_Count + 1
comes three lines later than shown...

Post by **idle** » Thu May 20, 2021 5:27 am

I'm not sure yet if the results are all valid yet but for a test of 1,000,000 items it takes ~11 seconds.
It excludes kmers which only appear once in the sequence and It includes the total counts of k and the individual counts of each repetition as well as thier positions in the sequence.
It's running in a thread for each k, so that's the reason the output here is jumbled

Incidentally it took 102 seconds for 4,641,652 items

k=count trie bloom in mb
17=5 memsize mb 0.00 28.75
2=16 memsize mb 0.00 28.75
15=125 memsize mb 0.05 28.75
3=64 memsize mb 0.00 28.75
16=30 memsize mb 0.01 28.75
23=0 memsize mb 0.00 28.75
14=495 memsize mb 0.16 28.75
20=0 memsize mb 0.00 28.75
4=256 memsize mb 0.01 28.75
18=1 memsize mb 0.00 28.75
29=0 memsize mb 0.00 28.75
13=1904 memsize mb 0.50 28.75
22=0 memsize mb 0.00 28.75
24=0 memsize mb 0.00 28.75
5=1024 memsize mb 0.05 28.75
19=0 memsize mb 0.00 28.75
12=7413 memsize mb 1.49 28.75
28=0 memsize mb 0.00 28.75
26=0 memsize mb 0.00 28.75
27=0 memsize mb 0.00 28.75
25=0 memsize mb 0.00 28.75
21=0 memsize mb 0.00 28.75
6=4096 memsize mb 0.19 28.75
11=27496 memsize mb 3.90 28.75
31=0 memsize mb 0.00 28.75
7=16384 memsize mb 0.75 28.75
10=87309 memsize mb 7.71 28.75
30=0 memsize mb 0.00 28.75
8=65262 memsize mb 2.99 28.75
9=149058 memsize mb 7.99 28.75

Sooraa · Post by **Sooraa** » Thu May 20, 2021 1:33 pm

@Idle

I don't think that I understood everything from your test.
102 sec for 4,641,652 items (sounds for the ecoli dataset) sounds well on the first view.

What does "17=5 memsize mb 0.00 28.75" mean?
17 sounds for k
What is the 5?
What is "mb 0.00"?
What is 28.5?
How many threads were running all together?

It includes the total counts of k and the individual counts of each repetition: What is a repetion in this case?
That all positions are kept sound surprisingly good.

If the results are valid, could we state, that the others of the counted total uniques are the distincts (the ones), or are they lost/unknown?

Sooraa · Post by **Sooraa** » Thu May 20, 2021 1:47 pm

@Idle

I just saw your pm. Probably this reduces the # of my questions.

I'll come back asap.

Sooraa · Post by **Sooraa** » Thu May 27, 2021 4:00 pm

I'm back from a few days at the north coast after the repeated shut-downs here.
Still working to present results.

Sooraa · Post by **Sooraa** » Mon May 31, 2021 4:11 pm

Here my status with handling Kmers of DNA:

Last we had the tests with the genom of C. ecoli (4,7 Mio nucleos)
My testbeds:
4 Haswell-cores, 8 MB L3
16 GB Mem
WIN7-64
PB5.43-64 in Ascii Mode
only 1 CPU enabled, HT switched off

Kmer 1 till 31
Duration: 49 sec. w/o file output of the results
100 sec. with file output
One PB-Map, Map-Size=4.570.777
0.5 GB Memory-usage

C. elegans genome (102 MB nucleos, this is more than the avg. of one of the 46 human Chromosomes
(A chromosome containes genes and can be seen as a separated data-compartment)
102 Mio nucleos = 102 MB DNA-file size

Kmer 1 till 31
Duration: 2.000 sec. w/o file output
One PB-Map, Map-Size=96.000.000
9.9 GB memory usage

- tested, but abolished: 4 Maps in parallel to cut the sizes by 4. This screws the caches in this sizes in a way that it becomes impractical timewise
- realized: PB higher than PB5.43 (unicode only) is 30% slower than PB5.43 in Ascii
- restricting to one CPU doubles the throughput of these kinds of code: 8 MB L3-cache become "private" then and the other three L2-caches can be snooped by the cache-miss logic of the CPU in action. Addl. the thermal reserve the chip-dye gets from 3 cut-off CPUs lets the frequency multiplier go in my case up to 4.9 GHz.
- tested, but abolished: Reverse Kmer-Sequence (31 till 1), no advantage
- PB-Maps are stable under every situation
- Single disadvantage with these map-sizes: Starting from Kmer > 15 the PB-Clear/FreeMap() command becomes 40-50% of the time of a single Kmer evaluation step
- That means that there is a potential improvement of the speed factor of 2

I intend to "polish" the code to be published under MIT licence.

1. Can someone give me hints to convert a structured, but only linear code w/o even procedures into an PB-Module? Thanks in advance!
2. Are there any ideas i.r.t to the unproductive time consumtion while freeing the map? It seems it's a two step process by the way.

@Idle
I'll dive ito your approach asap

Post by **idle** » Tue Jun 01, 2021 7:50 pm

It would really help if you sent us your test code by PM. like your code for reading in the fasta format. It's not exactly collaboration otherwise and I just get the impression you want me to code you a kmer counter.

by the way latest test of k=31 for 2,500,000 BP took 4 seconds but if you want me to test it against real genomes send me your code to read them in and what not.

Sooraa · Post by **Sooraa** » Wed Jun 02, 2021 3:19 pm

There might be a slight misunderstanding:

At the beginning I described here in the board my idea to create a "Kmer-counter" for the desctop/WS arena under OSX/Windows/Linux.
State of the art are solutions with 512 GB memory or Cloud applications.
I described that I/we are in the "feasibilty study"-phase. It doesn't make much sense to come-up with "something" where surprises and limits show up over the coarse of moving forward.

And at that stage we still are:

- Wilbert presented an idea which he clearly recognised as amount-limited and indicated tossing with alternatives
- You came up with adapting your Squint-approach:
SQUINT2 is a 2.000 lines PB-module with estimated 80 almost uncommented procedures in PB, macros, assembler and several different hashing routines
I am sure that you feel fine since you work with trees since 2010. Me not at all, my domain is PB with SSTNI(SSE4.2) since the big data project Google-Books 2010
You kindly adopted SQUINT2 with 250.000 randomly generated nucleos with threading support for the single Kmer 31, which shows impressing throughput behavior.
You posted: "I will do some tests on real data over the next few days..."
- I meanwhile experimented with alternatives at the PB-level and reported the results from time to time, last here at May, 31st, showing that Kmer 1-31 of 110 mio. nucleos (c-.elegans) is achievable.

While doing my experiments it suddenly shows, that from K>15 on the well-running PB-Map is waisting 50% of time to free the single Kmer-Maps. These are the unnecessary traps and surprises I mean. This doesn't have to be a show-stopper, but it is an issue at the minimum.

SQUINT2 should be tested with 110 mio. nucleos in a first step and secondly Kmers 1-31 in a further step.
So I was happy about your announcement to do some tests on real data .....
Now I am a bit irritated that you now got the feeling that you code a Kmer counter for me....

I'm not qualified to expand your code to bring it to stable benchmarks.
The 3 different genom-files are all public: The 5.000 nucleos on phi-X74 is a 10 second cut and paste, the c.ecoli with 4.700.000 nucleos may be 2 minutes to get. The "king-class" c-ecoli may be a half hour preparation.

I even don't have a "fasta"-text-loader, it's not necessary in this phase. What, if it shows that .2bit files are the ultimately better solution?

I hope that you rethink your impression i.r. to cooperation.

A last word to the importance of real DNA-data tests:
1. DNA is (luckily) not random. Kmering is the search for the unknown, like in the science-phase of NLP (natural language processing).
F.e. in c.elegans there are clusters of 230 sequences in K-31 where the (Shannon)-entropy is 0.0000000 (certainty) among 446 incidences of entropy 1.9000000 (almost uncertainty): These are the spots of interest: In NLP this were the long discussed "perplexity" (or surprise) areas. And these incidents are even criss-crossed over different Kmers.....
2. Are there any variables or other object which might "explode" with real data? How to estimate the # of unshown results/counts?

As soon as we achieved a stable "counting-engine" the way to statistical analysis like Zipfian distribution are open.....

Although I am curious, I stopped testing for the time beeing at Kmer 1-31, but am sure that it can go much further. What will be detectable there.....?

wilbert · Post by **wilbert** » Wed Jun 02, 2021 3:49 pm

Where can the C. elegans genome (102 MB nucleos) be found to test with ?
When you output the results to a file, do you output every kmer found ?

I believe I asked you before but can you give any indication on what input size you want to work with.
To me, optimizing for 4.7 MB input, 102 MB or 3000 MB input is very different.
It makes no sense to code something that is very fast on a 4.7 MB input if in reality you want to use it for 102 MB input files.
The same for 102 and 3000 MB. When handling 3000 MB input, on a system with limited memory, the OS needs to swap memory to disk while processing it so in that case you focus on trying to minimize the disk swapping to improve speed while the approach used to do so might decrease the performance for smaller files.

PureBasic Forums - English

A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm