A small procedure asm

Post by **idle** » Fri May 07, 2021 4:25 am

It's a tricky problem as the memory and execution time grows rapidly as k increases. A bloom filter would help to constrain the memory of a trie or map as you're only interested in kmers that appear more than once. A trie is great for small k but as k increases the memory grows rapidly due to the singleton nodes that aren't shared.

A bloom filter will tell you that an item exists in the set with 100% certainty but it will also return false positives telling you an item exists when it doesn't. To get around this you can use a back to back bloom filter or a counting filter to rule out false positives before adding the item to a Trie or map where you can check it. you can also do set operations on blooms with logical Or, Xor,And which could be very useful.

I did a test of a 250,000 long sequence over a range of k=1 to k=31, it took 30 seconds and used 259.08mb

Sooraa · Post by **Sooraa** » Fri May 07, 2021 7:38 am

@Idle,

thank you very much for the conclusions in a trie, even if it does'nt help the k-problem.

"... telling you an item exists when it doesn't." --> My spontaneos thought "OK, we have the potential needles now, let's verify them with something like PCMPEQB or PCMPISTRI" was vanished in light of the amount of verifications quickly.

"To get around this ..." --> I read a bit about to control the blooms, but I am too far away to judge your hint.

" I did a test .... 250.000, k 1 to 31, 30 sec., 260 mb .... " --> These numbers sound more than expectable

Post by **idle** » Fri May 07, 2021 8:12 am

Sooraa wrote: ↑Fri May 07, 2021 7:38 am @Idle,

thank you very much for the conclusions in a trie, even if it does'nt help the k-problem.

"... telling you an item exists when it doesn't." --> My spontaneos thought "OK, we have the potential needles now, let's verify them with something like PCMPEQB or PCMPISTRI" was vanished in light of the amount of verifications quickly.

"To get around this ..." --> I read a bit about to control the blooms, but I am too far away to judge your hint.

A decent bloom filter is probably a must have in your field so it's worth considering.

" I did a test .... 250.000, k 1 to 31, 30 sec., 260 mb .... " --> These numbers sound more than expectable

The performance is pretty dismal, though that's not surprising given that its not threaded

Sooraa · Post by **Sooraa** » Fri May 07, 2021 9:04 am

@Idle:

The performance is pretty dismal, though that's not surprising given that its not threaded

I ment it hopeful: 250 KB DNA-Input (the small phi-X174 has only 5.386 bp), the k's from 1 to 31 evaluated, 260 mb memory in 30 secs.... First hand not too bad. I could estimate your idea better, if you could make a test in the size of the ecoli genome with 4.641.652 DNA Input, please.

By the way:
In light of SARS-COV-2 I find it worth to mention, that the phi-X174 virus with 5.386 bps has specialised to infect the Escheria coli bacteria with 4.7 million bp and so use it's metabolism. That seem to be normal already since millions of years. We could be happy, that we are meanwhile able to technologically intervene in this game of nature.....

Post by **idle** » Fri May 07, 2021 9:21 am

@Sooraa
I expect 4.641.652 will take a long time. will have a look in an hour or two.

wilbert · Post by **wilbert** » Sat May 08, 2021 5:32 am

I thought about it some more and I think a dual approach would be best.
For k=1 to k=12 an array will work very fast. For a bigger k a different approach would be needed.

A little question related to asm optimization; do you compile your applications as 32 or 64 bit ?

Sooraa · Post by **Sooraa** » Sat May 08, 2021 9:14 am

@Wilbert,

as a standard, I use 64-bit executables.

I jumped into the topic since the time that the output-capacities of DNA sequencers grew faster than the computing capacities (A single next-generation sequencer easily generates billions of short DNA sequences (approx. length 200 each). This outraces the moore-law in the follow-up computing steps like alignment, error correction etc. till today. Today the analysis is the bottleneck.

One of my secondary motivations is to provide something under "non-linux" desktop environments. Under Linux there are trie- and bloom-filter versions with "big iron" 512-GB memory installations, or AWS-cloud solutions. Not all small labs can afford the fees and cost here.

To your thoughts:
in any case I am interested in any expansion of my knowledge in this matter. So, although my current PB-map approach works decently fast till k=12 I am eager to see your probably smart array-idea.

One thing I didn't stress up to now: To save some resources during k-mer counting (in my environment results are approx. 20%) the canonical counting is used. This is because f.e. ATCGAC is the same as the reverse complement of GTCGAT. The 20% don't help in the big picture, especially it's a single byte problem what in the best could be handled by BMI-instructions (unfortunately just one Execution Unit does this) or a jump table which mixes up the caches/pipelines a lot. In my impression this holds true for the 2-bit DNA file-input: Efficiently compact, but a stepstone due to single byte execution.

wilbert · Post by **wilbert** » Sat May 08, 2021 5:35 pm

Sooraa wrote: ↑Sat May 08, 2021 9:14 amalthough my current PB-map approach works decently fast till k=12 I am eager to see your probably smart array-idea.

I didn't implement the reverse complement for now but could add that in although it will decrease the performance a bit.
I'm very curious how you think my array version (c-style array) performs.

Code: Select all

#x64_ASM_Optimized = #True

Enumeration
  #v_A
  #v_C
  #v_G
  #v_T
EndEnumeration

Structure kMem
  k.l
  count.l[0]
EndStructure

Structure cTable
  c.l[256]
EndStructure



DisableDebugger; important for asm code

Procedure kCount(*InputData.Ascii, Encoding = 0, k = 1)
  ; Use Encoding 0 for Ascii input, 1 for Unicode
  
  Protected cTable.cTable, *kMem.kMem
  Protected.l cSize, i, memSize, n
  
  If k >= 1 And k <= 12
    
    ; set cSize
    If Encoding = 0
      cSize = 1; character size 1 byte
    Else
      cSize = 2; character size 2 bytes
    EndIf
    
    ; init lookup table
    FillMemory(cTable, 1024, -1, #PB_Long)
    
    ; set A, C, G and T values
    cTable\c['A'] = #v_A
    cTable\c['C'] = #v_C
    cTable\c['G'] = #v_G
    cTable\c['T'] = #v_T
    
    ; make lowercase behave the same as uppercase
    cTable\c['a'] = cTable\c['A']
    cTable\c['c'] = cTable\c['C']
    cTable\c['g'] = cTable\c['G']
    cTable\c['t'] = cTable\c['T']
    
    ; get first k characters
    memSize = 4
    For i = 1 To k
      n = n << 2 | cTable\c[*InputData\a]
      If n = -1
        ProcedureReturn #Null; invalid input
      EndIf
      *InputData + cSize
      memSize << 2
    Next
    
    ; allocate memory
    *kMem = AllocateMemory(memSize + 4)
    If *kMem
      *kMem\k = k
      mask = 1 << (k * 2) - 1
      
      CompilerIf #PB_Compiler_Processor = #PB_Processor_x64 And #x64_ASM_Optimized
        !lea r8, [p.v_cTable]
        !mov r9, [p.p_InputData]
        !mov r10, [p.p_kMem]
        !add r10, 4
        !mov r11d, [p.v_mask]
        !mov edx, [p.v_n]
        !mov ecx, [p.v_cSize]
        !.loop:
        !add dword [r10 + rdx * 4], 1
        !movzx eax, byte [r9]
        !shl edx, 2
        !and edx, r11d
        !add r9, rcx     
        !or edx, [r8 + rax * 4]
        !jns .loop
        
      CompilerElse
        Repeat
          *kMem\count[n] + 1
          n = (n << 2 & mask) | cTable\c[*InputData\a]
          *InputData + cSize        
        Until n = -1
        
      CompilerEndIf
      
    EndIf
    
  EndIf
  ProcedureReturn *kMem
  
EndProcedure



EnableDebugger

Procedure kShowResult(*kMem.kMem)
  
  Protected.l i, k, n, n_, max
  Protected Dim nTable.a(3)
  
  nTable(#v_A) = 'A'
  nTable(#v_C) = 'C'
  nTable(#v_G) = 'G'
  nTable(#v_T) = 'T'
  
  If *kMem
    k = *kMem\k
    
    Protected Dim asciiBuffer.a(k)
    
    If k >= 1 And k <= 12
      max = 1 << (k * 2) - 1
      For n = 0 To max
        If *kMem\count[n]
          ; convert n to ascii
          n_ = n
          For i = k To 1 Step -1
            asciiBuffer(i) = nTable(n_ & 3)
            n_ >> 2
          Next
          
          ; debug ascii and count
          Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n])
          
        EndIf
      Next
    EndIf
  EndIf
  
EndProcedure




; Test the code

s.s = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAA"
For k = 1 To 12
  *kMem = kCount(@s, #PB_Compiler_Unicode, k)
  If *kMem
    kShowResult(*kMem)
    FreeMemory(*kMem)
  EndIf
Next

Is it important to store the position of the first occurrence as well ?
For larger k values, there seems to come a point where every k-mer is unique.
Is it useful to continue after that point with even larger k values ?

Sooraa · Post by **Sooraa** » Sat May 08, 2021 10:23 pm

@Wilbert,
first of all, I like your clear coding style.

We can't and don't need to carry the positions of occurence with it. Today I use the position to recover the string in combination of the k-mer length during sorting and file-output. So, for your approch, we don't need them.

Visible uniqueness: The reason is the small sample we/you use up to now. The "beauty of the game of life" shows at higher string-lengths

I don't want to pollute the board with larger data. So I sent you the 5.386 bp phi-X174 genome data as Ascii by personal message. The whole game starts to become interesting with the ecoli genome with 5 mio. bp. I could send it also by PM or by email if you like. Then you see the numbers floating.

You have a better clue where the tables go to. I would test your idea if a little obstacle is solved by you: Your "memSize = 4" creates an error "overflow in an dynamically allocated memory block". If I set it to 5, the error vanishes, but you must have your reasons for the "4".

To better get an overview I integrated
; debug ascii and count
m + 1
;If *kMem\count[n] > 1 ;: Input()
Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n]) + " " + Str(m)
;EndIf

So please take a look to your PM for phi-X174 and give advice how to transfer ecoli.

Sooraa · Post by **Sooraa** » Sat May 08, 2021 10:56 pm

@Idle,
what happened to your test?
Do you need more input?

I am eager to your observations and hope your system didn't go to smoke...

wilbert · Post by **wilbert** » Sun May 09, 2021 5:23 am

Sooraa wrote: ↑Sat May 08, 2021 10:23 pmI would test your idea if a little obstacle is solved by you: Your "memSize = 4" creates an error "overflow in an dynamically allocated memory block". If I set it to 5, the error vanishes, but you must have your reasons for the "4".

I made a small mistake at the allocation point.
Can you try again ?

Code: Select all

#x64_ASM_Optimized = #True

Enumeration
  #v_A
  #v_C
  #v_G
  #v_T
EndEnumeration

Structure kMem
  k.l
  count.l[0]
EndStructure

Structure cTable
  c.l[256]
EndStructure



DisableDebugger; important for asm code

Procedure kCount(*InputData.Ascii, Encoding = 0, k = 1)
  ; Use Encoding 0 for Ascii input, 1 for Unicode
  
  Protected cTable.cTable, *kMem.kMem
  Protected.l cSize, i, memSize, n
  
  If k >= 1 And k <= 12
    
    ; set cSize
    If Encoding = 0
      cSize = 1; character size 1 byte
    Else
      cSize = 2; character size 2 bytes
    EndIf
    
    ; init lookup table
    FillMemory(cTable, 1024, -1, #PB_Long)
    
    ; set A, C, G and T values
    cTable\c['A'] = #v_A
    cTable\c['C'] = #v_C
    cTable\c['G'] = #v_G
    cTable\c['T'] = #v_T
    
    ; make lowercase behave the same as uppercase
    cTable\c['a'] = cTable\c['A']
    cTable\c['c'] = cTable\c['C']
    cTable\c['g'] = cTable\c['G']
    cTable\c['t'] = cTable\c['T']
    
    ; get first k characters
    memSize = 4
    For i = 1 To k
      n = n << 2 | cTable\c[*InputData\a]
      If n = -1
        ProcedureReturn #Null; invalid input
      EndIf
      *InputData + cSize
      memSize << 2
    Next
    
    ; allocate memory
    *kMem = AllocateMemory(memSize + 4)
    If *kMem
      *kMem\k = k
      mask = 1 << (k * 2) - 1
      
      CompilerIf #PB_Compiler_Processor = #PB_Processor_x64 And #x64_ASM_Optimized
        !lea r8, [p.v_cTable]
        !mov r9, [p.p_InputData]
        !mov r10, [p.p_kMem]
        !add r10, 4
        !mov r11d, [p.v_mask]
        !mov edx, [p.v_n]
        !mov ecx, [p.v_cSize]
        !.loop:
        !add dword [r10 + rdx * 4], 1
        !movzx eax, byte [r9]
        !shl edx, 2
        !and edx, r11d
        !add r9, rcx     
        !or edx, [r8 + rax * 4]
        !jns .loop
        
      CompilerElse
        Repeat
          *kMem\count[n] + 1
          n = (n << 2 & mask) | cTable\c[*InputData\a]
          *InputData + cSize        
        Until n = -1
        
      CompilerEndIf
      
    EndIf
    
  EndIf
  ProcedureReturn *kMem
  
EndProcedure



EnableDebugger

Procedure kShowResult(*kMem.kMem, threshold = 0)
  
  Protected.l i, k, max, n, n_, t
  Protected Dim nTable.a(3)
  
  nTable(#v_A) = 'A'
  nTable(#v_C) = 'C'
  nTable(#v_G) = 'G'
  nTable(#v_T) = 'T'
  
  If *kMem
    k = *kMem\k
    
    Protected Dim asciiBuffer.a(k)
    
    If k >= 1 And k <= 12
      max = 1 << (k * 2) - 1
      For n = 0 To max
        If *kMem\count[n]
          t + 1
          If *kMem\count[n] > threshold
            
            ; convert n to ascii
            n_ = n
            For i = k To 1 Step -1
              asciiBuffer(i) = nTable(n_ & 3)
              n_ >> 2
            Next
            
            ; debug ascii and count
            
            Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n])+" "+Str(t)
            
          EndIf
        EndIf
      Next
    EndIf
  EndIf
  
EndProcedure




; Test the code

s.s = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAA"
For k = 1 To 12
  *kMem = kCount(@s, #PB_Compiler_Unicode, k)
  If *kMem
    kShowResult(*kMem, 1); show only those with count above threshold value
    FreeMemory(*kMem)
  EndIf
Next

I added a threshold now so you can set above which value you want the results to be displayed.
It might seem a bit slow now because it loops over all possible combinations when showing the results.
But this is only the case with debugger enabled. If you want the counts returned as a PB list for example or output the info to the console so it can be compiled with debugger off, it would be faster.
Do you internally use a list to store the results ?
And if so, what fields do you store ?
I could for example output a list with only the ones with a count above 1 and the position where they occur.
I was also wondering in "KmerLen: 1-1, 1-1-Frequency: 4, file-size/nucleotids: 75" what the frequency means.

You could give me a download link for the ecoli file or use a service like https://wetransfer.com/

Post by **idle** » Sun May 09, 2021 8:04 am

Sooraa wrote: ↑Sat May 08, 2021 10:56 pm @Idle,
what happened to your test?
Do you need more input?

I am eager to your observations and hope your system didn't go to smoke...

The test would have exceeded my available ram. I will do some tests with the bloom filter from my bitmodule.
viewtopic.php?f=12&t=57409
The bloom filter could be improved but it would be a good test for it

Sooraa · Post by **Sooraa** » Sun May 09, 2021 4:47 pm

@Wilbert,
Test results:
- No errors observed
- The results of 5.386 bp phi-X174 genome with your pgm are the same as with mine pgm
- My pgm needs 43 ms to produce the results on phi-X174, your pgm needs 77 ms
- Memory demands: on phi-X174 about the same

Your questions:
- 1-1-Frequency: 4, file-size/nucleotids: 75 means the size of our data sample of the first 75 bytes of 4,7 mb ecoli. 1-1 means from k-mer 1 to k-mer 1 and produces the frequency "4", because as 1-2 only the 4 A,C,G,Ts are processed. The 1-2-Frequency produces "16" as 4^2 combinations because all are found and 2-3 shows "57" as the number of combinations found in the 75 byte dataset.
- I do not use any lists yet, I sort & group the results in an array. Lists use a foreward- and a backward pointer. But let's see, because the interesting part comes now:
With 4,7 mb Ecoli bacteria data it will show up, which memory- and timing differences to expect using larger datasets.
With "wetransfer" I would also need your eMail address. So I give you the direct download address of the source:
https://www.ncbi.nlm.nih.gov/nuccore/U0 ... port=fasta
Yust cut the Ascii-data & paste it to any editor.
Since PB-strings are limited to 8.192 chars you would probably need to load with ReadFile and ReadData. I use it this way and then I am more than eager to see, how your pgm-approach behaves. My time for the ecoli dataset is 14 sec. for a 1-12 run. A run with up to 1-16 finishes in reasonable time.

Sooraa · Post by **Sooraa** » Sun May 09, 2021 6:04 pm

@Wilbert,
I may be this is of interest:
phi-X174 produces a frequency of 34.181
ecoli produces a frequency of 6.917.666

wilbert · Post by **wilbert** » Sun May 09, 2021 6:09 pm

The problem with timings is that you should compile with debugger disabled but that way you can't show the debug data.
When I do only the counting without presenting the results, on my computer the timings with my method for a 1-12 run are
Phi-X174, 14 milliseconds
Ecoli, 230 milliseconds

What I still don't understand is what information you store in your array.
Do you store for example only the counts above 1 or not ?
If for example you use a structured array with fields for k value, count and the sequence you will get a very large array.
If you also want to process files that are very large like the human genome you mentioned, don't you run out of system memory if you try to store every count in your array ?

PureBasic Forums - English

A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm