A small procedure asm
Re: A small procedure asm
It's a tricky problem as the memory and execution time grows rapidly as k increases. A bloom filter would help to constrain the memory of a trie or map as you're only interested in kmers that appear more than once. A trie is great for small k but as k increases the memory grows rapidly due to the singleton nodes that aren't shared.
A bloom filter will tell you that an item exists in the set with 100% certainty but it will also return false positives telling you an item exists when it doesn't. To get around this you can use a back to back bloom filter or a counting filter to rule out false positives before adding the item to a Trie or map where you can check it. you can also do set operations on blooms with logical Or, Xor,And which could be very useful.
I did a test of a 250,000 long sequence over a range of k=1 to k=31, it took 30 seconds and used 259.08mb
A bloom filter will tell you that an item exists in the set with 100% certainty but it will also return false positives telling you an item exists when it doesn't. To get around this you can use a back to back bloom filter or a counting filter to rule out false positives before adding the item to a Trie or map where you can check it. you can also do set operations on blooms with logical Or, Xor,And which could be very useful.
I did a test of a 250,000 long sequence over a range of k=1 to k=31, it took 30 seconds and used 259.08mb
Re: A small procedure asm
@Idle,
thank you very much for the conclusions in a trie, even if it does'nt help the k-problem.
"... telling you an item exists when it doesn't." --> My spontaneos thought "OK, we have the potential needles now, let's verify them with something like PCMPEQB or PCMPISTRI" was vanished in light of the amount of verifications quickly.
"To get around this ..." --> I read a bit about to control the blooms, but I am too far away to judge your hint.
" I did a test .... 250.000, k 1 to 31, 30 sec., 260 mb .... " --> These numbers sound more than expectable
thank you very much for the conclusions in a trie, even if it does'nt help the k-problem.
"... telling you an item exists when it doesn't." --> My spontaneos thought "OK, we have the potential needles now, let's verify them with something like PCMPEQB or PCMPISTRI" was vanished in light of the amount of verifications quickly.
"To get around this ..." --> I read a bit about to control the blooms, but I am too far away to judge your hint.
" I did a test .... 250.000, k 1 to 31, 30 sec., 260 mb .... " --> These numbers sound more than expectable
Re: A small procedure asm
A decent bloom filter is probably a must have in your field so it's worth considering.Sooraa wrote: ↑Fri May 07, 2021 7:38 am @Idle,
thank you very much for the conclusions in a trie, even if it does'nt help the k-problem.
"... telling you an item exists when it doesn't." --> My spontaneos thought "OK, we have the potential needles now, let's verify them with something like PCMPEQB or PCMPISTRI" was vanished in light of the amount of verifications quickly.
"To get around this ..." --> I read a bit about to control the blooms, but I am too far away to judge your hint.
The performance is pretty dismal, though that's not surprising given that its not threaded" I did a test .... 250.000, k 1 to 31, 30 sec., 260 mb .... " --> These numbers sound more than expectable
Re: A small procedure asm
@Idle:
By the way:
In light of SARS-COV-2 I find it worth to mention, that the phi-X174 virus with 5.386 bps has specialised to infect the Escheria coli bacteria with 4.7 million bp and so use it's metabolism. That seem to be normal already since millions of years. We could be happy, that we are meanwhile able to technologically intervene in this game of nature.....
I ment it hopeful: 250 KB DNA-Input (the small phi-X174 has only 5.386 bp), the k's from 1 to 31 evaluated, 260 mb memory in 30 secs.... First hand not too bad. I could estimate your idea better, if you could make a test in the size of the ecoli genome with 4.641.652 DNA Input, please.The performance is pretty dismal, though that's not surprising given that its not threaded
By the way:
In light of SARS-COV-2 I find it worth to mention, that the phi-X174 virus with 5.386 bps has specialised to infect the Escheria coli bacteria with 4.7 million bp and so use it's metabolism. That seem to be normal already since millions of years. We could be happy, that we are meanwhile able to technologically intervene in this game of nature.....
Re: A small procedure asm
@Sooraa
I expect 4.641.652 will take a long time. will have a look in an hour or two.
I expect 4.641.652 will take a long time. will have a look in an hour or two.
Re: A small procedure asm
I thought about it some more and I think a dual approach would be best.
For k=1 to k=12 an array will work very fast. For a bigger k a different approach would be needed.
A little question related to asm optimization; do you compile your applications as 32 or 64 bit ?
For k=1 to k=12 an array will work very fast. For a bigger k a different approach would be needed.
A little question related to asm optimization; do you compile your applications as 32 or 64 bit ?
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
Re: A small procedure asm
@Wilbert,
as a standard, I use 64-bit executables.
I jumped into the topic since the time that the output-capacities of DNA sequencers grew faster than the computing capacities (A single next-generation sequencer easily generates billions of short DNA sequences (approx. length 200 each). This outraces the moore-law in the follow-up computing steps like alignment, error correction etc. till today. Today the analysis is the bottleneck.
One of my secondary motivations is to provide something under "non-linux" desktop environments. Under Linux there are trie- and bloom-filter versions with "big iron" 512-GB memory installations, or AWS-cloud solutions. Not all small labs can afford the fees and cost here.
To your thoughts:
in any case I am interested in any expansion of my knowledge in this matter. So, although my current PB-map approach works decently fast till k=12 I am eager to see your probably smart array-idea.
One thing I didn't stress up to now: To save some resources during k-mer counting (in my environment results are approx. 20%) the canonical counting is used. This is because f.e. ATCGAC is the same as the reverse complement of GTCGAT. The 20% don't help in the big picture, especially it's a single byte problem what in the best could be handled by BMI-instructions (unfortunately just one Execution Unit does this) or a jump table which mixes up the caches/pipelines a lot. In my impression this holds true for the 2-bit DNA file-input: Efficiently compact, but a stepstone due to single byte execution.
as a standard, I use 64-bit executables.
I jumped into the topic since the time that the output-capacities of DNA sequencers grew faster than the computing capacities (A single next-generation sequencer easily generates billions of short DNA sequences (approx. length 200 each). This outraces the moore-law in the follow-up computing steps like alignment, error correction etc. till today. Today the analysis is the bottleneck.
One of my secondary motivations is to provide something under "non-linux" desktop environments. Under Linux there are trie- and bloom-filter versions with "big iron" 512-GB memory installations, or AWS-cloud solutions. Not all small labs can afford the fees and cost here.
To your thoughts:
in any case I am interested in any expansion of my knowledge in this matter. So, although my current PB-map approach works decently fast till k=12 I am eager to see your probably smart array-idea.
One thing I didn't stress up to now: To save some resources during k-mer counting (in my environment results are approx. 20%) the canonical counting is used. This is because f.e. ATCGAC is the same as the reverse complement of GTCGAT. The 20% don't help in the big picture, especially it's a single byte problem what in the best could be handled by BMI-instructions (unfortunately just one Execution Unit does this) or a jump table which mixes up the caches/pipelines a lot. In my impression this holds true for the 2-bit DNA file-input: Efficiently compact, but a stepstone due to single byte execution.
Re: A small procedure asm
I didn't implement the reverse complement for now but could add that in although it will decrease the performance a bit.
I'm very curious how you think my array version (c-style array) performs.
Code: Select all
#x64_ASM_Optimized = #True
Enumeration
#v_A
#v_C
#v_G
#v_T
EndEnumeration
Structure kMem
k.l
count.l[0]
EndStructure
Structure cTable
c.l[256]
EndStructure
DisableDebugger; important for asm code
Procedure kCount(*InputData.Ascii, Encoding = 0, k = 1)
; Use Encoding 0 for Ascii input, 1 for Unicode
Protected cTable.cTable, *kMem.kMem
Protected.l cSize, i, memSize, n
If k >= 1 And k <= 12
; set cSize
If Encoding = 0
cSize = 1; character size 1 byte
Else
cSize = 2; character size 2 bytes
EndIf
; init lookup table
FillMemory(cTable, 1024, -1, #PB_Long)
; set A, C, G and T values
cTable\c['A'] = #v_A
cTable\c['C'] = #v_C
cTable\c['G'] = #v_G
cTable\c['T'] = #v_T
; make lowercase behave the same as uppercase
cTable\c['a'] = cTable\c['A']
cTable\c['c'] = cTable\c['C']
cTable\c['g'] = cTable\c['G']
cTable\c['t'] = cTable\c['T']
; get first k characters
memSize = 4
For i = 1 To k
n = n << 2 | cTable\c[*InputData\a]
If n = -1
ProcedureReturn #Null; invalid input
EndIf
*InputData + cSize
memSize << 2
Next
; allocate memory
*kMem = AllocateMemory(memSize + 4)
If *kMem
*kMem\k = k
mask = 1 << (k * 2) - 1
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64 And #x64_ASM_Optimized
!lea r8, [p.v_cTable]
!mov r9, [p.p_InputData]
!mov r10, [p.p_kMem]
!add r10, 4
!mov r11d, [p.v_mask]
!mov edx, [p.v_n]
!mov ecx, [p.v_cSize]
!.loop:
!add dword [r10 + rdx * 4], 1
!movzx eax, byte [r9]
!shl edx, 2
!and edx, r11d
!add r9, rcx
!or edx, [r8 + rax * 4]
!jns .loop
CompilerElse
Repeat
*kMem\count[n] + 1
n = (n << 2 & mask) | cTable\c[*InputData\a]
*InputData + cSize
Until n = -1
CompilerEndIf
EndIf
EndIf
ProcedureReturn *kMem
EndProcedure
EnableDebugger
Procedure kShowResult(*kMem.kMem)
Protected.l i, k, n, n_, max
Protected Dim nTable.a(3)
nTable(#v_A) = 'A'
nTable(#v_C) = 'C'
nTable(#v_G) = 'G'
nTable(#v_T) = 'T'
If *kMem
k = *kMem\k
Protected Dim asciiBuffer.a(k)
If k >= 1 And k <= 12
max = 1 << (k * 2) - 1
For n = 0 To max
If *kMem\count[n]
; convert n to ascii
n_ = n
For i = k To 1 Step -1
asciiBuffer(i) = nTable(n_ & 3)
n_ >> 2
Next
; debug ascii and count
Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n])
EndIf
Next
EndIf
EndIf
EndProcedure
; Test the code
s.s = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAA"
For k = 1 To 12
*kMem = kCount(@s, #PB_Compiler_Unicode, k)
If *kMem
kShowResult(*kMem)
FreeMemory(*kMem)
EndIf
Next
For larger k values, there seems to come a point where every k-mer is unique.
Is it useful to continue after that point with even larger k values ?
Last edited by wilbert on Sun May 09, 2021 5:11 am, edited 1 time in total.
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
Re: A small procedure asm
@Wilbert,
first of all, I like your clear coding style.
We can't and don't need to carry the positions of occurence with it. Today I use the position to recover the string in combination of the k-mer length during sorting and file-output. So, for your approch, we don't need them.
Visible uniqueness: The reason is the small sample we/you use up to now. The "beauty of the game of life" shows at higher string-lengths
I don't want to pollute the board with larger data. So I sent you the 5.386 bp phi-X174 genome data as Ascii by personal message. The whole game starts to become interesting with the ecoli genome with 5 mio. bp. I could send it also by PM or by email if you like. Then you see the numbers floating.
You have a better clue where the tables go to. I would test your idea if a little obstacle is solved by you: Your "memSize = 4" creates an error "overflow in an dynamically allocated memory block". If I set it to 5, the error vanishes, but you must have your reasons for the "4".
To better get an overview I integrated
; debug ascii and count
m + 1
;If *kMem\count[n] > 1 ;: Input()
Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n]) + " " + Str(m)
;EndIf
So please take a look to your PM for phi-X174 and give advice how to transfer ecoli.
first of all, I like your clear coding style.
We can't and don't need to carry the positions of occurence with it. Today I use the position to recover the string in combination of the k-mer length during sorting and file-output. So, for your approch, we don't need them.
Visible uniqueness: The reason is the small sample we/you use up to now. The "beauty of the game of life" shows at higher string-lengths
I don't want to pollute the board with larger data. So I sent you the 5.386 bp phi-X174 genome data as Ascii by personal message. The whole game starts to become interesting with the ecoli genome with 5 mio. bp. I could send it also by PM or by email if you like. Then you see the numbers floating.
You have a better clue where the tables go to. I would test your idea if a little obstacle is solved by you: Your "memSize = 4" creates an error "overflow in an dynamically allocated memory block". If I set it to 5, the error vanishes, but you must have your reasons for the "4".
To better get an overview I integrated
; debug ascii and count
m + 1
;If *kMem\count[n] > 1 ;: Input()
Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n]) + " " + Str(m)
;EndIf
So please take a look to your PM for phi-X174 and give advice how to transfer ecoli.
Re: A small procedure asm
@Idle,
what happened to your test?
Do you need more input?
I am eager to your observations and hope your system didn't go to smoke...
what happened to your test?
Do you need more input?
I am eager to your observations and hope your system didn't go to smoke...
Re: A small procedure asm
I made a small mistake at the allocation point.
Can you try again ?
Code: Select all
#x64_ASM_Optimized = #True
Enumeration
#v_A
#v_C
#v_G
#v_T
EndEnumeration
Structure kMem
k.l
count.l[0]
EndStructure
Structure cTable
c.l[256]
EndStructure
DisableDebugger; important for asm code
Procedure kCount(*InputData.Ascii, Encoding = 0, k = 1)
; Use Encoding 0 for Ascii input, 1 for Unicode
Protected cTable.cTable, *kMem.kMem
Protected.l cSize, i, memSize, n
If k >= 1 And k <= 12
; set cSize
If Encoding = 0
cSize = 1; character size 1 byte
Else
cSize = 2; character size 2 bytes
EndIf
; init lookup table
FillMemory(cTable, 1024, -1, #PB_Long)
; set A, C, G and T values
cTable\c['A'] = #v_A
cTable\c['C'] = #v_C
cTable\c['G'] = #v_G
cTable\c['T'] = #v_T
; make lowercase behave the same as uppercase
cTable\c['a'] = cTable\c['A']
cTable\c['c'] = cTable\c['C']
cTable\c['g'] = cTable\c['G']
cTable\c['t'] = cTable\c['T']
; get first k characters
memSize = 4
For i = 1 To k
n = n << 2 | cTable\c[*InputData\a]
If n = -1
ProcedureReturn #Null; invalid input
EndIf
*InputData + cSize
memSize << 2
Next
; allocate memory
*kMem = AllocateMemory(memSize + 4)
If *kMem
*kMem\k = k
mask = 1 << (k * 2) - 1
CompilerIf #PB_Compiler_Processor = #PB_Processor_x64 And #x64_ASM_Optimized
!lea r8, [p.v_cTable]
!mov r9, [p.p_InputData]
!mov r10, [p.p_kMem]
!add r10, 4
!mov r11d, [p.v_mask]
!mov edx, [p.v_n]
!mov ecx, [p.v_cSize]
!.loop:
!add dword [r10 + rdx * 4], 1
!movzx eax, byte [r9]
!shl edx, 2
!and edx, r11d
!add r9, rcx
!or edx, [r8 + rax * 4]
!jns .loop
CompilerElse
Repeat
*kMem\count[n] + 1
n = (n << 2 & mask) | cTable\c[*InputData\a]
*InputData + cSize
Until n = -1
CompilerEndIf
EndIf
EndIf
ProcedureReturn *kMem
EndProcedure
EnableDebugger
Procedure kShowResult(*kMem.kMem, threshold = 0)
Protected.l i, k, max, n, n_, t
Protected Dim nTable.a(3)
nTable(#v_A) = 'A'
nTable(#v_C) = 'C'
nTable(#v_G) = 'G'
nTable(#v_T) = 'T'
If *kMem
k = *kMem\k
Protected Dim asciiBuffer.a(k)
If k >= 1 And k <= 12
max = 1 << (k * 2) - 1
For n = 0 To max
If *kMem\count[n]
t + 1
If *kMem\count[n] > threshold
; convert n to ascii
n_ = n
For i = k To 1 Step -1
asciiBuffer(i) = nTable(n_ & 3)
n_ >> 2
Next
; debug ascii and count
Debug PeekS(@asciiBuffer(1), k, #PB_Ascii)+" "+ Str(*kMem\count[n])+" "+Str(t)
EndIf
EndIf
Next
EndIf
EndIf
EndProcedure
; Test the code
s.s = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAA"
For k = 1 To 12
*kMem = kCount(@s, #PB_Compiler_Unicode, k)
If *kMem
kShowResult(*kMem, 1); show only those with count above threshold value
FreeMemory(*kMem)
EndIf
Next
It might seem a bit slow now because it loops over all possible combinations when showing the results.
But this is only the case with debugger enabled. If you want the counts returned as a PB list for example or output the info to the console so it can be compiled with debugger off, it would be faster.
Do you internally use a list to store the results ?
And if so, what fields do you store ?
I could for example output a list with only the ones with a count above 1 and the position where they occur.
I was also wondering in "KmerLen: 1-1, 1-1-Frequency: 4, file-size/nucleotids: 75" what the frequency means.
You could give me a download link for the ecoli file or use a service like https://wetransfer.com/
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)
Re: A small procedure asm
The test would have exceeded my available ram. I will do some tests with the bloom filter from my bitmodule.
viewtopic.php?f=12&t=57409
The bloom filter could be improved but it would be a good test for it
Re: A small procedure asm
@Wilbert,
Test results:
- No errors observed
- The results of 5.386 bp phi-X174 genome with your pgm are the same as with mine pgm
- My pgm needs 43 ms to produce the results on phi-X174, your pgm needs 77 ms
- Memory demands: on phi-X174 about the same
Your questions:
- 1-1-Frequency: 4, file-size/nucleotids: 75 means the size of our data sample of the first 75 bytes of 4,7 mb ecoli. 1-1 means from k-mer 1 to k-mer 1 and produces the frequency "4", because as 1-2 only the 4 A,C,G,Ts are processed. The 1-2-Frequency produces "16" as 4^2 combinations because all are found and 2-3 shows "57" as the number of combinations found in the 75 byte dataset.
- I do not use any lists yet, I sort & group the results in an array. Lists use a foreward- and a backward pointer. But let's see, because the interesting part comes now:
With 4,7 mb Ecoli bacteria data it will show up, which memory- and timing differences to expect using larger datasets.
With "wetransfer" I would also need your eMail address. So I give you the direct download address of the source:
https://www.ncbi.nlm.nih.gov/nuccore/U0 ... port=fasta
Yust cut the Ascii-data & paste it to any editor.
Since PB-strings are limited to 8.192 chars you would probably need to load with ReadFile and ReadData. I use it this way and then I am more than eager to see, how your pgm-approach behaves. My time for the ecoli dataset is 14 sec. for a 1-12 run. A run with up to 1-16 finishes in reasonable time.
Test results:
- No errors observed
- The results of 5.386 bp phi-X174 genome with your pgm are the same as with mine pgm
- My pgm needs 43 ms to produce the results on phi-X174, your pgm needs 77 ms
- Memory demands: on phi-X174 about the same
Your questions:
- 1-1-Frequency: 4, file-size/nucleotids: 75 means the size of our data sample of the first 75 bytes of 4,7 mb ecoli. 1-1 means from k-mer 1 to k-mer 1 and produces the frequency "4", because as 1-2 only the 4 A,C,G,Ts are processed. The 1-2-Frequency produces "16" as 4^2 combinations because all are found and 2-3 shows "57" as the number of combinations found in the 75 byte dataset.
- I do not use any lists yet, I sort & group the results in an array. Lists use a foreward- and a backward pointer. But let's see, because the interesting part comes now:
With 4,7 mb Ecoli bacteria data it will show up, which memory- and timing differences to expect using larger datasets.
With "wetransfer" I would also need your eMail address. So I give you the direct download address of the source:
https://www.ncbi.nlm.nih.gov/nuccore/U0 ... port=fasta
Yust cut the Ascii-data & paste it to any editor.
Since PB-strings are limited to 8.192 chars you would probably need to load with ReadFile and ReadData. I use it this way and then I am more than eager to see, how your pgm-approach behaves. My time for the ecoli dataset is 14 sec. for a 1-12 run. A run with up to 1-16 finishes in reasonable time.
Re: A small procedure asm
@Wilbert,
I may be this is of interest:
phi-X174 produces a frequency of 34.181
ecoli produces a frequency of 6.917.666
I may be this is of interest:
phi-X174 produces a frequency of 34.181
ecoli produces a frequency of 6.917.666
Re: A small procedure asm
The problem with timings is that you should compile with debugger disabled but that way you can't show the debug data.
When I do only the counting without presenting the results, on my computer the timings with my method for a 1-12 run are
Phi-X174, 14 milliseconds
Ecoli, 230 milliseconds
What I still don't understand is what information you store in your array.
Do you store for example only the counts above 1 or not ?
If for example you use a structured array with fields for k value, count and the sequence you will get a very large array.
If you also want to process files that are very large like the human genome you mentioned, don't you run out of system memory if you try to store every count in your array ?
When I do only the counting without presenting the results, on my computer the timings with my method for a 1-12 run are
Phi-X174, 14 milliseconds
Ecoli, 230 milliseconds
What I still don't understand is what information you store in your array.
Do you store for example only the counts above 1 or not ?
If for example you use a structured array with fields for k value, count and the sequence you will get a very large array.
If you also want to process files that are very large like the human genome you mentioned, don't you run out of system memory if you try to store every count in your array ?
Windows (x64)
Raspberry Pi OS (Arm64)
Raspberry Pi OS (Arm64)