A small procedure asm

Bare metal programming in PureBasic, for experienced users
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert
Right now the timing with phi-X174 is not the most important, because it's small and 14 ms vs. 43 ms is close together.
More meaningful is the timing with ecoli which is a thousandfold larger. Here I hoped a lot to your new approch, but the result is striking: As
I described the 1-12 ecoli with my PB-map-pgm lasting 14.000 ms against your 230 ms!!! This a factor of roughly 60 !! Congratulations!

For today it's rather late meanwhile and I would like to take a look to your ecoli-enabled version and take a look to the internals, and to answer your questions at it's best.

Just a quick and dirty extract of my array-structure:
Structure KarrayStruc
totalCount.q
distinctCount.a
uniqueCount.a
DNAindex.l
DNAsearchSeqLen.u
DNAsearchSeq.s
EndStructure
If I could change the concept, my has to be rethought in general anyway.

A last info for today: Currently I tested on ecoli with up to 1-16, starting with 1-17 the RAM & duration stalls the run.
There is no way to experiment with the human genome's 3 GB the next time...
Good night and thank you so far!
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

That happens with quick and dirty: "DNAsearchSeq.s" in the structure is not used, it is extracted from the ReadData-Buffer with DNAindex and k-mer length. At that point I saw your 2015-ASM and contacted you.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

It's very important to think through in advance what you want from the collected information.

Your KarrayStruc with the DNAsearchSeq field removed takes up 16 bytes.
When you process ecoli, after 12 runs the array would be 105 megabyte in size.
I estimate after 31 runs it would be over a gigabyte.
Working with such large structured arrays is almost impossible.
Lists are faster when it comes to sorting them when you want to group but take up even more memory.

I assume you have a goal.
If the goal needs all data, it is what it is and you can only process files that are not very big.
In this page referred to before
https://bioinfologics.github.io/post/20 ... roduction/
they create a table from k = 1 to k = 15 with k, unique, distinct and total counts.
If something like this is your goal, or for example find the shortest unique sequence in the file, you wouldn't need to keep all data and can process bigger files.
Windows (x64)
Raspberry Pi OS (Arm64)
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert

Sure, you are right, there should be a goal. Of course one could define this as "Count & qualify uniques in k-mer up to 1-31 of the human genome presented in state of the art graphics in an desktop environment under (small & fast) PB-Windows". My inspiration was the backlog of Moore's law against next gen. sequencer output). But what does it help, if I don't know the underlying facts? What if theoreticly "possible" is too much to that what I want or can carry out?

My current stage is a feasability study type of thing by finding the "k-mer facts" through experiments and benchmarks with other concepts. That was/is my stage when I saw your 2015-ASM and contacted you. Up to now I have findings in the PB-map features, the arrays and their sorting limits, 2-bit processing, PCMP*STR* concepts and BMI instructions. Idle and you made limiting estimations on trees and Idle is trying to "k-mer" DNA with bloom-filters.

Now you presented another smart approch with promising features. 1-12 does not achieve enough. I have no real clue what happens on the way to 1-31 on bigger datasets up to now, although your 230 ms with ecoli ist an impressing hit.

As soon as I have enough facts, it is the time to define feasible goals in that matter. Don't you agree?

P.S. Your estimate in processing ecoli is right, it produces a 260 mb file sorted and grouped w/o canonical "compression".
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Sooraa wrote: Mon May 10, 2021 7:56 amNow you presented another smart approch with promising features. 1-12 does not achieve enough. I have no real clue what happens on the way to 1-31 on bigger datasets up to now, although your 230 ms with ecoli is an impressing hit.
The approach I used, uses counters for all possible combinations.
It is a fast approach but only for the lower k values as the amount of memory it requires is four times that of the previous k.
For k = 11 it uses 16 mb of working memory.
For k = 12 it uses 64 mb.
For k = 13 it would use 256 mb.
So that approach won't go all the way up to k = 31.
I have an idea about being able to process ecoli up to k = 31 but I have no idea at the moment how much slower it would be.
At the moment it is still an idea that I will have to verify to see if it works. :shock:
Windows (x64)
Raspberry Pi OS (Arm64)
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert

Unfortunately the reply function swallowed my last reply-draft. So once again:

Ad hoc I can confirm the memory requirements of my more or less "conventional" PB-map/array approach for ecoli 1-12.

In any way I don't want to disturb your creative phase for a new idea, but I want to mention that I found a test of mine with ecoli with a step-by step file output (I think jellyfish does it) as a backup approach:
148 sec. total for 1-30 step by step file output (1-1, 2-2, 3-3 etc.) which results in 3,6 GB total, 30-30 alone with 222 mb. This would have to be consolidated to one result-file.

Don't you think it would be beneficial for stepping foreward, if you provide the source for your 230 ms ecoli 1-12 version? This would also allow me to go to larger genomes for our findings meanwhile ...

Keeping the fingers crossed for your new idea....
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Sooraa wrote: Mon May 10, 2021 7:29 pmDon't you think it would be beneficial for stepping foreward, if you provide the source for your 230 ms ecoli 1-12 version? This would also allow me to go to larger genomes for our findings meanwhile ...
It's the code I already posted.
I just timed the kCount procedure on the ecoli genome for k = 1 to 12.
With this method the time the counting takes is more or less linear so a genome that is ten times larger should take 10 times longer to process.

You are talking about larger genomes. What do you have in mind ?
Would processing a genome of 50.000.000 bp be sufficient or even bigger ?
Would it be any useful if you could use bigger k values like k = 40 if it would make the approach slower ?
Windows (x64)
Raspberry Pi OS (Arm64)
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

QWilbert

As I descibed, it's the fact finding stage. "The more and the quicker" is the goal, because a next generation illumina sequencer is able to produce more than 2 terabases of data per week (Illumina sequencing is a analog process)

Up to now I don't know how many machines are working around the globe but since the prices of this equipment dropped dynamically and the biologists
generate sequences in different qualities for more and more creatures it leads to the situation that the computing capabilities can't hold up with the
data which is generated. And not all sequences are shared among the bio-community. This "race" will continue the battle-field is open.

There are the viruses: they range between 1.7 Kb RNA only (hepatitis D) to 2.8 for pandora virus
The bacteriae range from 160 kbp for C.ruddi to 13 Mbp for S.cellulosum
Archae: From 490 kbp to 1.9 Mpp
Eukariots: From 400 Mbps to 130 Gbp (lungfish) up to the herb plant paris japonica with 150 Gbp
And then we have creatures like C.elegans with 100 Mps till H.sapiens with 3.2 Gbp

To try to be informative: All these up to now known creatures differentiate between 1 chromosome and for humans 46 chromosomes. In the chromosomes there are the Genes (approx. 20.000 for humans).
Althoug only 2 % of them are protein coding, 80% were called "junk DNA until recently, since informatics show more and more interactions among them. This is a field of current science.

Wether 50.000.000 bp is the upper cap, I don't know. The demands will increase, that's for sure. The first special hardware FPGA-Adapter as accelerators show up.
It is my opinion, that we realisticly should catch up to the state of the art situation, which is K=31. Especially the functions of "Junk-DNA" are not understood till now and this is a result of the limiting
factor is the computer-hardware limit of 64-bit systems, which leads to the K=31 cap. I myself produced a single K=128-128 of ecoli with 700 mb output.

Unfortunately I can' say the upper limits, but K=31 would be one starting-goal. In the different creature-arenas we must find the cap for the current x64 environments.

If you put the system-managing overhead like maps, trees, blooms and arrays aside for a moment, my basic theory is: DNA is only numbers which x64 can handle quite well with multiple EUs/ports and can be multiplied by the number of cores) The 512-bit instructions would bring us to the field of SIMS. But here I stop thinking.

On top of that "tolerant-matching" could expand the current quality of matching K-mers. (I brought an experimental "one row" Levenstein test to run, no matrix). But that's future.
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert

As a fallback I tried yor version with k=31-31 for step by step K-mer, but it didn't work?
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Sooraa wrote: Tue May 11, 2021 10:23 amAs a fallback I tried you version with k=31-31 for step by step K-mer, but it didn't work?
That's right. I told you that it can't handle such a high k.

A little additional question ...
With the applications you use, is generating a list with canonical reverse faster or slower compared to generating a list without canonical reverse ?
Windows (x64)
Raspberry Pi OS (Arm64)
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert
yes, you did.

With canonical reverse it's approx. 20% slower. Conceptional code:
RComplement = DNAsearchSeq

For z = 0 To KmerLen - 1
; Ascii PB5.40 ; XLAT single-byte-instruction, mit 7/1c oder 7/2 with small table in DataSection
H_DIGIT = PeekA(@RComplement + z) - 65
MOV AL, [v_H_DIGIT] ; Position of table entry
LEA RBX, [HA_TABLE]
XLATB ; Now AL contains ASCII code; nur 1 byte code
MOV [v_ASC_DIGIT], AL ; Store it
PokeA(@RComplement + z, ASC_DIGIT)

; Unicode PB5.72
y = z << 1
;Select Mid(RComplement, z + 1, 1) ; 86 sec bei 1-15
Select PeekU(@RComplement + y) ; 42 sec bei 1-15 ; if match, exits
Case 'C' : PokeU(@RComplement + y, 'G')
Case 'T' : PokeU(@RComplement + y, 'A') ; Case 'T' : CopyMemory(*A, @RComplement + y, 2) ; 6 sec slower
Case 'G' : PokeU(@RComplement + y, 'C')
Case 'A' : PokeU(@RComplement + y, 'T')
EndSelect
Next

RComplement = ReverseString(RComplement) ; reverse complement
z = CompareMemoryString(@RComplement, @DNAsearchSeq, #PB_String_CaseSensitive, -1, #PB_Unicode)
If z = -1
Swap DNAsearchSeq, RComplement ; canonize in lexiographicly order, (option: permute/blend 32 bytes)
EndIf
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Thanks for the info.
I'm still doubting between two different approaches.
Is it of any importance the k-mer counts are returned in alphabetical order or doesn't that matter ?
Windows (x64)
Raspberry Pi OS (Arm64)
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert

Even if it is more convenient, the focus is on controllable memory- and time consumtion on desktop
See, k-mering is to "find the unknown", may be the high counts or the frequencies in a longer genome, or an unknown
long sequence.
The sorting & grouping can though be a selectable option in conventional postprocessing.

On my side I'm focussing on an approach using PCMPIST* with "equal_ordered" and POPCNT. Currently
I'm tossing with mask- and index output. They show incompatible behaviour right now.
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert

FYI: In my try to use (3/1c-latency) "popcnt" with 1 addl. clock by using "pcmpistrm" shows the same obstacles as with Google's
bigdata n-gram project https://books.google.com/ngrams/info# and 50 TB-data under https://storage.googleapis.com/books/ng ... etsv3.html
"Equal Ordered" in PCMP*STR* is a speedy instruction (although some people don't like it), but it has a unfortunate disadvantage:
PCMP*STRM creates "false positves" in form of "sub-substrings".
F.e.: 16-byte search-mask is: "ACGTACGTACGTACGT", the string content is the same: This produces a substring-count of four, not one!
To handle this in K-mers up to 16 destroys the sub-mask counting with popcnt.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Unfortunately I'm not really familiar with these instructions.
I usually try to stick to SSE2 so things work also on older hardware.
Google did come up with this page
http://forums.purebasic.com/german/view ... =3&t=28159
Does that help ?
Windows (x64)
Raspberry Pi OS (Arm64)
Post Reply