A small procedure asm

Sooraa · Post by **Sooraa** » Wed Jun 02, 2021 6:22 pm

The most elegant download site is:

https://www.ncbi.nlm.nih.gov/datasets/g ... 04526295.1

The download is self-explanatory. The only thing you should do is convert the file content to all upper case and remove the five textual separators, which look like
">CP038189.1 ...."

Or:
If you can provide an FTP address for me, I could upload it there.

Additional info:
- The upper/lower case distinction is for our later use: lower-case letters are "repeats" from the preprocessing of this file version, upper-case letters are so-called "non-repeating" sequences. For evaluating our core counting process, just put the whole file into one case.

Alternative download-sites:
https://parasite.wormbase.org/Caenorhab ... nfo/Index/
https://www.ncbi.nlm.nih.gov/genome/41

The download site for human genome files is:
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/
Then: "latest", then "hg38.fa.gz"
A single human chromosome is between 50 and 200 million bases in size, which is in the ballpark of C. elegans. For our current purposes it doesn't matter which data we use. What to choose for the features of a real program depends on a variety of intentions, accompanying files, policies and sources ...

Be aware that there are several consortia and database providers with different scientific goals, offering the same genome at different stages and dates.

@Wilbert:
I remember what you mentioned with regard to genome sizes. Give me a bit of time to express my thoughts on that more precisely.
(The first would be: if I can have an x64 register, I'll take it; all smaller sizes are included ...)

Sooraa · Post by **Sooraa** » Wed Jun 02, 2021 6:56 pm

I'm a bit shy about posting big data via the board, but to answer your question "... every Kmer ...", here is a little excerpt from the beginning of the output for elegans:

I just store K and count. The index is used internally to retrieve the DNA string of length K during the output of this test. Just for the purpose of counting we would only need K, count and the string or the index. The float is the self-entropy of the Kmer sequence.
What I skip here are the positions of the multiple occurrences, since these can be retrieved afterwards by a sequential search if we have K and the string or index.
FYI: The file output in this structure is about 68 GB in size for Kmer 1-31. Of course, if we limit the output to the exceptional cases it becomes much smaller: for example, just K as a byte, plus count and index (shorter and faster than strings) as longs. The nuc could be retrieved later, and things like entropy could be computed later as well.

K count nuc entropy
1 18230794 C 0.0000000
1 18206989 G 0.0000000
1 33216911 T 0.0000000
1 33191088 A 0.0000000
2 6394886 TC 1.0000000
2 3439538 GG 0.0000000
2 6553946 TA 1.0000000
2 3450860 CC 0.0000000
2 6357146 CA 1.0000000
2 13914503 TT 0.0000000
2 3416579 GC 1.0000000
2 6383824 GA 1.0000000
2 5230325 CT 1.0000000
2 4967048 GT 1.0000000
2 5221412 AG 1.0000000
2 4968468 AC 1.0000000
2 13896172 AA 0.0000000
2 6353576 TG 1.0000000
2 9105035 AT 1.0000000
2 3192463 CG 1.0000000
3 1760174 ACA 0.9182958
3 1856785 AGA 0.9182958
3 6443628 AAA 0.0000000
3 2253694 TTA 0.9182959
3 1611180 CAT 1.5849625
3 808088 CGT 1.5849625
3 859645 CCT 0.9182958
3 1854296 GTT 0.9182959
3 1105455 CAC 0.9182959
3 562200 CGC 0.9182958
3 607946 CCC 0.0000000
.......
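The entropy column in the listing above looks consistent with the Shannon entropy (in bits) of the letter frequencies inside each Kmer string. That is an inference from the printed values, not something confirmed in the thread, and the function name below is made up. A minimal sketch under that assumption:

```python
from collections import Counter
from math import log2

def self_entropy(kmer):
    """Shannon entropy in bits of the letter distribution within one k-mer.

    Assumption: this is what the 'entropy' column reports. It matches the
    listed values (e.g. ACA -> 0.918..., CAT -> 1.584...) but the original
    routine was not posted.
    """
    n = len(kmer)
    # sum of p * log2(1/p) over the distinct letters of the k-mer
    return sum(c / n * log2(n / c) for c in Counter(kmer).values())

if __name__ == "__main__":
    for kmer in ("A", "TC", "GG", "ACA", "CAT", "CCC"):
        print(kmer, round(self_entropy(kmer), 7))
```

Under this reading a Kmer made of a single repeated letter always scores 0, and a Kmer of length 3 with three different letters scores log2(3).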

wilbert · Post by **wilbert** » Thu Jun 03, 2021 8:34 am

Sooraa wrote: Wed Jun 02, 2021 6:22 pm
The download is self-explanatory. The only thing you should do is convert the file content to all upper case and remove the five textual separators

I did that, but besides A, C, G and T the file also contains N.

Sooraa · Post by **Sooraa** » Thu Jun 03, 2021 1:24 pm

@Wilbert

Sorry, but there are so many versions, formats, precisions and coverages in different databases, each at a different status ...
The "N"s come from another repeat masker and also mark repeats. For our test purposes we can simply strip them out.
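The preparation steps discussed so far (drop the ">" separator lines, put everything into one case, strip the "N"s) could be sketched like this in Python. The function name is made up for illustration and the sample lines are toy data, not taken from the real file:

```python
def clean_fasta(lines):
    """Return one upper-case string of bases from FASTA-style text lines.

    Skips header lines starting with '>', upper-cases everything and
    removes every 'N'. Hypothetical helper, not code from this thread.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # blank line or textual separator
        parts.append(line.upper().replace("N", ""))
    return "".join(parts)

if __name__ == "__main__":
    sample = [">toy header", "acgtNNac", "GGnt"]
    print(clean_fasta(sample))
```

One caveat with this simple approach: removing separators and "N"s glues neighbouring stretches together, so a few Kmers that span such a junction do not exist in the real sequence. For a pure counting speed test that should not matter much.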

To your remark "To me, optimizing for 4.7 MB input, 102 MB or 3000 MB input is very different":
You are right, it's different for sure.

Upper bound: 3,000 MB would cover the complete human genome. As mentioned, I've learned that the 23 human chromosomes in particular are more or less separate "DNA containers" with sizes between 50 and 240 MB. Human DNA is more information-dense than that of the lungfish "Protopterus aethiopicus", which has 139,000 MB but only a few chromosomes. This could be of special interest for looking into what is still called "junk DNA". But that's not the goal.

Lower bound: The 4.7 MB could be computed and handled more or less with PB's "on-board" tools. If the demand were for a batch Kmer processor for many small repeats, this could be of interest. But that's not what I mean right now.

So this brings us into a range of, let's say, 20 to 240 MB.
To this I can add the uncertainty about the maximum Kmer length to aim for. Most Kmer counting programs top out at K = 31. A few manage lengths up to 200.

What is certain is that a Kmer counting program should cover 1 to 31 within one program domain. 1 to 200 would be better, because Kmers as strings criss-cross between the results of different Kmer groups (ACGTACGTACGT appears as clusters in Kmer 12 and, let's say, 17).

Up to now we have the PB-HashMap approach, which showed stability up to 110 MB with Kmer 1-31. Idle showed a very fast SQUINT2 version with 250,000 bps and a single Kmer 31.
Any alternative covering Kmer 1 to 200 with inputs up to 240 MB would be a welcome option.
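For readers who want to see the hash-map idea in its simplest form, here is a small Python stand-in. It is not the PB-HashMap code from this thread, and the function name and defaults are made up. It counts every Kmer for each K in a range, with one map per K:

```python
from collections import Counter

def count_kmers(seq, k_min=1, k_max=31):
    """Count all k-mers of seq for every k in k_min..k_max.

    Naive hash-map sketch: one Counter per k, keyed by the k-mer string.
    Simple, but memory grows quickly with k and input size.
    """
    counts = {}
    for k in range(k_min, k_max + 1):
        counts[k] = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts

if __name__ == "__main__":
    result = count_kmers("ACGTACGT", 1, 3)
    for k, table in result.items():
        for kmer, n in table.items():
            print(k, n, kmer)
```

Keying by string is the easy part. The cost is that each K is a separate pass with its own map, which is why sizes like 110 MB with Kmer 1-31 already stress this kind of design.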

wilbert · Post by **wilbert** » Thu Jun 03, 2021 2:02 pm

Thanks for the additional information.
I'll try to code something capable of processing a 240 MB input and a max K of 31 with a focus on operating speed.
Bigger K values like the 200 you mention require a different approach, which makes it more difficult for me to optimize using asm and to support canonical Kmers.
I believe the approach idle is taking allows for bigger K values.
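Why K up to 31 is a convenient limit: with 2 bits per base a Kmer of length 31 needs 62 bits, so it fits into one 64-bit register. The sketch below shows that packing together with canonical counting (a Kmer and its reverse complement are counted as one, using the smaller of the two packed values). It is an illustration in Python with made-up names, not wilbert's asm code, and the A=0, C=1, G=2, T=3 coding is an assumption chosen so that the complement is simply 3 minus the code:

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed coding; complement = 3 - code

def count_canonical(seq, k):
    """Count canonical k-mers (1 <= k <= 31) as 2-bit packed integers."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1      # keeps the lowest 2*k bits
    shift = 2 * (k - 1)            # position of the first base in the window
    fwd = rev = filled = 0
    counts = Counter()
    for ch in seq:
        code = CODE.get(ch)
        if code is None:           # e.g. 'N': restart the window
            fwd = rev = filled = 0
            continue
        fwd = ((fwd << 2) | code) & mask           # append base on the right
        rev = (rev >> 2) | ((3 - code) << shift)   # prepend its complement
        filled += 1
        if filled >= k:
            counts[min(fwd, rev)] += 1
    return counts

def decode(value, k):
    """Turn a packed k-mer back into its letter string."""
    return "".join("ACGT"[(value >> (2 * i)) & 3] for i in range(k - 1, -1, -1))

if __name__ == "__main__":
    for value, n in count_canonical("ACGTNACGT", 2).items():
        print(decode(value, 2), n)
```

Both the forward value and the reverse complement are updated with a couple of shifts per base, so no string handling is needed inside the loop. Python integers are not limited to 64 bits, so the range check on k is only there to mirror the register limit.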

Sooraa · Post by **Sooraa** » Fri Jun 04, 2021 8:16 am

That would be too nice.

Sometimes it's just a matter of being aware of what drives the way we do something:

The success of Kmer processing suffers from the read errors that come out of the analog data capture process. So the source data are already "fuzzy". You find one and the same sequence artificially cut off or attached to the wrong substrings. The consequences are wrong Kmer sequences and false sequence counting/matching signals.

This problem will stay, and it adds to the lack of "fuzzy" Kmer counters. I haven't seen a Kmer counter that offers a solution for that issue. As mentioned earlier, I will try to adopt "similarity" or "error tolerance" techniques such as edit distances (Levenshtein, Damerau etc.). I even have a one-line (non-matrix) Levenshtein routine that avoids two-dimensional memory growth and is quite fast.
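The routine mentioned above was not posted, so the following is only a sketch of the general single-row technique it appears to refer to: the Levenshtein distance computed with one row of the table instead of the full two-dimensional matrix. Names are made up for illustration:

```python
def levenshtein(a, b):
    """Edit distance between a and b using a single row of the DP table.

    Memory is O(min(len(a), len(b))) instead of a full matrix.
    """
    if len(a) < len(b):
        a, b = b, a                      # keep the row as short as possible
    row = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        diag = row[0]                    # value of the cell up and to the left
        row[0] = i
        for j, cb in enumerate(b, 1):
            above = row[j]               # value of the cell directly above
            row[j] = min(above + 1,      # deletion
                         row[j - 1] + 1, # insertion
                         diag + (ca != cb))  # match or substitution
            diag = above
    return row[-1]

if __name__ == "__main__":
    print(levenshtein("ACGTACGT", "ACGTTCGT"))
    print(levenshtein("ACGT", "AGT"))
```

For Kmers of length up to 31 the row has at most 32 entries, so a check like this could run as a postprocessing step over candidate pairs without any large allocation.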

To make a long story short: your work should not include this, but maybe you will find a processor flag, a register or a structure field that keeps the interface open for error-tolerant computing by a subroutine or postprocessing.
