Page 5 of 9
Re: A small procedure asm
Posted: Tue May 04, 2021 9:57 am
by Sooraa
Idle,
wouldn't you typically do an analysis over a range of k rather than a fixed k
I want to follow your thoughts exactly: What sample analysis do you want or mean?
- An expansion of my small handcrafted example to more k-mers? That would be by hand and quite error prone for a set of meaningful results, or
- A sample output of k-mer statistics of real DNA-data, f.e. the small 2phi-X174 genome? That would require the attachment of the result-file, which is not big, but larger as to paste it into a post.
Please let me know.
P.S. I saw your Squint approach before wilbert's ASM, but was more "electrified" by the ASM-features of wilbert version then. See the exploding numbers described in
https://bioinfologics.github.io/post/20 ... roduction/
Re: A small procedure asm
Posted: Tue May 04, 2021 11:23 am
by idle
Sooraa wrote: ↑Tue May 04, 2021 9:57 am
Idle,
wouldn't you typically do an analysis over a range of k rather than a fixed k
I want to follow your thoughts exactly: What sample analysis do you want or mean?
- An expansion of my small handcrafted example to more k-mers? That would be by hand and quite error prone for a set of meaningful results, or
- A sample output of k-mer statistics of real DNA-data, f.e. the small 2phi-X174 genome? That would require the attachment of the result-file, which is not big, but larger as to paste it into a post.
Please let me know.
P.S. I saw your Squint approach before wilbert's ASM, but was more "electrified" by the ASM-features of wilbert version then. See the exploding numbers described in
https://bioinfologics.github.io/post/20 ... roduction/
I expect the results of doing a count of k=2 vs a count of k=3 would be quite different, so wouldn't you do a range, it's not just the total count that's significant.
if you can show what you want that would help a lot.
Re: A small procedure asm
Posted: Tue May 04, 2021 1:41 pm
by Sooraa
Idle,
thanks for your attention.
For this little example I used the first 75 bp of the real phiX174 bacteriophage genome data:
"GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAA"
As an introduction this creates the following 1-1 k-mer w/o canonical reverse (the board-Tabs makes it harder to read):
KmerLen: 1-1, 1-1-Frequency: 4, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques
k-mer totals distin. unique DNAsearchSeq DNAIndex
1-0 26 1 0 T 72
1 gbt
1-1 23 1 0 A 74
1 gbt
1-2 14 1 0 G 70
1 gbt
1-3 12 1 0 C 67
1 gbt
66 1 65472 66 Map-Adds+Finds, distincs, uniques, K-len-Sum
4 4 71 75 # of totals, # of distincts, # of uniques, # of K-len-Sums
You asked for a 2-3 k-mer, same 75 bp:
KmerLen: 2-3, 2-3Frequency: 57, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques
k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 11 1 0 TT 68
1 gbt
2-1 8 1 0 AT 71
2-2 8 1 0 GA 70
2 gbt
2-3 7 1 0 AA 73
1 gbt
2-4 6 1 0 TC 66
1 gbt
2-5 5 1 0 TA 72
1 gbt
3-6 4 1 0 TGA 69
2-7 4 1 0 TG 69
3-8 4 1 0 TTT 44
2-9 4 1 0 CT 67
2-10 4 1 0 CG 56
2-11 4 1 0 AG 53
6 gbt
2-12 3 1 0 CA 32
3-13 3 1 0 TCG 55
3-14 3 1 0 AGT 53
3-15 3 1 0 TTA 63
2-16 3 1 0 AC 33
3-17 3 1 0 AAA 60
3-18 3 1 0 CTT 67
3-19 3 1 0 TTC 45
3-20 3 1 0 TAT 64
3-21 3 1 0 GAT 70
2-22 3 1 0 GT 54
11 gbt
3-23 2 1 0 GTT 27
3-24 2 1 0 ATG 50
3-25 2 1 0 ATT 62
3-26 2 1 0 GAG 52
3-27 2 1 0 ATC 65
3-28 2 1 0 TAA 72
3-29 2 1 0 GAA 57
2-30 2 1 0 GC 21
3-31 2 1 0 ATA 71
3-32 2 1 0 CGC 20
3-33 2 1 0 TCT 66
11 gbt
34 gbu
2-34 1 1 1 GG 39
3-35 1 1 1 GGA 39
2-36 1 1 1 CC 14
3-37 1 1 1 GTC 54
3-38 1 1 1 CAG 22
3-39 1 1 1 CAC 32
3-40 1 1 1 TTG 68
3-41 1 1 1 ACT 33
3-42 1 1 1 TCC 13
3-43 1 1 1 CAT 15
3-44 1 1 1 CGA 56
3-45 1 1 1 CCA 14
3-46 1 1 1 CGG 38
3-47 1 1 1 CTG 47
3-48 1 1 1 AAT 61
3-49 1 1 1 GCA 21
3-50 1 1 1 ACA 31
3-51 1 1 1 GAC 18
3-52 1 1 1 GCT 10
3-53 1 1 1 ACG 19
3-54 1 1 1 AGA 23
3-55 1 1 1 AAG 25
3-56 1 1 1 AAC 30
23 gbt
23 gbu
5 1 65533 15 Map-Adds+Finds, distincs, uniques, K-len-Sum
57 57 90 147 # of totals, # of distincts, # of uniques, # of K-len-Sums
I would be happy, if this works through this board-method. For your convenience, I send the 1to2 and 2to2 versions on next post
Re: A small procedure asm
Posted: Tue May 04, 2021 1:51 pm
by Sooraa
@Idle
Input: first 75 bp of phi-X174
KmerLen: 2-2, 2-2-Frequency: 16, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques
k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 11 1 0 TT 68
1 gbt
2-1 8 1 0 GA 70
2-2 8 1 0 AT 71
2 gbt
2-3 7 1 0 AA 73
1 gbt
2-4 6 1 0 TC 66
1 gbt
2-5 5 1 0 TA 72
1 gbt
2-6 4 1 0 TG 69
2-7 4 1 0 AG 53
2-8 4 1 0 CT 67
2-9 4 1 0 CG 56
4 gbt
2-10 3 1 0 AC 33
2-11 3 1 0 GT 54
2-12 3 1 0 CA 32
3 gbt
2-13 2 1 0 GC 21
1 gbt
14 gbu
2-14 1 1 1 CC 14
2-15 1 1 1 GG 39
2 gbt
2 gbu
1 65503 70 distincts, uniques, K-len-Sum
16 16 58 74 # of totals, # of distincts, # of uniques, # of K-len-Sums
Input: first 75 bp of phi-X174
KmerLen: 1-2, mapFinds: 129, mapAdds: 20, mapSize: 20, 1-2-Frequency: 20, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques
k-mer totals distin. unique DNAsearchSeq DNAIndex
1-0 26 1 0 T 72
1 gbt
1-1 23 1 0 A 74
1 gbt
1-2 14 1 0 G 70
1 gbt
1-3 12 1 0 C 67
1 gbt
2-4 11 1 0 TT 68
1 gbt
2-5 8 1 0 GA 70
2-6 8 1 0 AT 71
2 gbt
2-7 7 1 0 AA 73
1 gbt
2-8 6 1 0 TC 66
1 gbt
2-9 5 1 0 TA 72
1 gbt
2-10 4 1 0 TG 69
2-11 4 1 0 AG 53
2-12 4 1 0 CT 67
2-13 4 1 0 CG 56
4 gbt
2-14 3 1 0 AC 33
2-15 3 1 0 CA 32
2-16 3 1 0 GT 54
3 gbt
2-17 2 1 0 GC 21
1 gbt
18 gbu
2-18 1 1 1 CC 14
2-19 1 1 1 GG 39
2 gbt
2 gbu
1 65503 70 distincts, uniques, K-len-Sum
20 20 129 149 # of totals, # of distincts, # of uniques, # of K-len-Sums
Re: A small procedure asm
Posted: Wed May 05, 2021 1:04 am
by idle
Thanks for the explanation.
A quick hack of the find string to do k-mers style find it doesn't account for reverse or canonical.
output for the string "AGCTTTTCATTCTGACTGCAA" by "CT"
3 Tokens found
item, count and the positions of the items
CT 3
4: CT 22: CT 30: CT
replaced found strings with ---
AG--TTTCATT--GA--GCAA
see for the Squint module
viewtopic.php?f=12&t=75783
Code: Select all
IncludeFile "Squint2.pbi"
UseModule SQUINT
EnableExplicit
Structure Item
key.s
count.l
len.l
List positions.i()
EndStructure
Structure FindStrings
List *item.item()
EndStructure
Global FindStringsItems.FindStrings
Procedure Kmers(*squint.squint,*source,*keys,k=2,*items.FindStrings=0)
Protected *inp.Character,*inp1.Character,*node.squint_node,*sp
Protected c,count,key.s,*item.item
If *source
*inp = *source
*sp = *source
While *inp\c <> 0
c=1
*inp1 = *inp
While (*inp1\c <> 0 And c <= k)
*inp1+SizeOf(Character)
c+1
Wend
key = PeekS(*source,(*inp1-*source)/SizeOf(Character))
*node = Squintset(*squint,@key,0)
If Not *node\value
*item = AllocateStructure(item)
AddElement(*item\positions())
*item\positions() = *source - *sp
*item\count = 1
*item\key = key
*item\len = (*inp1-*source)/SizeOf(Character)
*node\value = *item
Else
*item = *node\value
AddElement(*item\positions())
*item\positions() = *source - *sp
*item\len = (*inp1-*source)/SizeOf(Character)
*item\count + 1
EndIf
If *inp1\c <> 0
*inp+2
*source = *inp
Else
Break
EndIf
Wend
EndIf
*inp = *keys
While *inp\c <> 0
c=1
*inp1 = *inp
While (*inp1\c <> 0 And c <= k)
*inp1+SizeOf(Character)
c+1
Wend
key = PeekS(*keys,(*inp1-*keys)/SizeOf(Character))
*item = Squintget(*squint,@key)
If *item
count + *item\count
If *items
If *item\count >= 1
AddElement(*items\item())
*items\item() = *item
EndIf
EndIf
EndIf
If *inp1\c <> 0
*inp+2
*keys = *inp
Else
Break
EndIf
Wend
ProcedureReturn count
EndProcedure
Procedure cbFindStringsFree(*key,*value,*Data)
FreeStructure(*value)
EndProcedure
Procedure cbFindStringsEnum(*key,*value,*items.FindStrings)
AddElement(*items\item())
*items\item() = *value
Debug PeekS(*key,-1,#PB_UTF8)
EndProcedure
Procedure FindStringsFree(*squint.squint)
SquintWalk(*squint,@cbFindStringsFree())
SquintFree(*squint)
EndProcedure
Procedure FindStringsEnum(*mt.squint,key.s,*items.FindStrings)
ClearList(*items\item())
SquintEnum(*mt,@key,@cbFindStringsEnum(),*items)
EndProcedure
Global String3.s = "AGCTTTTCATTCTGACTGCAA"
Global String4.s = "CT"
Global out.s
Global FindStringsItems.FindStrings
Global *squint.squint = SquintNew()
Global Replace.s = "-----------------------------------------------------------"
Debug Str(Kmers(*squint,@String3,@String4,2,@FindStringsItems)) + " k-mers found"
Debug "item, count and it position and the item "
ForEach FindStringsItems\item()
out=""
Debug FindStringsItems\item()\key + " " + Str(FindStringsItems\item()\count)
ForEach FindStringsItems\item()\positions()
out + Str(FindStringsItems\item()\positions()) + ": " + PeekS(@string3 + FindStringsItems\item()\positions(),FindStringsItems\item()\len) + " "
CopyMemory(@Replace,@string3+FindStringsItems\item()\positions(),FindStringsItems\item()\len*SizeOf(Character)) ;replace found items
Next
Debug out
Next
Debug "replaced found strings with --"
Debug string3
Debug "Enum from C in callback"
FindStringsEnum(*squint,"C",@FindStringsItems)
Debug "foreach after Enum"
ForEach FindStringsItems\item()
Debug FindStringsItems\item()\key
Next
FindStringsFree(*squint)
Re: A small procedure asm
Posted: Wed May 05, 2021 2:36 am
by Sooraa
@Idle,
I get an invalid memory access on line 45
"AddElement(*item\positions())"
when debugger is on.
Same on PB5.40 & PB5.72 under WIN7/64
It's a failure with a list, so it should be easier to find the problem myself. I will look after it tomorrow.
Re: A small procedure asm
Posted: Wed May 05, 2021 2:58 am
by idle
Sooraa wrote: ↑Wed May 05, 2021 2:36 am
@Idle,
I get an invalid memory access on line 45
"AddElement(*item\positions())"
when debugger is on.
Same on PB5.40 & PB5.72 under WIN7/64
It's a failure with a list, so it should be easier to find the problem myself. I will look after it tomorrow.
No idea why its not working on win 10 x64 it works fine on linux x64 and Win x86 I don't have a Win 10 x64 to test on.
are you sure you have the most recent Squint.
try this
viewtopic.php?p=569665#p569665
Re: A small procedure asm
Posted: Wed May 05, 2021 6:59 am
by Sooraa
@Idle,
goood mooooorniiiing.
As I said, to double check my side was my first thing this morning. After installing Squint 2.0.5, the ghost was gone.
Will scrutinize this afternoon.
Re: A small procedure asm
Posted: Wed May 05, 2021 3:14 pm
by wilbert
Sooraa wrote: ↑Mon May 03, 2021 10:40 am
@Wilbert:
In BioInformatics a method called "k-mer counting" is quite popular and helpful in DNA analyses.
I'm glad Idle already replied.
I'm not familiar with the matter.
I was wondering if a rolling hash might work for this k-mer counting.
Using 2 bits values for A C G T might also work.
In that case a 32 bit compare instruction could compare 16 characters at once.
Re: A small procedure asm
Posted: Wed May 05, 2021 4:18 pm
by Sooraa
Idle,
I put your 21-sample into my logic. As you can see, "CT" is found 3 times as k-mer "2-1" there. Your display
"AG--TTTCATT--GA--GCAA" simulates exactly. The only difference are the positions for "DNAIndex" 4 as 2 ... 30 as 15. DNA files are Ascii, perhaps you grab the number as Unicode-space.
KmerLen: 2-2, 2-2-Frequency: 11, file-size/nucleotids: 21
Legend: gbt = group by totals, gbu = group by uniques
k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 4 1 0 TT 9
1 gbt
2-1 3 1 0 CT 15
1 gbt
2-2 2 1 0 TC 10
2-3 2 1 0 CA 18
2-4 2 1 0 TG 16
2-5 2 1 0 GC 17
4 gbt
6 gbu
2-6 1 1 1 AA 19
2-7 1 1 1 AT 8
2-8 1 1 1 AC 14
2-9 1 1 1 GA 13
2-10 1 1 1 AG 0
5 gbt
5 gbu
1 1 2 distincs, uniques, K-len-Sum
11 11 9 20 # of totals, # of distincts, # of uniques, # of K-len-Sums
Now the interesting step is to come: step one bp to the right and repeat the process until we have all positions evaluated to group the k-mers "2-0" to "2-10" like above.... Dont' care about distincts and uniques, this I could wave in on my own.
Re: A small procedure asm
Posted: Wed May 05, 2021 4:36 pm
by Sooraa
@Wilbert,
you would'nt need to jumpt into the k-mer matter.
The only wish I have is to build in a trigger on a single word in String1 which then issues a trigger for String2.
Like the sample of Idle: "CT" as string1 has a lenght of 2. Then a #LAST_SEPARATION_CHAR as the separation trigger in string2 should be raised.
Even the repeating loop i could try to integrate, if you allow.
Your comment i.r. to 2 bit values for ACGT is right. A register compare of 32 brings additional logic for the flexibility of len 1,2,3 ... 5,6,7 etc.
Re: A small procedure asm
Posted: Wed May 05, 2021 11:20 pm
by idle
Sooraa wrote: ↑Wed May 05, 2021 4:18 pm
Idle,
I put your 21-sample into my logic. As you can see, "CT" is found 3 times as k-mer "2-1" there. Your display
"AG--TTTCATT--GA--GCAA" simulates exactly. The only difference are the positions for "DNAIndex" 4 as 2 ... 30 as 15. DNA files are Ascii, perhaps you grab the number as Unicode-space.
KmerLen: 2-2, 2-2-Frequency: 11, file-size/nucleotids: 21
Legend: gbt = group by totals, gbu = group by uniques
k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 4 1 0 TT 9
1 gbt
2-1 3 1 0 CT 15
1 gbt
2-2 2 1 0 TC 10
2-3 2 1 0 CA 18
2-4 2 1 0 TG 16
2-5 2 1 0 GC 17
4 gbt
6 gbu
2-6 1 1 1 AA 19
2-7 1 1 1 AT 8
2-8 1 1 1 AC 14
2-9 1 1 1 GA 13
2-10 1 1 1 AG 0
5 gbt
5 gbu
1 1 2 distincs, uniques, K-len-Sum
11 11 9 20 # of totals, # of distincts, # of uniques, # of K-len-Sums
Now the interesting step is to come: step one bp to the right and repeat the process until we have all positions evaluated to group the k-mers "2-0" to "2-10" like above.... Dont' care about distincts and uniques, this I could wave in on my own.
Sorry I'm not clear on what you want doing. A test of a 10000 characters sequence, k=5 and we're interested in finding chunks of a sequence AATGAGCG when k=5 the k-mers of interest are AATGA ATGAG TGAGC GAGCG
Is this what you expect it to do?
don't worry about the positions at the moment. just worry about the processing
AATACGCCAAGTCCATGCTATCCAGTGAATACGAGGCGCTAGATTTACAAGAACATACGAACTGAAATAAAGTTTTGCTAGCTTAGGGAGGGGTCCAATG...
AATGAGCG
42 k-mers found in sequence
item count positions
AATGA 12 2142: 3746: 3754: 5474: 6148: 11476: 12712: 13192: ...
ATGAG 14 3578: 4128: 5042: 5164: 6968: 8704 : 11322: 12566: ...
histogram of the sequence k=5
AAAAA 14
AAAAC 13
AAAAG 11
AAAAT 14
AAACA 7
...
TTTGT 10
TTTTA 6
TTTTC 8
TTTTG 11
TTTTT 10
Re: A small procedure asm
Posted: Thu May 06, 2021 2:35 pm
by Sooraa
@Idle:
K-mer counting is to find medium sized (up to 110) nucleotide-sequences in a genome file (ATGC)
Question is not to find the needle in a haystack. In this case you would know the needle.
No, here we are searching for unknown prominent sequences, so called k-mer sequences. (where "k" is the length of the nucleotide-sequence under test (NUC)).
To generate any and all potential NUCs in a sequence-file we start with counting 1-mers and group the totals in ths case for A, T, C, G. (One could
assume that 1-mer counting is worthless, but in BioChemestry even the Stats for 1-mers are scrutinized between genomes)
In our case the programs collect the single k-mer results (starting at 1-mer to 31-mer in the most program cases(fits into an 64-bit-reg)). So my current PB-map-based k-mer counting program can be started with k-value of 1 and loops over the file to 21 f.e., depending on the genome-sequence. The counts are grouped by k and the length of generated and found sequences.
Already in the case of the mini-genome phi-X174 with a length of 5378 the counts show that, what they are generated for: f.e. the 9-mer
"AAAGAGATT" appears 4 times and the 5-mer "ACTCT" also just only 4 times, although the so called k-universes are 4^9 = 262.144 and 4^5 = 1.024.
The counts of the NUCs grow quadraticly and consume process time and memory: .... 21mer alone is 4.398.046.511.104, 22mer adds
another 17.592.186.044.416 to 21mer .... 31mer is 4.611.686.018.427.387.900.
So I want to see at the end the behavior of a trie (in your case nibble) in regards to size and primarily the times for the trie-adds against what I have so far (PB-map). This can be achieved by iterating over a string with 1-mer, 2-mer ...... to minimum 27-mer. The grouping of the results, the unique and distinct NUCs I will do then.
Re: A small procedure asm
Posted: Thu May 06, 2021 5:11 pm
by wilbert
Sooraa wrote: ↑Thu May 06, 2021 2:35 pmK-mer counting is to find medium sized (up to 110) nucleotide-sequences in a genome file (ATGC)
Question is not to find the needle in a haystack. In this case you would know the needle.
It's starting to make sense to me now
A few questions ...
I have no idea how long a genome file can be. What's the maximum count for a 1-mer ?
Do you read the entire genome file to system memory or process it from disk ?
Is it required to verify the input data or can you trust the input only to contain the A,T,G and C characters ?
Re: A small procedure asm
Posted: Thu May 06, 2021 7:25 pm
by Sooraa
@Wilbert:
What's the maximum count for a 1-mer ? --> The Human Genome in the latest, but still not perfect content file: GRCh38.p2 with: 3,21 GB Ascii
Under
http://hgdownload.soe.ucsc.edu/goldenPa ... ps/latest/ one can find "hg38.2bit" with a size of 810 MB 2-bit for 3,21 GB Ascii
Entire genome file to system memory or process it from disk ? --> Up to now per "ReadData" into memory. It's a matter of results of this
fact-finding-evaluation. I can count single 128-mers on responsible sized genomes, sort & group them and write the product to disk.
Verify the input data ? --> No, at the stage of k-mer counting the files are reliable. The earlier stages like so called "reads" have plenty of
genetic-errors (but already A,C,T,G only) and redundancies.
For your interest:
https://gatk.broadinstitute.org/hc/en-u ... 8-b37-hg19
http://genome.ucsc.edu/cgi-bin/hgGatewa ... an&db=hg38
http://genome-euro.ucsc.edu/cgi-bin/hgG ... e.ucsc.edu
To my knowledge the longest sequence found is 141