PureBasic Forums - English

Posted: **Tue May 04, 2021 9:57 am**

Idle,

wouldn't you typically do an analysis over a range of k rather than a fixed k

I want to follow your thoughts exactly: What sample analysis do you want or mean?
- An expansion of my small handcrafted example to more k-mers? That would be by hand and quite error prone for a set of meaningful results, or
- A sample output of k-mer statistics of real DNA-data, f.e. the small 2phi-X174 genome? That would require the attachment of the result-file, which is not big, but larger as to paste it into a post.

Please let me know.

P.S. I saw your Squint approach before wilbert's ASM, but was more "electrified" by the ASM-features of wilbert version then. See the exploding numbers described in https://bioinfologics.github.io/post/20 ... roduction/

Posted: **Tue May 04, 2021 11:23 am**

Sooraa wrote: ↑Tue May 04, 2021 9:57 am Idle,
wouldn't you typically do an analysis over a range of k rather than a fixed k
I want to follow your thoughts exactly: What sample analysis do you want or mean?
- An expansion of my small handcrafted example to more k-mers? That would be by hand and quite error prone for a set of meaningful results, or
- A sample output of k-mer statistics of real DNA-data, f.e. the small 2phi-X174 genome? That would require the attachment of the result-file, which is not big, but larger as to paste it into a post.

Please let me know.

P.S. I saw your Squint approach before wilbert's ASM, but was more "electrified" by the ASM-features of wilbert version then. See the exploding numbers described in https://bioinfologics.github.io/post/20 ... roduction/

I expect the results of doing a count of k=2 vs a count of k=3 would be quite different, so wouldn't you do a range, it's not just the total count that's significant.

if you can show what you want that would help a lot.

Posted: **Tue May 04, 2021 1:41 pm**

Idle,
thanks for your attention.

For this little example I used the first 75 bp of the real phiX174 bacteriophage genome data:
"GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAA"

As an introduction this creates the following 1-1 k-mer w/o canonical reverse (the board-Tabs makes it harder to read):
KmerLen: 1-1, 1-1-Frequency: 4, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques

k-mer totals distin. unique DNAsearchSeq DNAIndex
1-0 26 1 0 T 72
1 gbt
1-1 23 1 0 A 74
1 gbt
1-2 14 1 0 G 70
1 gbt
1-3 12 1 0 C 67
1 gbt
66 1 65472 66 Map-Adds+Finds, distincs, uniques, K-len-Sum
4 4 71 75 # of totals, # of distincts, # of uniques, # of K-len-Sums

You asked for a 2-3 k-mer, same 75 bp:
KmerLen: 2-3, 2-3Frequency: 57, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques

k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 11 1 0 TT 68
1 gbt
2-1 8 1 0 AT 71
2-2 8 1 0 GA 70
2 gbt
2-3 7 1 0 AA 73
1 gbt
2-4 6 1 0 TC 66
1 gbt
2-5 5 1 0 TA 72
1 gbt
3-6 4 1 0 TGA 69
2-7 4 1 0 TG 69
3-8 4 1 0 TTT 44
2-9 4 1 0 CT 67
2-10 4 1 0 CG 56
2-11 4 1 0 AG 53
6 gbt
2-12 3 1 0 CA 32
3-13 3 1 0 TCG 55
3-14 3 1 0 AGT 53
3-15 3 1 0 TTA 63
2-16 3 1 0 AC 33
3-17 3 1 0 AAA 60
3-18 3 1 0 CTT 67
3-19 3 1 0 TTC 45
3-20 3 1 0 TAT 64
3-21 3 1 0 GAT 70
2-22 3 1 0 GT 54
11 gbt
3-23 2 1 0 GTT 27
3-24 2 1 0 ATG 50
3-25 2 1 0 ATT 62
3-26 2 1 0 GAG 52
3-27 2 1 0 ATC 65
3-28 2 1 0 TAA 72
3-29 2 1 0 GAA 57
2-30 2 1 0 GC 21
3-31 2 1 0 ATA 71
3-32 2 1 0 CGC 20
3-33 2 1 0 TCT 66
11 gbt
34 gbu
2-34 1 1 1 GG 39
3-35 1 1 1 GGA 39
2-36 1 1 1 CC 14
3-37 1 1 1 GTC 54
3-38 1 1 1 CAG 22
3-39 1 1 1 CAC 32
3-40 1 1 1 TTG 68
3-41 1 1 1 ACT 33
3-42 1 1 1 TCC 13
3-43 1 1 1 CAT 15
3-44 1 1 1 CGA 56
3-45 1 1 1 CCA 14
3-46 1 1 1 CGG 38
3-47 1 1 1 CTG 47
3-48 1 1 1 AAT 61
3-49 1 1 1 GCA 21
3-50 1 1 1 ACA 31
3-51 1 1 1 GAC 18
3-52 1 1 1 GCT 10
3-53 1 1 1 ACG 19
3-54 1 1 1 AGA 23
3-55 1 1 1 AAG 25
3-56 1 1 1 AAC 30
23 gbt
23 gbu
5 1 65533 15 Map-Adds+Finds, distincs, uniques, K-len-Sum
57 57 90 147 # of totals, # of distincts, # of uniques, # of K-len-Sums

I would be happy, if this works through this board-method. For your convenience, I send the 1to2 and 2to2 versions on next post

Posted: **Tue May 04, 2021 1:51 pm**

@Idle
Input: first 75 bp of phi-X174
KmerLen: 2-2, 2-2-Frequency: 16, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques

k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 11 1 0 TT 68
1 gbt
2-1 8 1 0 GA 70
2-2 8 1 0 AT 71
2 gbt
2-3 7 1 0 AA 73
1 gbt
2-4 6 1 0 TC 66
1 gbt
2-5 5 1 0 TA 72
1 gbt
2-6 4 1 0 TG 69
2-7 4 1 0 AG 53
2-8 4 1 0 CT 67
2-9 4 1 0 CG 56
4 gbt
2-10 3 1 0 AC 33
2-11 3 1 0 GT 54
2-12 3 1 0 CA 32
3 gbt
2-13 2 1 0 GC 21
1 gbt
14 gbu
2-14 1 1 1 CC 14
2-15 1 1 1 GG 39
2 gbt
2 gbu
1 65503 70 distincts, uniques, K-len-Sum
16 16 58 74 # of totals, # of distincts, # of uniques, # of K-len-Sums

Input: first 75 bp of phi-X174
KmerLen: 1-2, mapFinds: 129, mapAdds: 20, mapSize: 20, 1-2-Frequency: 20, file-size/nucleotids: 75
Legend: gbt = group by totals, gbu = group by uniques

k-mer totals distin. unique DNAsearchSeq DNAIndex
1-0 26 1 0 T 72
1 gbt
1-1 23 1 0 A 74
1 gbt
1-2 14 1 0 G 70
1 gbt
1-3 12 1 0 C 67
1 gbt
2-4 11 1 0 TT 68
1 gbt
2-5 8 1 0 GA 70
2-6 8 1 0 AT 71
2 gbt
2-7 7 1 0 AA 73
1 gbt
2-8 6 1 0 TC 66
1 gbt
2-9 5 1 0 TA 72
1 gbt
2-10 4 1 0 TG 69
2-11 4 1 0 AG 53
2-12 4 1 0 CT 67
2-13 4 1 0 CG 56
4 gbt
2-14 3 1 0 AC 33
2-15 3 1 0 CA 32
2-16 3 1 0 GT 54
3 gbt
2-17 2 1 0 GC 21
1 gbt
18 gbu
2-18 1 1 1 CC 14
2-19 1 1 1 GG 39
2 gbt
2 gbu
1 65503 70 distincts, uniques, K-len-Sum
20 20 129 149 # of totals, # of distincts, # of uniques, # of K-len-Sums

Posted: **Wed May 05, 2021 1:04 am**

Thanks for the explanation.

A quick hack of the find string to do k-mers style find it doesn't account for reverse or canonical.
output for the string "AGCTTTTCATTCTGACTGCAA" by "CT"

3 Tokens found
item, count and the positions of the items
CT 3
4: CT 22: CT 30: CT
replaced found strings with ---
AG--TTTCATT--GA--GCAA

see for the Squint module
viewtopic.php?f=12&t=75783

Code: Select all

IncludeFile "Squint2.pbi"

UseModule SQUINT 
EnableExplicit 

Structure Item 
  key.s 
  count.l
  len.l
  List positions.i()
EndStructure 

Structure FindStrings 
  List *item.item() 
EndStructure

Global FindStringsItems.FindStrings 
 
Procedure Kmers(*squint.squint,*source,*keys,k=2,*items.FindStrings=0) 
  Protected *inp.Character,*inp1.Character,*node.squint_node,*sp    
  Protected c,count,key.s,*item.item    
  
  If *source 
    *inp = *source 
    *sp = *source 
    While *inp\c <> 0
      c=1
      *inp1 = *inp 
      While (*inp1\c <> 0 And c <= k)  
        *inp1+SizeOf(Character)
        c+1
      Wend 
      key = PeekS(*source,(*inp1-*source)/SizeOf(Character))
      *node = Squintset(*squint,@key,0) 
      If Not *node\value 
        *item =  AllocateStructure(item)
        AddElement(*item\positions()) 
        *item\positions() = *source - *sp
        *item\count = 1 
        *item\key = key 
        *item\len = (*inp1-*source)/SizeOf(Character)
        *node\value = *item
      Else   
        *item = *node\value 
        AddElement(*item\positions())
        *item\positions() = *source - *sp
        *item\len = (*inp1-*source)/SizeOf(Character)
        *item\count + 1 
      EndIf   
      If *inp1\c <> 0
        *inp+2
        *source = *inp 
      Else 
        Break 
      EndIf   
    Wend 
  EndIf 
  
  *inp = *keys 
  While *inp\c <> 0
    c=1
    *inp1 = *inp 
    While (*inp1\c <> 0 And c <= k)  
      *inp1+SizeOf(Character)
      c+1
    Wend 
    key = PeekS(*keys,(*inp1-*keys)/SizeOf(Character))
    *item = Squintget(*squint,@key)
    If *item
      count + *item\count  
      If *items
        If *item\count >= 1 
          AddElement(*items\item()) 
          *items\item() = *item 
        EndIf
      EndIf   
    EndIf   
    If *inp1\c <> 0
      *inp+2
      *keys = *inp 
    Else 
      Break 
    EndIf   
  Wend 
  
  ProcedureReturn count 
  
EndProcedure 


Procedure cbFindStringsFree(*key,*value,*Data) 
  FreeStructure(*value) 
EndProcedure   

Procedure cbFindStringsEnum(*key,*value,*items.FindStrings) 
  AddElement(*items\item()) 
  *items\item() = *value
  Debug PeekS(*key,-1,#PB_UTF8)
EndProcedure   

Procedure FindStringsFree(*squint.squint) 
  SquintWalk(*squint,@cbFindStringsFree()) 
  SquintFree(*squint) 
EndProcedure   

Procedure FindStringsEnum(*mt.squint,key.s,*items.FindStrings) 
  ClearList(*items\item())
  SquintEnum(*mt,@key,@cbFindStringsEnum(),*items) 
EndProcedure   

Global String3.s = "AGCTTTTCATTCTGACTGCAA"
Global String4.s = "CT" 

Global out.s
Global FindStringsItems.FindStrings 
Global *squint.squint = SquintNew() 
Global Replace.s = "-----------------------------------------------------------"

Debug Str(Kmers(*squint,@String3,@String4,2,@FindStringsItems)) + " k-mers found"

Debug "item, count and it position and the item " 

ForEach FindStringsItems\item() 
  out=""
  Debug FindStringsItems\item()\key + " " + Str(FindStringsItems\item()\count)
  ForEach FindStringsItems\item()\positions() 
    out + Str(FindStringsItems\item()\positions()) + ": " + PeekS(@string3 + FindStringsItems\item()\positions(),FindStringsItems\item()\len) + " " 
    CopyMemory(@Replace,@string3+FindStringsItems\item()\positions(),FindStringsItems\item()\len*SizeOf(Character)) ;replace found items
  Next 
  Debug out 
Next

Debug "replaced found strings with --"
Debug string3 

Debug "Enum from C in callback" 
 
FindStringsEnum(*squint,"C",@FindStringsItems) 

Debug "foreach after Enum" 

ForEach FindStringsItems\item() 
  Debug FindStringsItems\item()\key 
Next

FindStringsFree(*squint)

Posted: **Wed May 05, 2021 2:36 am**

@Idle,

I get an invalid memory access on line 45
"AddElement(*item\positions())"
when debugger is on.

Same on PB5.40 & PB5.72 under WIN7/64
It's a failure with a list, so it should be easier to find the problem myself. I will look after it tomorrow.

Posted: **Wed May 05, 2021 2:58 am**

Sooraa wrote: ↑Wed May 05, 2021 2:36 am @Idle,

I get an invalid memory access on line 45
"AddElement(*item\positions())"
when debugger is on.

Same on PB5.40 & PB5.72 under WIN7/64
It's a failure with a list, so it should be easier to find the problem myself. I will look after it tomorrow.

No idea why its not working on win 10 x64 it works fine on linux x64 and Win x86 I don't have a Win 10 x64 to test on.
are you sure you have the most recent Squint.

try this
viewtopic.php?p=569665#p569665

Posted: **Wed May 05, 2021 6:59 am**

@Idle,

goood mooooorniiiing.

As I said, to double check my side was my first thing this morning. After installing Squint 2.0.5, the ghost was gone.
Will scrutinize this afternoon.

Posted: **Wed May 05, 2021 3:14 pm**

Sooraa wrote: ↑Mon May 03, 2021 10:40 am @Wilbert:
In BioInformatics a method called "k-mer counting" is quite popular and helpful in DNA analyses.

I'm glad Idle already replied.

I'm not familiar with the matter.

I was wondering if a rolling hash might work for this k-mer counting.

Using 2 bits values for A C G T might also work.
In that case a 32 bit compare instruction could compare 16 characters at once.

Posted: **Wed May 05, 2021 4:18 pm**

Idle,

I put your 21-sample into my logic. As you can see, "CT" is found 3 times as k-mer "2-1" there. Your display
"AG--TTTCATT--GA--GCAA" simulates exactly. The only difference are the positions for "DNAIndex" 4 as 2 ... 30 as 15. DNA files are Ascii, perhaps you grab the number as Unicode-space.

KmerLen: 2-2, 2-2-Frequency: 11, file-size/nucleotids: 21
Legend: gbt = group by totals, gbu = group by uniques

k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 4 1 0 TT 9
1 gbt
2-1 3 1 0 CT 15
1 gbt
2-2 2 1 0 TC 10
2-3 2 1 0 CA 18
2-4 2 1 0 TG 16
2-5 2 1 0 GC 17
4 gbt
6 gbu
2-6 1 1 1 AA 19
2-7 1 1 1 AT 8
2-8 1 1 1 AC 14
2-9 1 1 1 GA 13
2-10 1 1 1 AG 0
5 gbt
5 gbu
1 1 2 distincs, uniques, K-len-Sum
11 11 9 20 # of totals, # of distincts, # of uniques, # of K-len-Sums

Now the interesting step is to come: step one bp to the right and repeat the process until we have all positions evaluated to group the k-mers "2-0" to "2-10" like above.... Dont' care about distincts and uniques, this I could wave in on my own.

Posted: **Wed May 05, 2021 4:36 pm**

@Wilbert,

you would'nt need to jumpt into the k-mer matter.
The only wish I have is to build in a trigger on a single word in String1 which then issues a trigger for String2.

Like the sample of Idle: "CT" as string1 has a lenght of 2. Then a #LAST_SEPARATION_CHAR as the separation trigger in string2 should be raised.

Even the repeating loop i could try to integrate, if you allow.

Your comment i.r. to 2 bit values for ACGT is right. A register compare of 32 brings additional logic for the flexibility of len 1,2,3 ... 5,6,7 etc.

Posted: **Wed May 05, 2021 11:20 pm**

Sooraa wrote: ↑Wed May 05, 2021 4:18 pm Idle,

I put your 21-sample into my logic. As you can see, "CT" is found 3 times as k-mer "2-1" there. Your display
"AG--TTTCATT--GA--GCAA" simulates exactly. The only difference are the positions for "DNAIndex" 4 as 2 ... 30 as 15. DNA files are Ascii, perhaps you grab the number as Unicode-space.

KmerLen: 2-2, 2-2-Frequency: 11, file-size/nucleotids: 21
Legend: gbt = group by totals, gbu = group by uniques

k-mer totals distin. unique DNAsearchSeq DNAIndex
2-0 4 1 0 TT 9
1 gbt
2-1 3 1 0 CT 15
1 gbt
2-2 2 1 0 TC 10
2-3 2 1 0 CA 18
2-4 2 1 0 TG 16
2-5 2 1 0 GC 17
4 gbt
6 gbu
2-6 1 1 1 AA 19
2-7 1 1 1 AT 8
2-8 1 1 1 AC 14
2-9 1 1 1 GA 13
2-10 1 1 1 AG 0
5 gbt
5 gbu
1 1 2 distincs, uniques, K-len-Sum
11 11 9 20 # of totals, # of distincts, # of uniques, # of K-len-Sums

Now the interesting step is to come: step one bp to the right and repeat the process until we have all positions evaluated to group the k-mers "2-0" to "2-10" like above.... Dont' care about distincts and uniques, this I could wave in on my own.

Sorry I'm not clear on what you want doing. A test of a 10000 characters sequence, k=5 and we're interested in finding chunks of a sequence AATGAGCG when k=5 the k-mers of interest are AATGA ATGAG TGAGC GAGCG
Is this what you expect it to do?

don't worry about the positions at the moment. just worry about the processing

AATACGCCAAGTCCATGCTATCCAGTGAATACGAGGCGCTAGATTTACAAGAACATACGAACTGAAATAAAGTTTTGCTAGCTTAGGGAGGGGTCCAATG...
AATGAGCG
42 k-mers found in sequence
item count positions
AATGA 12 2142: 3746: 3754: 5474: 6148: 11476: 12712: 13192: ...
ATGAG 14 3578: 4128: 5042: 5164: 6968: 8704 : 11322: 12566: ...

histogram of the sequence k=5
AAAAA 14
AAAAC 13
AAAAG 11
AAAAT 14
AAACA 7
...
TTTGT 10
TTTTA 6
TTTTC 8
TTTTG 11
TTTTT 10

Posted: **Thu May 06, 2021 2:35 pm**

@Idle:
K-mer counting is to find medium sized (up to 110) nucleotide-sequences in a genome file (ATGC)
Question is not to find the needle in a haystack. In this case you would know the needle.

No, here we are searching for unknown prominent sequences, so called k-mer sequences. (where "k" is the length of the nucleotide-sequence under test (NUC)).
To generate any and all potential NUCs in a sequence-file we start with counting 1-mers and group the totals in ths case for A, T, C, G. (One could
assume that 1-mer counting is worthless, but in BioChemestry even the Stats for 1-mers are scrutinized between genomes)

In our case the programs collect the single k-mer results (starting at 1-mer to 31-mer in the most program cases(fits into an 64-bit-reg)). So my current PB-map-based k-mer counting program can be started with k-value of 1 and loops over the file to 21 f.e., depending on the genome-sequence. The counts are grouped by k and the length of generated and found sequences.

Already in the case of the mini-genome phi-X174 with a length of 5378 the counts show that, what they are generated for: f.e. the 9-mer
"AAAGAGATT" appears 4 times and the 5-mer "ACTCT" also just only 4 times, although the so called k-universes are 4^9 = 262.144 and 4^5 = 1.024.
The counts of the NUCs grow quadraticly and consume process time and memory: .... 21mer alone is 4.398.046.511.104, 22mer adds
another 17.592.186.044.416 to 21mer .... 31mer is 4.611.686.018.427.387.900.

So I want to see at the end the behavior of a trie (in your case nibble) in regards to size and primarily the times for the trie-adds against what I have so far (PB-map). This can be achieved by iterating over a string with 1-mer, 2-mer ...... to minimum 27-mer. The grouping of the results, the unique and distinct NUCs I will do then.

Posted: **Thu May 06, 2021 5:11 pm**

Sooraa wrote: ↑Thu May 06, 2021 2:35 pmK-mer counting is to find medium sized (up to 110) nucleotide-sequences in a genome file (ATGC)
Question is not to find the needle in a haystack. In this case you would know the needle.

It's starting to make sense to me now

A few questions ...
I have no idea how long a genome file can be. What's the maximum count for a 1-mer ?
Do you read the entire genome file to system memory or process it from disk ?
Is it required to verify the input data or can you trust the input only to contain the A,T,G and C characters ?

Posted: **Thu May 06, 2021 7:25 pm**

@Wilbert:
What's the maximum count for a 1-mer ? --> The Human Genome in the latest, but still not perfect content file: GRCh38.p2 with: 3,21 GB Ascii
Under http://hgdownload.soe.ucsc.edu/goldenPa ... ps/latest/ one can find "hg38.2bit" with a size of 810 MB 2-bit for 3,21 GB Ascii

Entire genome file to system memory or process it from disk ? --> Up to now per "ReadData" into memory. It's a matter of results of this
fact-finding-evaluation. I can count single 128-mers on responsible sized genomes, sort & group them and write the product to disk.

Verify the input data ? --> No, at the stage of k-mer counting the files are reliable. The earlier stages like so called "reads" have plenty of
genetic-errors (but already A,C,T,G only) and redundancies.

For your interest:
https://gatk.broadinstitute.org/hc/en-u ... 8-b37-hg19
http://genome.ucsc.edu/cgi-bin/hgGatewa ... an&db=hg38
http://genome-euro.ucsc.edu/cgi-bin/hgG ... e.ucsc.edu
To my knowledge the longest sequence found is 141

PureBasic Forums - English

A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm

Re: A small procedure asm