@Idle, @Wilbert,
It's good to hear that you are sticking with the problem.
Wilbert, to clarify things a bit more and to show where I am (pseudocode):
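Structure KmerFields                ; the two 32-bit halves
  DNAindex.l                        ; offset of the first occurrence in the input buffer
  K_Count.l                         ; number of occurrences seen so far
EndStructure

Structure KmerCombo                 ; it's 2*32 = 64 bits
  StructureUnion                    ; all union members share the same offset ...
    KmerQuad.q                      ; ... so the quad overlays both fields of F
    F.KmerFields
  EndStructureUnion
EndStructure

NewMap KmerMap.KmerCombo(15000000 * 2)
Define DNAsearchSeq.s

For i = 1 To 31
  For KmerLen = i To i
    For DNAindex = 0 To BytesRead - KmerLen ; (in the case of E. coli = 4.7 million)
      DNAsearchSeq = PeekS(*Buff + DNAindex, KmerLen, #PB_Ascii) ; PB strings are Unicode internally (since 5.50, 7/2016): Ascii ----> Unicode
      ; AddMapElement() replaces an existing element by default, which would reset the count,
      ; so only add when the key is new (alternative: #PB_Map_NoElementCheck, see notes below)
      If FindMapElement(KmerMap(), DNAsearchSeq) = 0
        AddMapElement(KmerMap(), DNAsearchSeq)
      EndIf
      KmerMap()\F\K_Count + 1
      If KmerMap()\F\K_Count = 1    ; first occurrence: remember where it was found
        KmerMap()\F\DNAindex = DNAindex
      EndIf
    Next DNAindex
  Next KmerLen
  ForEach KmerMap()
    ; append this k-mer to the output file ...
    ; ... WriteString / WriteData, regenerating DNAsearchSeq from the first (and only stored) DNAindex ...
  Next
  ; depending on experiments, sort via an array before ...
  ClearMap(KmerMap())
Next i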
- Grouping and sorting of the output are achieved this way.
- Memory use is bounded by a single KmerLen at a time; the map is cleared afterwards.
- The stored DNAindex is written only once per DNAsearchSeq.
- DNAsearchSeq, over the mini-alphabet (ACGT), implicitly serves as the key (hash) of KmerMap(); the element itself is just a .q integer.
- We only fetch the unique k-mers from the input file via their DNAindex...
- Storing ALL(!) DNAindexes would be overkill, merely nice to have. If needed, they could easily be generated by a dedicated FindString kind of routine,
or by "#PB_Map_NoElementCheck" (but this makes memory explode).
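
To make the "FindString kind of thing" from the last point a bit more concrete, here is a minimal sketch of how all positions of one k-mer could be recovered on demand. The procedure name ListKmerPositions and its parameters are made up for illustration, and it assumes the DNA is already available as a PB string rather than as the raw *Buff used above, so take it as one possible approach and not as the final routine:

```purebasic
; Hypothetical helper: collect every 0-based offset at which Kmer$ occurs in DNA$.
Procedure ListKmerPositions(DNA$, Kmer$, List Positions.i())
  Protected Pos.i = FindString(DNA$, Kmer$, 1)
  ClearList(Positions())
  While Pos > 0
    AddElement(Positions())
    Positions() = Pos - 1                   ; FindString is 1-based, DNAindex is 0-based
    Pos = FindString(DNA$, Kmer$, Pos + 1)  ; advance by one so overlapping hits are found too
  Wend
EndProcedure

NewList Hits.i()
ListKmerPositions("ACGTACGTAC", "ACG", Hits())
ForEach Hits()
  Debug Hits()                              ; 0, then 4
Next
```

This way only the first DNAindex has to live in the map, and the full position list is computed just for the k-mers that actually need it.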