A small procedure asm

Bare metal programming in PureBasic, for experienced users
Little John
Addict
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: A small procedure asm

Post by Little John »

fvillanova wrote:

Code: Select all

Procedure.d Combinations(n.i,k.i)
Protected calc.d  
  calc=Factorial(n)/((Factorial(n-k))*Factorial(k));
 ProcedureReturn calc
EndProcedure
This procedure calculates the so-called "Binomial Coefficient", i.e. the number of combinations or subsets (unordered samples without repetition) of k elements from a set of n elements.
Besides wilbert's suggestions, another way to improve the speed of this procedure is not to call the Factorial() function three times, but to do only the computations that are actually required. An example for doing so is the procedure Choose() in this thread.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Little John wrote:another way to improve the speed of this procedure is not to call the Factorial() function three times, but to do only the computations that are actually required. An example for doing so is the procedure Choose() in this thread.
That looks pretty nice :)

I could increase the speed of the Combinations procedure a lot by using asm and double precision variables but I don't know if that would be accurate enough.
But I noticed that a big contribution of why it is slow is because of all the string operations involved. An array or linked list might be a more logical way to store numbers.
Windows (x64)
Raspberry Pi OS (Arm64)
Little John
Addict
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: A small procedure asm

Post by Little John »

wilbert wrote:
Little John wrote:That looks pretty nice :)
Thanks. :-)

The following code is about twice as fast as the previously mentioned code (depending on the values of n and k).
However, in the following code an overflow will occur already for smaller values of n and k.

Code: Select all

Procedure.q Choose (n.i, k.i)
   ; -- Binomial Coefficient, "n choose k":
   Protected i.i, ret.q
   
   If k < 0 Or n < k Or n > 61
      ProcedureReturn -1           ; Illegal function call
   EndIf
   
   If k > n - k
      k = n - k                    ; This way k is maximally = Int(n/2).
   EndIf
   
   ret = 1
   For i = 1 To k
      ret * n / i
      n - 1
   Next
   
   ProcedureReturn ret
EndProcedure


Debug Choose(6, 4)   ; = 15
Last edited by Little John on Wed May 06, 2015 12:14 pm, edited 2 times in total.
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Little John wrote:The following code is about twice as fast as the previously mentioned code (depending on the values of n and k).
However, in the following code an overflow will occur already for smaller values of n and k.
Wow, I didn't know it was that simple to calculate.
Probably because I don't know very well what's it all about.
Your fast procedure seems to work well with n = 100; I don't see an overflow.
Windows (x64)
Raspberry Pi OS (Arm64)
Little John
Addict
Addict
Posts: 4519
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: A small procedure asm

Post by Little John »

I'm sorry. The code in my previous post had a bug. :oops: Now fixed.

The code is still pretty simple.

Code: Select all

This is the definition of the Binomial Coefficient:

 / n \        n!
 |   |  =  ---------
 \ k /     k!*(n-k)!

fvillanova 's code follows the definition literally.

Looking at an example, say 6 choose 4, we get this:

        1*2*3*4 * 5*6     5*6
        -------------  =  ---
        1*2*3*4 * 1*2     1*2
        
This is what the procedure in my previous post does. It just calculates the reduced fraction.
         6/1 * 5/2

(It's also possible to reduce intermediate results, but that makes the code a little longer.)
wilbert wrote:Your fast procedure seems to work well with n = 100; I don't see an overflow.
For a given value of n, there might be no overflow for rather small or rather large values of k.
The greatest risk for getting an overflow (for big n) does exist, when k = Int(n/2).
fvillanova
User
User
Posts: 70
Joined: Wed May 02, 2012 2:17 am
Location: Brazil

Re: A small procedure asm

Post by fvillanova »

wilbert wrote:
fvillanova wrote:Within the calculation routines that I use in my programs are statistical distribution calculations where I need to make millions simulation with factors lexicographical, I do not know if you have seen this kind of calculation.
No, haven't seen before :)
Haven't got a clue what is going on.
Is it some sort of sorting related thing ?

In the end everything can be translated to asm but the question is if it's worth it.
In this case the best thing you can do is try to speed up the Factorial and Combinations procedures to start with.
Is there a limit to the values passed to the Factorial and Combinations procedures ?
If there is and it isn't very big, you could use a lookup table for this. That would already result in a big speed improvement.
I will explain you what is lexicographic order.
If you take 5 numbers (01 to 05) as an example
How many combinations of only two numbers are possíves with these five.
1-01 02
2-01 03
3-01 04
4-01 05
5-02 03
6-02 04
7-02 05
8-03 04
9-03 05
10-04 05
Answer is = 10 different possible combinations of 5 2-2

My procedure calculates the number of possible combinations, see:

Result=Combinations(5,2) "Result is 10"

Now, if I want to know what lexicographic order of "03 04"

Result=Lexicographic_Order("03 04",5,2) "Result is 8"

And finally, I want to know what are the elements of the combination 4?

aux=Elements_Return(5,2,4)
aux is "01 05" in the above example.

Yes, I can put all factor previously in a vector, but this does not greatly accelerates
the process, the more time-consuming processing is in Lexicographic_Order and Elements_Return procedures
see the change below:

Code: Select all

DisableDebugger

Global.i i,Result, time
Global.s aux
Global Dim da.s(99)
Global Dim FacData.d(100)

Procedure.d Factorial(N.i)
  
Protected E.d
E=1:For S=2 To N:E*S:Next
ProcedureReturn E
EndProcedure 
Procedure.d Combinations(n.i,k.i)
Protected calc.d  
  calc=FacData(n)/((FacData(n-k))*FacData(k))
 ProcedureReturn calc
EndProcedure
Procedure.i Lexicographic_Order(cval.s,nval.i,kval.i)   
  Protected.i lval, p1val, ival, jval
  Protected.d rval
  Protected Dim c.i(kval), Dim numv.i(kval): c(0)=0
  For ival=0 To kval-1: numv(ival+1)=Val(PeekS(@cval + ival * 3, 2,#PB_Ascii)): Next
  p1val=kval-1  
  For ival=1 To p1val: c(ival)=0
    If ival<>1: c(ival)=c(ival-1): EndIf  
    Repeat
    c(ival)+1: rval=Combinations(nval-c(ival),kval-ival)  
    lval=lval+rval 
    Until c(ival)>=numv(ival)
    lval=lval-rval   
  Next
    lval=lval+numv(kval)-numv(p1val)
  ProcedureReturn lval
EndProcedure  
Procedure.s Elements_Return(nval.i,pval.i,lval.i)      
  Protected rval.d, p1val.d, kval.d, ival.i, Dim c.i(pval), cval.s, auxval.s 

  p1val=pval-1: c(1)=0
  
  For ival=1 To p1val: If ival<>1: c(ival)=c(ival-1): EndIf
    Repeat
      c(ival)+1: rval=Combinations(nval-c(ival),pval-ival)
      kval=kval+rval
    Until kval>=lval
      kval=kval-rval: cval=cval+auxval+da(c(ival)): auxval=" "
  Next
  c(pval)=c(pval-1)+lval-kval: cval=cval+auxval+da(c(pval)) 
  ProcedureReturn cval
EndProcedure  

For i=0 To 99:If i<10:da(i)="0"+Str(i):Else:da(i)=Str(i):EndIf: Next

For i=0 To 100: FacData(i)=Factorial(i): Next

time = ElapsedMilliseconds()
For i=1 To 2000000
  
  aux=Elements_Return(15,8,6387)    ; constructs the Elements of lexicographic order = 6387 

  Result=Lexicographic_Order("05 08 09 10 12 13 14 15",15,8) ; calculate lexicographic order from "05 08 09 10 12 13 14 15"
  
Next  
time = ElapsedMilliseconds()-time  

MessageRequester("Lexicographic Case:","Processed in "+Str(time)+" milliseconds"+Chr(13)+aux+Chr(13)+Str(Result),#PB_MessageRequester_Ok): End
I modified the procedure Combinations() to:
"calc=FacData(n)/((FacData(n-k))*FacData(k))"
I've done this in some of my routines in other programs,
but the idea is to convert the two main procedures for asm.

To understand where these lexicographic routines are used I'll give you an example:

When an insurance company accepts an insurance policy it already
know how many risks are involved and the cost of each of these risks,
the amount of risk involved is 'n'
Each risk has an average cost,then I can do simulation 2 or 3 risk at a time to determine what
the costs of damages that may occur.
"Combinations(n,2)+Combinations(n,3)"
Using the lexicographic functions I can get 2 or 3 risks involved and check the cost of them
and find out what risks are involved in certain compensation expenses.
This is a simple example, just for you understand where these routines are used
in statistics simulations.

thanks
Fernando
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Using a lookup for the combinations again results in a speed improvement compared to using a lookup for the factorial.
Because of all the string operations it's hard to convert the two routines to asm. I'll see if I can improve the speed.

Code: Select all

DisableDebugger

Global.i i,j,Result, time
Global.s aux
Global Dim da.s(99)

Procedure.d Combinations(n.l, k.l)
  Protected result.d, fstc.w
  !mov edx, [p.v_n]
  !mov ecx, [p.v_k]
  !sub edx, ecx
  !jc comb_return0
  !cmp ecx, edx
  !cmova ecx, edx
  !and ecx, ecx
  !jz comb_return1
  !mov [p.v_k], ecx
  !mov edx, ecx
  ; set 80 bit precision
  !fstcw [p.v_fstc]
  !fldcw [comb_fstc]
  ; main routine
  !fld1
  !fld1
  !fild dword [p.v_n]
  !comb_loop0:
  !fmul st2, st0
  !fsub st0, st1
  !dec ecx
  !jnz comb_loop0
  !fstp st0
  !fld1
  !fild dword [p.v_k]
  !comb_loop1:
  !fmul st2, st0
  !fsub st0, st1
  !dec edx
  !jnz comb_loop1
  !fstp st0
  !fstp st0
  !fdivp
  ; store result
  !fstp qword [p.v_result]
  ; restore fpu control word
  !fldcw [p.v_fstc]
  ProcedureReturn result
  !comb_return0:
  ProcedureReturn 0
  !comb_return1:
  ProcedureReturn 1
  !comb_fstc: dw 0x37f
EndProcedure

Global Dim Combinations_.d(99, 99)
For i = 0 To 99
  For j = 0 To 99
    Combinations_(i, j) = Combinations(i, j)
  Next
Next

Procedure.i Lexicographic_Order(cval.s,nval.i,kval.i)   
  Protected.i lval, p1val, ival, jval
  Protected.d rval
  Protected Dim c.i(kval), Dim numv.i(kval): c(0)=0
  For ival=0 To kval-1: numv(ival+1)=Val(PeekS(@cval + ival * 3, 2,#PB_Ascii)): Next
  p1val=kval-1  
  For ival=1 To p1val: c(ival)=0
    If ival<>1: c(ival)=c(ival-1): EndIf  
    Repeat
    c(ival)+1: rval=Combinations_(nval-c(ival),kval-ival)  
    lval=lval+rval 
    Until c(ival)>=numv(ival)
    lval=lval-rval   
  Next
    lval=lval+numv(kval)-numv(p1val)
  ProcedureReturn lval
EndProcedure  

Procedure.s Elements_Return(nval.i,pval.i,lval.i)      
  Protected rval.d, p1val.d, kval.d, ival.i, Dim c.i(pval), cval.s, auxval.s 

  p1val=pval-1: c(1)=0
  
  For ival=1 To p1val: If ival<>1: c(ival)=c(ival-1): EndIf
    Repeat
      c(ival)+1: rval=Combinations_(nval-c(ival),pval-ival)
      kval=kval+rval
    Until kval>=lval
      kval=kval-rval: cval=cval+auxval+da(c(ival)): auxval=" "
  Next
  c(pval)=c(pval-1)+lval-kval: cval=cval+auxval+da(c(pval)) 
  ProcedureReturn cval
EndProcedure  

For i=0 To 99:If i<10:da(i)="0"+Str(i):Else:da(i)=Str(i):EndIf: Next

time = ElapsedMilliseconds()
For i=1 To 2000000
  
  aux=Elements_Return(15,8,6387)    ; constructs the Elements of lexicographic order = 6387 

  Result=Lexicographic_Order("05 08 09 10 12 13 14 15",15,8) ; calculate lexicographic order from "05 08 09 10 12 13 14 15"
  
Next  
time = ElapsedMilliseconds()-time  

MessageRequester("Lexicographic Case:","Processed in "+Str(time)+" milliseconds"+Chr(13)+aux+Chr(13)+Str(Result),#PB_MessageRequester_Ok): End
Windows (x64)
Raspberry Pi OS (Arm64)
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

Here's what you can do with simple PureBasic optimizations

Code: Select all

DisableDebugger

Procedure.d Combinations(n.l, k.l)
  Protected result.d, fstc.w
  !mov edx, [p.v_n]
  !mov ecx, [p.v_k]
  !sub edx, ecx
  !jc comb_return0
  !cmp ecx, edx
  !cmova ecx, edx
  !and ecx, ecx
  !jz comb_return1
  !mov [p.v_k], ecx
  !mov edx, ecx
  ; set 80 bit precision
  !fstcw [p.v_fstc]
  !fldcw [comb_fstc]
  ; main routine
  !fld1
  !fld1
  !fild dword [p.v_n]
  !comb_loop0:
  !fmul st2, st0
  !fsub st0, st1
  !dec ecx
  !jnz comb_loop0
  !fstp st0
  !fld1
  !fild dword [p.v_k]
  !comb_loop1:
  !fmul st2, st0
  !fsub st0, st1
  !dec edx
  !jnz comb_loop1
  !fstp st0
  !fstp st0
  !fdivp
  ; store result
  !fstp qword [p.v_result]
  ; restore fpu control word
  !fldcw [p.v_fstc]
  ProcedureReturn result
  !comb_return0:
  ProcedureReturn 0
  !comb_return1:
  ProcedureReturn 1
  !comb_fstc: dw 0x37f
EndProcedure

Structure StringElement
  StructureUnion
    c.c[2]
    CompilerIf #PB_Compiler_Unicode
      n.l  
    CompilerElse
      n.w
    CompilerEndIf
  EndStructureUnion
  spacing.c  
EndStructure

Global Elements_Return_Buffer.s = Space(300)

CompilerIf #PB_Compiler_Unicode
  Global Dim Numbers_.l(100)
CompilerElse
  Global Dim Numbers_.w(100)
CompilerEndIf
Global Dim Combinations_.i(99, 99)

Procedure InitLookupTables()
  Protected i, j
  For i = 0 To 9
    PokeS(@Numbers_(i), "0" + Str(i), 2)
  Next
  For i = 10 To 99
    PokeS(@Numbers_(i), Str(i), 2)
  Next
  For i = 0 To 99
    For j = 0 To 99
      Combinations_(i, j) = Combinations(i, j)
    Next
  Next
EndProcedure

InitLookupTables()



Procedure.i Lexicographic_Order(*cval.StringElement, nval.i, kval.i)
  Protected.i c, l, n
  While kval > 1
    kval - 1
    n = *cval\c[0] * 10 + *cval\c[1] - 528
    *cval + SizeOf(StringElement)
    c + 1
    While c < n
      l + Combinations_(nval - c, kval)
      c + 1  
    Wend
  Wend
  l + (*cval\c[0] * 10 + *cval\c[1] - 528) - n
  ProcedureReturn l
EndProcedure  

Procedure.s Elements_Return(nval.i, pval.i, lval.i)
  Protected.i c, k, *cval.StringElement = @Elements_Return_Buffer
  While pval > 1
    pval - 1
    Repeat
      c + 1  
      k + Combinations_(nval - c, pval)
    Until k >= lval
    k - Combinations_(nval - c, pval)
    *cval\n = Numbers_(c)
    *cval + SizeOf(StringElement)    
  Wend
  *cval\n = Numbers_(c + lval - k)
  CompilerIf #PB_Compiler_Unicode
    ProcedureReturn PeekS(@Elements_Return_Buffer, (*cval + SizeOf(StringElement\c) - @Elements_Return_Buffer) >> 1)
  CompilerElse
    ProcedureReturn PeekS(@Elements_Return_Buffer, *cval + SizeOf(StringElement\c) - @Elements_Return_Buffer)
  CompilerEndIf  
EndProcedure  



Global.i i,Result, time
Global.s aux

time = ElapsedMilliseconds()
For i=1 To 2000000
  
  aux=Elements_Return(15,8,6387)    ; constructs the Elements of lexicographic order = 6387 
  
  Result=Lexicographic_Order(@"05 08 09 10 12 13 14 15",15,8) ; calculate lexicographic order from "05 08 09 10 12 13 14 15"
  
Next  
time = ElapsedMilliseconds()-time  

MessageRequester("Lexicographic Case:","Processed in "+Str(time)+" milliseconds"+Chr(13)+aux+Chr(13)+Str(Result),#PB_MessageRequester_Ok): End
Windows (x64)
Raspberry Pi OS (Arm64)
fvillanova
User
User
Posts: 70
Joined: Wed May 02, 2012 2:17 am
Location: Brazil

Re: A small procedure asm

Post by fvillanova »

Again I need to optimize a routine, see how I do in Perl programming
where you can allocate the entire contents of an array directly into
a string variable in one command:

$array[0]="Hello"; $array[1]="World";
$string="@array";


At this time the variable $string is equals "Hello World"
and automatically a space is placed between the array elements.
Perl is a great language for manipulating with strings.

There are several ways to do this in PureBasic but I need
a fast option that can process hundreds of thousands of these
situations per second.

That is, take all the elements of an string array (ASCII)
directly into a simple string variable (each element of the array
separated by a space).

Here is a way that works but is not fast enough:

Code: Select all

 Dim Array.s(1) ; The array may have hundreds of elements and all elements are in ASCII
 Define.s S
 #espace=" "
 Array(0)="Hello": Array(1)="World"
 S=Array(0): For i=1 To ArraySize(Array()): S=S+#espace+Array(i): Next
 MessageRequester("Debug",S,0)
Anyone have an idea for routine be faster?
thanks
Fernando
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

It's not asm, but try this

Code: Select all

Procedure.s Join(Array StringArray.s(1), Separator.s = " ")
  
  Protected.i asize, i, slen, tlen, *buffer
  asize = ArraySize(StringArray())
  slen = Len(Separator)
  For i = 0 To asize
    tlen + Len(StringArray(i)) + slen
  Next
  tlen - slen
  
  Protected Dim buffer.c(tlen)
  *buffer = @buffer()
  CopyMemoryString(StringArray(0), @*buffer)
  For i = 1 To asize
    CopyMemoryString(Separator)
    CopyMemoryString(StringArray(i))
  Next
  
  ProcedureReturn PeekS(@buffer())
  
EndProcedure


; test join procedure

Dim MyStrings.s(9999)
For i = 0 To 9999
  MyStrings(i) = Str(i)
Next

s.s = Join(MyStrings(), ", ")

Debug s
If you want to benchmark it, you need to do so with debugger disabled of course :wink:
Windows (x64)
Raspberry Pi OS (Arm64)
fvillanova
User
User
Posts: 70
Joined: Wed May 02, 2012 2:17 am
Location: Brazil

Re: A small procedure asm

Post by fvillanova »

Again I need to thank you Wilbert!
My routine has led 4280 milliseconds and your idea 311 milliseconds !!!
It was 14 times faster, see the test processed:

Code: Select all

 Dim Array.s(100) 
 Define.s A
 #espace=" "
 Define.i i,j,t1
 For i=0 To 100: Array(i)=Chr(Random(90,65)): aux=aux+Array(i): Next
 t1 = ElapsedMilliseconds()
 For j=1 To 100000 
   A=Array(0): For i=1 To ArraySize(Array()): A=A+#espace+Array(i): Next
 Next
 t1=ElapsedMilliseconds()-t1
 MessageRequester("Debug",Str(t1) + " ms" ,0)

Code: Select all

Procedure.s Join(Array StringArray.s(1), Separator.s = " ")
  Protected.i asize, i, slen, tlen, *buffer
  asize = ArraySize(StringArray())
  slen = Len(Separator)
  For i = 0 To asize
    tlen + Len(StringArray(i)) + slen
  Next
  tlen - slen
  Protected Dim buffer.c(tlen)
  *buffer = @buffer()
  CopyMemoryString(StringArray(0), @*buffer)
  For i = 1 To asize
    CopyMemoryString(Separator)
    CopyMemoryString(StringArray(i))
  Next
  ProcedureReturn PeekS(@buffer())
EndProcedure

; test join procedure
 Dim Array.s(100) 
 Define.s A
 #espace=" "
 Define.i i,j,t1
 
 For i=0 To 100: Array(i)=Chr(Random(90,65)): Next
 t1 = ElapsedMilliseconds()
 For j=1 To 100000 
   A=Join(Array(),#espace)
 Next
 t1=ElapsedMilliseconds()-t1
 MessageRequester("Debug",Str(t1) + " ms" ,0)
Thanks a lot for the help.
Fernando
fvillanova
User
User
Posts: 70
Joined: Wed May 02, 2012 2:17 am
Location: Brazil

Re: A small procedure asm

Post by fvillanova »

Hi Wilbert,
With your routine I made some changes and Now it will be more versatile, see:

Code: Select all

Procedure.s ArrayToString(Array StringArray.s(1),Separator.s=" ",StartE.i=0,EndE.i=1) 
  Protected.i i, slen, tlen, *buffer
  slen = Len(Separator)
  For i = StartE To EndE
    tlen + Len(StringArray(i)) + slen
  Next
  tlen - slen
  Protected Dim buffer.c(tlen)
  *buffer = @buffer()
  CopyMemoryString(StringArray(StartE), @*buffer): StartE+1
  For i = StartE To EndE
    CopyMemoryString(Separator)
    CopyMemoryString(StringArray(i))
  Next
  ProcedureReturn PeekS(@buffer())
EndProcedure
wilbert
PureBasic Expert
PureBasic Expert
Posts: 3870
Joined: Sun Aug 08, 2004 5:21 am
Location: Netherlands

Re: A small procedure asm

Post by wilbert »

I'm happy you can use it and make the changes which suit you best :)
Windows (x64)
Raspberry Pi OS (Arm64)
Sooraa
User
User
Posts: 48
Joined: Thu Mar 12, 2015 2:07 pm
Location: Germany

Re: A small procedure asm

Post by Sooraa »

@Wilbert:

In BioInformatics a method called "k-mer counting" is quite popular and helpful in DNA analyses.
My core-programs are based on Purebasic's maps, provide reverse complement and canonical k-mer options and are unfortunately limited to
PB's 5.40 because of Ascii. Especially on higher k-mers the map-concept shows it's limits in performance and memory consumtion.

A good overview of the functions and demands of Kmer-Counting is under
https://bioinfologics.github.io/post/20 ... roduction/

More by chance I stepped by the 2015-thread "A small procedure asm" and electrified my senses: A trie based on x64/x86,
even no "R-registers", no SSE, no AVX and minimum globals, Unicode/Ascii and with no limits on # of elements and element length!

I have the impression, that the potential speed-improvements of SSE/AVX are more than compensated by the less reg2SSE and SSE2reg
and the speed advantages through the parallism of instruction-primitives in multiple Execution Units (not cores).
I thought: "this is an promising approach". I can handle ASM-mechanics quite well but broke my fingers to adopt your example to k-mer
functionality.

Of course even your approch will not take the advantage of DNS 2-bit data representations, but to deal with compact 2-bits for "ACTG" data,
internally it's more flexible to stay with 8-bit-representation, even with the limits of one-EU-only BMI-instructions.

A good source for real DNA-data (besides little genomes as 2phi-X174 with 5386 bp and and ecoli with 4 mio. bps, even human genome sets -approx. 3 GB-) is available at https://www.ncbi.nlm.nih.gov/genbank/

As a principle simulation for your ASM I manually constructed a strictly 2-mer input of the ecoli bacteria DNA from the first base pairs:
Original: "AGCTTTTCATTCTGACTGCAA" ....
string1 = "CT"
string2 = "AG GC CT TT TT TT TC CA AT TT TC CT TG GA AC CT TG GC CA AA"
for you. Your ASM results out correctly as 3.

Can you imagine a version of your ASM-trie for a non-space-separated DNA k-mer version?

This would be be a superior alternative to the established "Jellyfish / KAT" solutions. If so, and combined with PB-graphics, I feel we could
have an alternative new DNA analysis base. It could be enhanced with char-by-char "longest common subsequence" / "longest common substring
options". I also have a memory-sparse one-row-Damerau-Levenshtein and OSA core function available in ASM yet. The expansion to multi-core
performance is also a topic and has to be seen.

Thank you in advance
User avatar
idle
Always Here
Always Here
Posts: 5019
Joined: Fri Sep 21, 2007 5:52 am
Location: New Zealand

Re: A small procedure asm

Post by idle »

Interesting, I didn't know what k-mers was, wouldn't you typically do an analysis over a range of k rather than a fixed k

Tries are tricky structures if you want a generic trie implimention you could try Squint which should work well for larger k and should be quicker than a map for lookups as it can bail out earlier on no match rather than processing the entire key.
viewtopic.php?f=12&t=75783

heres an example that could be adapted to do a range of k without the spaces
viewtopic.php?p=569612#p569612
Post Reply