Page 1 of 1
string searching
Posted: Thu Oct 14, 2004 6:35 pm
by localmotion34
i'm writing a program to search through a string, a string of nucleotides in DNA. the function needs to look at the user input, search the string and compare each set of characters to a pre-defined sequence.
AAATTGCATAATCGGGATACGCGCGTA
---this would be a DNA sequence
the restriction enzyme BamHI recognizes the sequence GATAC
the restriction enzyme HinDIII recognizes the sequence CGCGT
now i need to run through about 300 more different enzyme recognition patterns, searching DNA sequences up to 8000 characters. any ideas where to start?
Posted: Thu Oct 14, 2004 7:43 pm
by wilbert
I don't know much about dna and enzymes. What's the longest enzyme you need to check ? Are they all 5 characters ?
Posted: Thu Oct 14, 2004 9:11 pm
by Pupil
Maybe you can use hash tables?
Posted: Thu Oct 14, 2004 9:21 pm
by localmotion34
hash tables? can you explain?
Posted: Thu Oct 14, 2004 9:37 pm
by Pupil
You can use a 16 bit hash (checksum, CRC16 comes to mind) algorithm and convert your search strings into a checksum that you store in a list, you also need to store the number of characters for that sequence. Then you use the same hash algorithm on the DNA sequence and compare with the list you've created. You can also store the search string checksums in a lookuptable for quick determination if you got a positive hit...
Posted: Thu Oct 14, 2004 10:11 pm
by localmotion34
errrrr, ok. honestly, that was greek to me. can anyone post a simple example for me? just to get me started.
Searching for Matching Strings Within Strings.
Posted: Fri Oct 15, 2004 4:00 am
by oldefoxx
Since you are talking about 8000 or so sequences, I would suggest that
you create a text file with one sequence per line, such as:
GATAC ;restriction enzyme BamHI
CGCGT ;restriction enzyme HinDIII
The format permits any combination of characters. The semicolon
marks off an optional comment. You can add to, modify, or rearrange
the contents of this file whenever you want. Using it as your source,
your program can easily process the whole DNA sequence.
The DNA sequence can either be embedded into the program if it is
static, or could be read from the same file, or a different one. Since
PureBasic has a maximum string length for any one string of just over
64,000 characters, the DNA sequence cannot be longer than that, or you
will have to find another way to hold the sequence, possibly in an array
or in a memory bank.
Searching for one of your sequences is then just a matter of reading the
sequence file one line at a time, discarding everything from the first
space to the end of the line, then using FindString(DNA$,Sequence$).
IF the value returned is non-zero, the sequence was found, and the
value indicates its offset from the beginning of the DNA$ string. If
you find you have to put the DNA sequence into a memory block, you
can use the CompareMemoryString() function instead.
One problem to recognize is the possibility of overlapping sequences.
That is, two sequences appear to both be in in the DNA sequence, but
a portion of the DNA sequence appears to be the trailing part of one
sequence and the leading part of the other. For instance, using the
two example sequences above, if you found a section of the DNA
sequence that was coded GATACGCGT, you would get a match for both
GATAC, beginning at the first character, and CGCGT, beginning at the
5th character. But the DNA sequence may have its own rules as to
whether overlapping sequences are valid - it may be bound by rules
that say that each found sequence must be positioned on a certain
boundary to be valid. If the sequences you are looking for are all
five characters in length, then you could use the MOD(offset,5)=1 test
to ensure any match begins on a boundary, such as 1, 6, 11, 16, etc.
You would have to work with an offset in FindString() to work beyond
the first find and on to any subsequent one if you have to cope with
boundary restrictions.
If coded sequences could be of variable length, then you cannot be sure
of what the current boundary should be. For instance, if one valid
sequence is five characters, another is six characters, then the order
of which is first, five followed by six, or six followed by five, would
cause a difference in the boundary position between the two. This problem would then mandate that you have to identify and process each
sequence in the natural order as it would be found in the DNA sequence,
and either know where its starting position would be, or replace each
found sequence in the DNA string with an equal number of spaces or
other non-sequence character. For instance, if you find "GATAC",
you would replace these same characters with something like ".....".
This would prevent a false match to something else in the future.
So like I said, it is fairly straight forward as to how to find matches, but
the trick is to avoid the possibility of false matches, and the way to do
this is to incorporate rules that are in accord with the way that nature
appaewntly has the DNA sequence interpreted.
Posted: Fri Oct 15, 2004 5:31 am
by wilbert
I'm sorry you didn't answer my question. If the sequences you have to search for are small (a width of 10 or below) I think I know a very fast solution that can search through a sequence with a width of 8000 in a few milliseconds but I don't know if they all are that small.
Posted: Fri Oct 15, 2004 7:47 am
by oldefoxx
I'm sorry Wilbert, but who are you addressing your question to? If to
me, I know nothing about DNA sequencing per se, but I am very
familiar with techniques for finding data within strings. The answer I
gave you shows some of the considerations in trying to match up
possible sequences while avoiding possible errors.
However, a quick google query gave results that the length of sequence
groups range from 6 to ten elements in human DNA. But that bit of
research also indicated that such groups are often repetative within
a DNA sequence, and generally classified into three groups: Highly
repetative (measured in hundreds of thousands of copies), moderately
repetative, and either singular or slightly repetative. The reason seems
to be nature's way of insisting that some traits need to be more
prevalent than others.
Generating the query file is the trick - you can only search for
sequences that you know already, to see if they appear in a
particular DNA sequence. But this technique is absolutely no good
if it comes to identifying new sequences, unless you use known
sequences to eliminate themselves from the DNA sequence so that
you can focus your further research on the remaining elements in the
DNA sequence. The problem is, that based on the info quickly
acquired above, DNA sequences must be much longer than a mere
8,000 elements, and you may have to allow for many duplicates
within the DNA sequence. I'm not sure how you intend to resolve
that problem.
The six to ten elements seem in contrast to the two groups that were
posted in the example, which only had five elements each.
If you know that all valid sequences are, say, ten elements in number
or less and the minimum number of elements is five or six, you can
search for a match against the first ten (maximum length) elements of
the DNA sequence.
If you find a match within the first ten character positions, but not starting
at the first character position, you might infer that you have a unknown
sequence set forth starting at the first character position. You can cut that sequence away for further study.
Now with the found sequence starting at the first portion of the DNA sequence , you cut the DNA sequence again at the found sequence length plus 1. You also range through the whole of the DNA sequence and
eliminate all copies, shortening the DNA sequence each time a matching
group is eliminated. If you need information on where any match
occurs, then you just replaced the group characters in the DNA
sequence with some invalid character (such as ".") to maintain
character positions.
If you made no match, you cut the DNA sequence after the least length
position, which would be starting at the sixth or seventh position. Then you retain that cut segment separately.
You repeat the process again and again, each time shortening the DNA
sequence from the front, but maintaining a count of how many
characters have been cut from the front so that you know how far
into the DNA sequence you have managed to get.
This cycle will keep reducing the length of the DNA sequence by at
least five or six characters. This means that the DNA sequence
is going to keep getting shorter and shorter, and the reported
sequences will be reported in the order that they first appeared in
the DNA sequence.
This approach also gives a hint of what group sequences are there that
have yet to be identified, the order in which they occur in the DNA
strand, even possibly the first position where found. A little more
work and you can count up the number of copies of the same sequence
group as found in the DNA strand.
Now that is about all I can tell you. What you do with this, or how you
do it, still remains to be resolved. But I am not sure if you are looking
for someone to explain DNA processing or techniques for finding one
string within another, or someone to contribute code to your project.
Your best course of action is to decide exactly what the process for
analyzing a DNA structure should be, then steer the discussion into the
best means of doing certain tasks, with the particulars spelled out so
that the code gives you exactly what is required.
Posted: Fri Oct 15, 2004 10:03 am
by GedB
Is a String the best way to represent your DNA?
My understanding is that that there are four kinds of nucleotides, but you are using a character which is represented by a byte with 256 possible values.
http://en.wikipedia.org/wiki/Deoxyribonucleic_acid
Within that range of 64 values you are using very arbitary positions: A= 65, T = 84, C = 67 and G = 71.
Since you only have 4 possible values, you only need 2 bits to store it.
This means that each byte can store up to 4 necleotides.
However, according to the Wikipedia they are grouped into sets of 3 (codons). It may be more useful to ignore 2 bits of each byte and have each byte represent a codon.
Using this binary approach will allow you to write some very fast search routines using ASM.
Posted: Fri Oct 15, 2004 11:58 am
by wilbert
Oldefoxx, I was referring to localmotion34 who asked to quesion. It's hard to give a good advice if I don't understand exactly what is supposed to happen.
Like GedB, I would suggest a numeric approach. Since you only got G, A, T and C, all of them can be represented by two bits. If the sequences to search for aren't more than 16 chars, all sequences that need to be searched for can be represented by a 32 bit long value.
If you create lookup tables for all different widths, all you have to do each time is slide in two bits from the right, apply different masks for each width and the result values can directly be used for the lookup tables. The lookup tables could contain zero if the sequence has no meaning or a value or string pointer if it's relevant. Such a routine could easily be ported to asm for speed optimization since a register can contain a long value and you already got instructions that can shift bits.
Posted: Sat Oct 16, 2004 10:09 am
by GedB
localmotion34,
Given it more thought. The approach you want is to create a state machine, represented by a tree. This is the method used to code regular expressions.
This approach will allow you to search for an unlimited amount of patterns in just one pass.
The principle is this. I'm going to explain with a sequence of just three possible values : A, C and G. Adding more is simple to code, but hard to explain
Lets say with have 4 patterns to test
AAC
ACG
GCA
AC
From thes patterns we can build a state tree whose branches will lead to either success or failure.
The tree for this one looks like this (* Indicate a match):
Code: Select all
Level Root
/ \
1 A G
/ | |
2 A C* C
| \ /
3 C* G* A*
Imagine we are testing against the sequence AACGCA. It gives us 4 matches:
Code: Select all
1 2 3 4 5 6
Char A A C G C A
1 A A C
2 A C G
2 A C
4 G C A
Then you iterate through the string, taking one character at a time. For each character you create a pattern tree and test it against the root. It you have success the tree remains. You progress along its branch and test it against the next character.
In our example we take the first character: A.
To test we start at the root, that offers 2 branches: A and G.Anything beginning with C will fail straight away. We begin with A, so we test that against root. We have a branch for A, so we take it to node A1
Next we take the second character: A again.
We create a new pattern tree for char 2 and it too moves to node A1. The pattern tree for char 1 also survives to node A2.
Then comges C.
New tree for char 3: Failure. No branch for C. We delete this tree and will hear no more of it.
Tree for Char 2: Success. There is a branch for C. Also, C2 is marked as a match, so we save the match.
Tree for char 1: Success. C3 is another match node. So we save our match.
Character 4 next: G.
New tree for char 4: Success. Progress to node G1.
Tree for char 2: Success. G3 is this tree's second match!
Tree for char 1: Failure. It is deleted after it's only match. Poor thing.
I'm sure that's enough to give you the idea. I'd start with string encoding first. Then you could optimise with a binary encoding if you wanted.
Posted: Sat Oct 16, 2004 7:07 pm
by oldefoxx
I posted a reply last night, but it doesn't show up here. Maybe my use
of non-PureBasic code reference caused it to be edited out. Maybe
a network or emal server problem caused it to be lost (Cox is still
fixing hurricane damage to its network here in Pensacola).
In any case, I am not going to repeat myself. But I posted code and
got another related discussion started at
http://www.powerbasic.com/support/forum ... q+Checking
Check it out if interested. Or you can go to
www.PowerBasic.com,
select forums, then under User-to_User Discussions, Pick
PowerBasic Console Compiler, then find the topic Rudimentary DNA
Seq Checking.
Posted: Tue Oct 19, 2004 9:21 am
by GedB
Finally managed to get round to coding the tree approach.
Code: Select all
;The patterns will be held in an array
NewList Patterns.s()
AddElement(Patterns())
Patterns() = "#ERR"
Structure PatternState
a.l
c.l
g.l
t.l
match.l
EndStructure
;The patterns will also be in a global linked list
;as a traversable tree
NewList PatternTree.PatternState()
AddElement(PatternTree()) ;Set the first element
;When a match is performed all of the matches will be
;placed in a linked list.
Structure SequenceMatch
PatternID.l ;The id of the matched pattern
StartPos.l ;The position in the squence that the pattern begins at
EndStructure
NewList SequenceMatches.SequenceMatch()
Procedure AddPattern(Pattern.s)
;Add our pattern to the state tree.
;Move to the beginning of the pattern tree
FirstElement(PatternTree())
;*PreviousElement is used to back reference
DefType.PatternState *PreviousElement
For i = 1 To Len(Pattern)
char.b = Asc(LCase(Mid(Pattern, i, 1)))
;As we traverse the tree we will either find that there is
;a branch either exists or does not.
;If it exists, use it.
;If it does not exist, create it.
Select char
Case 'a'
If PatternTree()\a = 0 ;We need to create a new branch
*PreviousElement=PatternTree()
LastElement(PatternTree())
AddElement(PatternTree())
*PreviousElement\a=ListIndex(PatternTree())
Else ;We will take the existing branch
SelectElement(PatternTree(),PatternTree()\a)
EndIf
Case 'c'
If PatternTree()\c = 0 ;We need to create a new branch
*PreviousElement=PatternTree()
LastElement(PatternTree())
AddElement(PatternTree())
*PreviousElement\c=ListIndex(PatternTree())
Else ;We will take the existing branch
SelectElement(PatternTree(),PatternTree()\c)
EndIf
Case 'g'
If PatternTree()\g = 0 ;We need to create a new branch
*PreviousElement=PatternTree()
LastElement(PatternTree())
AddElement(PatternTree())
*PreviousElement\g=ListIndex(PatternTree())
Else ;We will take the existing branch
SelectElement(PatternTree(),PatternTree()\g)
EndIf
Case 't'
If PatternTree()\t = 0 ;We need to create a new branch
*PreviousElement=PatternTree()
LastElement(PatternTree())
AddElement(PatternTree())
*PreviousElement\t=ListIndex(PatternTree())
Else ;We will take the existing branch
SelectElement(PatternTree(),PatternTree()\t)
EndIf
Default
Debug "Pattern '" + Pattern + "' should only contain 'a', 'c', 'g' Or 't'"
ProcedureReturn 0
EndSelect
Next i
;Store our pattern
LastElement(Patterns())
AddElement(Patterns())
Patterns() = Pattern
;The index is the patterns new id
PatternID = ListIndex(Patterns())
;Set the current node of the pattern tree as a match
PatternTree()\match = PatternID
ProcedureReturn PatternID
EndProcedure
Structure PossibleMatch
StartingPosition.l
*CurrentElement.PatternState
EndStructure
NewList PossibleMatches.PossibleMatch()
Procedure ShowPossibleMatches()
Debug ""
Debug "PossibleMatches:"
ForEach PossibleMatches()
Debug " " + Str(PossibleMatches()\StartingPosition)
Next
EndProcedure
Structure Match
PatternID.l
StartingPosition.l
EndStructure
NewList Matches.Match()
Procedure MatchSequence(Sequence.s)
;Clear PossibleMatches and Matches for a new run
ClearList(PossibleMatches())
ClearList(Matches())
;Iterate through the given sequce.
For i = 1 To Len(Sequence)
;For each symbol in the sequence create a new match
;A match will remain as long as it has a branch
;to follow. When it reaches a dead end it is eliminated
;If, anywhere along the way, it encounters a match we
;save it.
char.b = Asc(LCase(Mid(Sequence, i, 1)))
;Add a new PaternMatch
FirstElement(PatternTree())
LastElement(PossibleMatches())
;Set it to the root of the PatternTree
FirstElement(PossibleMatches())
AddElement(PossibleMatches())
PossibleMatches()\StartingPosition = i
PossibleMatches()\currentElement = PatternTree()
ShowPossibleMatches()
FirstElement(PossibleMatches())
Debug Str(i) + " : " + Chr(char)
ForEach PossibleMatches()
ElementDeleted = #False
Select char
Case 'a'
If PossibleMatches()\currentElement\a = 0 ;The path ends here, so does this match
ElementDeleted = #True
Else ;We can keep going
SelectElement(PatternTree(), PossibleMatches()\currentElement\a)
PossibleMatches()\currentElement = PatternTree()
EndIf
Case 'c'
If PossibleMatches()\currentElement\c = 0 ;The path ends here, so does this match
ElementDeleted = #True
Else ;We can keep going
SelectElement(PatternTree(), PossibleMatches()\currentElement\c)
PossibleMatches()\currentElement = PatternTree()
EndIf
Case 'g'
If PossibleMatches()\currentElement\g = 0 ;The path ends here, so does this match
ElementDeleted = #True
Else ;We can keep going
SelectElement(PatternTree(), PossibleMatches()\currentElement\g)
PossibleMatches()\currentElement = PatternTree()
EndIf
Case 't'
If PossibleMatches()\currentElement\t = 0 ;The path ends here, so does this match
ElementDeleted = #True
Else ;We can keep going
SelectElement(PatternTree(), PossibleMatches()\currentElement\t)
PossibleMatches()\currentElement = PatternTree()
EndIf
Default
Debug "Sequence '" + Sequence + "' should only contain 'a', 'c', 'g' Or 't'"
ProcedureReturn 0
EndSelect
If ElementDeleted = #False
Debug Str(PossibleMatches()\StartingPosition) + " - " + Str(ListIndex(PatternTree()))
If PossibleMatches()\currentElement\match > 0
LastElement(Matches())
AddElement(Matches())
Matches()\StartingPosition = PossibleMatches()\StartingPosition
Matches()\PatternID = PossibleMatches()\currentElement\match
EndIf
Else
DeleteElement(PossibleMatches())
EndIf
Next
Next i
EndProcedure
Procedure ShowPatternTree()
Debug ""
Debug "PatternTree:"
ForEach PatternTree()
Debug " " + Str(ListIndex(PatternTree())) + ": " + Str(PatternTree()\a) + " " + Str(PatternTree()\c) + " " + Str(PatternTree()\g) + " " + Str(PatternTree()\t) + " " + Str(PatternTree()\match)
Next
Debug ""
Debug "Patterns:"
ForEach Patterns()
Debug Str(ListIndex(Patterns())) + ": " + Patterns()
Next
EndProcedure
Procedure ShowMatches()
Debug ""
Debug "Matches:"
ForEach Matches()
Debug " " + Str(ListIndex(Matches())) + ": " + Str(Matches()\StartingPosition) + " " + Str(Matches()\PatternID)
Next
Debug ""
Debug "Patterns:"
ForEach Patterns()
Debug Str(ListIndex(Patterns())) + ": " + Patterns()
Next
EndProcedure
Debug AddPattern("ACGT")
Debug AddPattern("TAcGA")
Debug AddPattern("TAcG")
Debug AddPattern("ACCB")
Debug AddPattern("TTTT")
ShowPatternTree()
MatchSequence("TACGTTTTACGA")
ShowMatches()