Page 1 of 2

Compare String

Posted: Mon Apr 29, 2013 8:05 pm
by martin66119
Hi,

I am looking for a procedure to compare strings very, very fast.

example:
I have to lists.

List 1: 100000 Stings (each string has about 140 different characters)
List 2= 100100 Strings (each string has about 140 different characters)

Some strings are different between List 1 and List 2 (i.g. 123)

The problem is to find out the different strings between List 2 and List 1.

The following procedure ist to slow.

ForEach List1()
For Each Liste2()
If List1() <> List2())
AddElement(List3())
List3() = List1()
endif
Next
Next

Thank you very much for your help.

I am looking for a codeexample.
Martin

Re: Compare String

Posted: Mon Apr 29, 2013 8:35 pm
by davido
You could probably at least double the speed if you used string arrays rather than lists.

Re: Compare String

Posted: Mon Apr 29, 2013 8:52 pm
by idle
use a map with the string as the key
that way you can check for duplications as your filling it

Another way it to use a trie though whether it would be faster than a map
is more a case of code optimization.

Re: Compare String

Posted: Mon Apr 29, 2013 9:05 pm
by skywalk
Use arrays and CompareMemoryString().

Re: Compare String

Posted: Mon Apr 29, 2013 9:29 pm
by martin66119
Thank you very much for your help.

First of all: I am a beginner and am I not familar with

Ergebnis = CompareMemoryString(*String1, *String2 [, Modus [, Länge [, Flags]]])

Right now I have two list

ReadFile(1,FileName1$)
While Eof(1)=0
Zeile.s=ReadString(1)
AddElement(Liste2())
Liste1() = Zeile.s
Wend

ReadFile(1,FileName2$)
While Eof(1)=0
Zeile.s=ReadString(1)
AddElement(Liste2())
Liste2() = Zeile.s
Wend

How can I use the CompareMemoryString(*String1, *String2 ) to compare both List in a loop.

Sorry for the question

Re: Compare String

Posted: Mon Apr 29, 2013 9:30 pm
by Kiffi
martin66119 forgot to mention the corresponding thread in the german forum:

http://www.purebasic.fr/german/viewtopi ... 90#p311612

Greetings ... Kiffi

Re: Compare String

Posted: Mon Apr 29, 2013 9:32 pm
by martin66119
Yes you are right. I have forgotten to mentioned that.

Re: Compare String

Posted: Mon Apr 29, 2013 10:06 pm
by idle
well if your reading from a file you could adapt this

list 1 holds the main list from s1
list 2 stores the duplicates from s2
list 3 stores the unique from s2
list 4 the order of the second list pointing to items in list2 and list 3

Code: Select all

Global NewMap MA.s() 
Global NewList list1() ;main list 
Global NewList List2() ;duplicates 
Global NewList list3() ;unique 
Global NewList list4() ;order of the second list pointing to items in list2() or list3() 

Procedure DeDup(s1.s,s2.s) 
   
   AddMapElement(MA(),s1) 
   AddElement(list1())
   MA() = s1 
   list1() = @MA() 
   If FindMapElement(ma(),s2)
      AddElement(list2()) 
      AddElement(list4())
      list2() = @MA() 
      list4() = @list2() 
   Else 
      AddMapElement(MA(),s2)
      MA() = s2
      AddElement(list3())
      AddElement(list4()) 
      list3() = @MA()
      list4() = @list3() 
   EndIf    
    
EndProcedure 

Procedure FillLists() 
   
   For a = 1 To 100 
      If a % 2 
         b= a + 100 
      Else 
         b=a 
      EndIf    
      DeDup("some string " + Str(a),"some string "+ Str(b))      
   Next 
   
EndProcedure 

FillLists()
*ps.string 

ForEach list2() 
   *ps = list2() 
   Debug "duplicate "  + *ps\s 
Next 
ForEach list3() 
   *ps = list3() 
   Debug "unique "  + *ps\s 
Next 

ForEach list4() 
   *ps = PeekI(list4()) 
   Debug "order " + *ps\s  
Next    


Re: Compare String

Posted: Mon Apr 29, 2013 10:43 pm
by martin66119
Thanks a lot for the example.

Do you have an other example using CompareMemoryString.
I do not know if the following lines are ok!

Code: Select all

Dim Array1 (20000)
Dim Array2 (20000)
Dim Array3 (20000)

j = 0
For i = 1 to  LastElement Array1
   *String1 = @Array(i)
       For j = 1 to  LastElement Array2
          *String2 = @Array2(j)

           Result = CompareMemoryString(*String1, *String2))

           If Result = #PB_String_Equal
           else
              Array3(j) = *String1
              j = j +1
           Endif
     Next
Next

Re: Compare String

Posted: Mon Apr 29, 2013 11:43 pm
by J. Baker
If you're only comparing two strings... If FindString() = 0. ;)

Re: Compare String

Posted: Tue Apr 30, 2013 1:21 am
by idle
If you're reading the strings in from two files it will be more effective to sort the items as you read in the second file
rather than sorting them out after you've loaded the files.

Do you just want a merged list or do you need to keep references of duplicates or unique items?

if you need a reference to a list of duplicates and unique items

Code: Select all

Global NewMap Ma.s() 
Global NewList list1() ;main list 
Global NewList ListDuplicate() ;duplicates 
Global NewList listUnique() ;unique 

Procedure Filllist1(s1.s) 
   AddElement(list1())
   If FindMapElement(Ma(),s1)
      list1() = @MA()
   Else    
     AddMapElement(Ma(),s1) 
     Ma() = s1 
     list1() = @ma()
   EndIf    
EndProcedure    

Procedure FillList2(s2.s) 
  
   If FindMapElement(ma(),s2)
      AddElement(listDuplicate()) 
      listDuplicate() = @MA()
   Else 
     AddMapElement(MA(),s2)
     Ma() = s2
     AddElement(listUnique())
     listunique() = @Ma() 
   EndIf 
EndProcedure 

Define a 

;read in file 1 
For a = 1 To 100 
   Filllist1("some string " + Str(a))    
Next 

;read in file 2 
For a = 50 To 150 
   FillList2("some string " + Str(a))
Next  

Define *ps.string 

ForEach listUnique() 
   *ps = listUnique() 
   Debug "unique "  + *ps\s 
Next 

or if you just want a combined list

Code: Select all

Global NewMap Ma.s() 
Global NewList listCombined() 

Procedure Filllist(s1.s) 
  If Not FindMapElement(Ma(),s1)
    AddMapElement(Ma(),s1) 
    AddElement(listCombined())
    Ma() = s1 
    listCombined() = @ma()
  EndIf    
EndProcedure    

Define a 

For a = 1 To 100 
   Filllist("some string " + Str(a))    
Next 

For a = 50 To 150 
   FillList("some string " + Str(a))
Next  

Define *ps.string 

Debug "the combined lists output" 

ForEach listCombined() 
   *ps = listCombined() 
   Debug *ps\s 
Next    


Re: Compare String

Posted: Tue Apr 30, 2013 6:44 am
by buddymatkona
@martin66119
Here is your CompareMemoryString snippet tweaked to summarize Strings_in_Array1_not_in_Array2 and vice versa using two different methods. idle's map should be faster. :)

Code: Select all

Na1 = 2000 : Na2 = 3004 : Na3 = Na1 + Na2 : Na4 = Na2
Dim Array1.s (Na1) : Dim Array2.s (Na2) : Dim Matches.i(Na4) : Dim Array3.s (Na3)

DefaultStr.s = "DefStr" : DiffStr.s = "DiffStr"
For i = 1 To Na1 
  Array1(i) = DefaultStr  ; Initialize
Next
For i = 1 To Na2 
  Array2(i) = DefaultStr  
Next

Array1(1) = DiffStr + "11" : Array1(Na1/2) = DiffStr + "100001" : Array1(Na1) = DiffStr + "200001" ; Set up mismatches to find
Array2(1) = DiffStr + "12" : Array2(Na2/2) = DiffStr + "100002" : Array2(Na2 -2) = DiffStr + "200002" : Array2(Na2) = DiffStr + "200022"

k = 1
For i = 1 To  Na1 ;                                                                      Test for differences
    *String1 = @Array1(i)
    Ifound = 0
    For j = 1 To  Na2
       *String2 = @Array2(j)
       Result = CompareMemoryString(*String1, *String2)
       If Result = #PB_String_Equal
          Matches(j) = 1
          Ifound = 1
       EndIf
    Next j
    If Ifound = 0
       Array3(k) = Array1(i) + " From Array1("+Str(i)+")" ;                     Save Strings_in_Array1_not_in_2
       k = k +1
    EndIf           
Next i
   
For i = 1 To Na2
  If Matches(i) = 0
     Array3(k) = Array2(i) + " From Array2("+Str(i)+")" ;                        Save Strings_in_Array2_not_in_1
     k = k +1
  EndIf
Next
   
For i = 1 To k-1
   Debug Array3(i)
Next

Re: Compare String

Posted: Tue Apr 30, 2013 12:18 pm
by Demivec
Here is a variation of idle's code that I think results in a speed improvement:

Code: Select all

;set number of simulated lines each file to be read
CompilerIf #PB_Compiler_Debugger 
  #tests = 100 ;limit the size for debugged output
CompilerElse
  #tests = 100000
CompilerEndIf 

;speed will increase for a larger number of slots in the map, but use more memory (as written it uses about 48k/map for 100000 lines in file)
Global NewMap map1.i(#tests / 8)  
Global NewMap noMatch(#tests / 8)

;if the strings are needed in the order they first occur in the file then set line = to the line where the string occurs
Procedure recordUniqueString(S1.s, line = 0) 
  If Not FindMapElement(map1(), S1)
    noMatch(S1) = line
  EndIf 
EndProcedure    

RandomSeed(0) ;ensure the results are for the same random collection each time, if comparing methods
Define a, t = ElapsedMilliseconds()

;read file 1 and add it to map 1 to keep a collection of unique strings for the file, each string is simulated
For a = 1 To #tests
  map1("some string " + Str(Random(a)))    
Next 

;read file 2 and create a list of only unique strings that aren't in file 1, each string is simulated
For a = 1 To #tests
  recordUniqueString("some string " + Str(Random(a)), a) 
Next  

Define output$ = "Lines in each file: " + Str(#tests) + #LF$
output$ + "Unique strings in file 1 = " + Str(MapSize(map1())) + #LF$
output$ + "Unique strings in file 2  not in file 1 = " + Str(MapSize(noMatch())) + #LF$ 
output$ + "Time: " + Str(ElapsedMilliseconds() - t)
MessageRequester("results",  output$)

;to display the strings which are unique retrieve the map keys of noMatch()
CompilerIf #PB_Compiler_Debugger
  Debug "unique lines in file 2, unsorted"
  ForEach noMatch()
    Debug MapKey(noMatch()) + " on line " + Str(noMatch())
  Next 
   
  ;sorted order
  Structure lineEntries
    s.s ;string
    n.i ;line number
  EndStructure

  NewList results.lineEntries()
  ForEach noMatch()
    AddElement(results()): results()\s = MapKey(noMatch()): results()\n = noMatch()
  Next
  
  SortStructuredList(results(), #PB_Sort_Ascending, OffsetOf(lineEntries\n), #PB_Sort_Integer)
  Debug "unique lines in file 2, sorted by line order"
  ForEach results()
    Debug results()\s + " on line " + Str(results()\n)
  Next 
CompilerEndIf
The main point on this code is that the strings don't need to be stored in the map because they are already stored as the key. If you wanted to stored something in the map then you might store the line number where it occurs. The elements could then be retrieved and placed in a list and sorted by the line# to get the results in order.

Re: Compare String

Posted: Wed May 01, 2013 1:49 pm
by martin66119
Hi,

thank you very much for your help and sorry for my late answer. But a had not time yesterday.
I have tried your proposal but they are not faster or I do not integrate the code not correctly in my code and therefore the proposal wasn´t faster. I suppose that the integration in my code (I have made) was not ok and therefore there is no improvement.

Here is my code a bit modified for testing.

Code: Select all

NewList Listing1.s()
NewList Listing2.s()

NewMap File1.s(1000000)
NewMap File2.s(1000000)
NewMap File3.s(1000000)
NewMap File4.s(1000000)

Count = 500000
For i = 1 To count
  Line.s = "This is the test string to verify the speed of different codes"
  AddMapElement(File1(), Line.s,#PB_Map_ElementCheck) 
  AddElement(Listing1())
  Listing1() = Line.s
Next i

For i = 1 To count
  Line.s = "This is the test string to verify the speed of different codes" + Str(Random(10))
  AddMapElement(File2(), Line.s,#PB_Map_ElementCheck) 
  AddElement(Listing2())
  Listing2() = Line.s
Next i

;-----------------------------------
MessageRequester ("","Start")

            StartTime = ElapsedMilliseconds() 
            
            ;-----------------------------------
            ForEach Listing1()
              If FindMapElement(File2(), Listing1())
            
              Else
                AddMapElement(File3(), Listing1(),#PB_Map_ElementCheck)
              EndIf  
             Next

            ;--------------------------------
           
             ForEach Listing2()            
                If FindMapElement(File1(), Listing2()) 
                Else
                  AddMapElement(File4(), Listing2(),#PB_Map_ElementCheck)
                  EndIf              
               Next                         
            ;---------------------------------            
            
            Time =ElapsedMilliseconds() - StartTime
            CountFile3 = MapSize(File3())
            CountFile4 = MapSize(File4())
            
            MessageRequester("Execution Time", "Time: " + Str(Time) + " ms")
            MessageRequester("File3", "File3: " + Str(CountFile3) + "  File4: " +Str(CountFile4))


Re: Compare String

Posted: Wed May 01, 2013 9:16 pm
by martin66119
Hi again,

in following code I use FindMapElement(....) to find the corresponding strings
but I like to use "comparememorystring" in the relevant loops. I do not know to change the code taht it works using "comparememorystring". Please could you helb me.

Thank you very much in advance

Code: Select all

ewList Listing1.s()
NewList Listing2.s()

NewMap File1.s(1000000)
NewMap File2.s(1000000)
NewMap File3.s(1000000)
NewMap File4.s(1000000)

Count = 500000
For i = 1 To count
  Line.s = "This is the test string to verify the speed of different codes"
  AddMapElement(File1(), Line.s,#PB_Map_ElementCheck)
  AddElement(Listing1())
  Listing1() = Line.s
Next i

For i = 1 To count
  Line.s = "This is the test string to verify the speed of different codes" + Str(Random(10))
  AddMapElement(File2(), Line.s,#PB_Map_ElementCheck)
  AddElement(Listing2())
  Listing2() = Line.s
Next i

;-----------------------------------
MessageRequester ("","Start")

            StartTime = ElapsedMilliseconds()
           
            ;-----------------------------------
            ForEach Listing1()
              If FindMapElement(File2(), Listing1())
           
              Else
                AddMapElement(File3(), Listing1(),#PB_Map_ElementCheck)
              EndIf 
             Next

            ;--------------------------------
           
             ForEach Listing2()           
                If FindMapElement(File1(), Listing2())
                Else
                  AddMapElement(File4(), Listing2(),#PB_Map_ElementCheck)
                  EndIf             
               Next                         
            ;---------------------------------           
           
            Time =ElapsedMilliseconds() - StartTime
            CountFile3 = MapSize(File3())
            CountFile4 = MapSize(File4())
           
            MessageRequester("Execution Time", "Time: " + Str(Time) + " ms")
            MessageRequester("File3", "File3: " + Str(CountFile3) + "  File4: " +Str(CountFile4))