
Cleaning a txt file of dupes ...

Posted: Fri Oct 08, 2004 2:54 pm
by thyr0x1ne
I'm looking for a way to clean a txt file of all the dupes inside.

For example, transform:
../file.txt
amero
ameri
amerp
amero
amerp
amera

into ../file.txt
ameri
amera

so removing amerp / amero entirely (every word that appears more than once)

here's part of the code I wrote:

Code: Select all

Procedure CleanText()

  totalstring.l = CountString(textdata, ".") + 1
  ReadFile(1, "file.txt")

  While Eof(1) = 0
    text$ = text$ + ReadString()

    For K = 1 To totalstring.l
      If FindString(text$, StringField(textdata, K, "."), 1) = 1 : Break : EndIf
      If FindString(text$, StringField(textdata, K, "."), 1) = 0
        UseFile(1)
        WriteStringN(StringField(textdata, K, "."))
      EndIf
    Next K

  Wend
EndProcedure
I was hoping to add each textdata field (parsed with "." as the separator) to file.txt ONLY if FindString returns 0 (meaning that field is not yet in file.txt), but for now everything gets written, even dupe words ...

Does anyone have a hint (lib or routine) to clean a text file like this, or to stop the write of a string when a line in the txt already contains it?

thks for any help

Posted: Fri Oct 08, 2004 4:11 pm
by wilbert
I would break the text up into an array of words, sort the array, and walk through the sorted array. If you compare each word with the previous one and the next one and neither matches, you have a unique word that you can write to a new text file.
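
A rough sketch of that idea (the data and the "." separator come from the example above; SortArray here uses newer PureBasic syntax, and the variable names are just for illustration):

Code: Select all

; sort-and-compare sketch: duplicates end up adjacent after sorting,
; so a word is unique when it matches neither sorted neighbour
fileData$ = "amero.ameri.amerp.amero.amerp.amera"

count = CountString(fileData$, ".") + 1
Dim word.s(count - 1)

For i = 1 To count
  word(i - 1) = StringField(fileData$, i, ".")
Next

SortArray(word(), #PB_Sort_Ascending)

For i = 0 To count - 1
  same = #False
  If i > 0
    If word(i) = word(i - 1) : same = #True : EndIf
  EndIf
  If i < count - 1
    If word(i) = word(i + 1) : same = #True : EndIf
  EndIf
  If same = #False
    Debug word(i)   ; amera, ameri
  EndIf
Next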

Posted: Fri Oct 08, 2004 8:58 pm
by thyr0x1ne
damn :/ I can't find the right way ... I used an array as you said, but for large files the exe I made freezes like hell :/

I need a good sleep I think; I'll lose some more neurons on it tomorrow :)

btw thks for your reply

Posted: Fri Oct 08, 2004 11:02 pm
by Beach
I'm assuming your data is separated by periods since your code has it listed as such. Take a look at the following code and see if it could work for you.

Code: Select all

;FileHnd = ReadFile(#PB_Any,"file.txt")
;While Eof(FileHnd) = #False : fileData$ + ReadString() : Wend
;CloseFile(FileHnd)

fileData$ = "amero.ameri."
fileData$ + "amerp.amero."
fileData$ + "amerp.amera"

TotalItems = CountString(fileData$,".")
Dim ItemData.s(TotalItems)

For i = 1 To TotalItems+1
  ItemData(i-1) = StringField(fileData$,i,".")
Next

dup.b   ; flag: set to #True when the current item appears elsewhere

For i = 0 To TotalItems
  CurrentItem$ = ItemData(i)
  ; scan the whole array for another copy of the current item
  For x = 0 To TotalItems
    If CurrentItem$ = ItemData(x) And x <> i
      dup = #True
    EndIf
  Next x
  
  ; only items that appear exactly once are kept, as in the example above
  If dup = #False
    newdata$ = newdata$ + CurrentItem$ + "."
  EndIf
  
  dup = #False
Next i

Debug newdata$
EDIT:
After reading your post again, I now see that you want to prevent an item from being written to the text file if it already exists in the file. I'm not sure how big your file is, but I would think this method would not be very efficient. I would consider using a database for this; SQLite would work very well in this situation.
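
Something along these lines with the SQLite support built into newer PureBasic versions (it wasn't there back in 2004; the table and column names are made up for the sketch, and no quote escaping is handled):

Code: Select all

UseSQLiteDatabase()

db = OpenDatabase(#PB_Any, ":memory:", "", "")
DatabaseUpdate(db, "CREATE TABLE words (word TEXT UNIQUE)")

fileData$ = "amero.ameri.amerp.amero.amerp.amera"

; a UNIQUE column makes the database reject duplicates for us
For i = 1 To CountString(fileData$, ".") + 1
  word$ = StringField(fileData$, i, ".")
  DatabaseUpdate(db, "INSERT OR IGNORE INTO words VALUES ('" + word$ + "')")
Next

If DatabaseQuery(db, "SELECT word FROM words")
  While NextDatabaseRow(db)
    Debug GetDatabaseString(db, 0)   ; each word exactly once
  Wend
  FinishDatabaseQuery(db)
EndIf

CloseDatabase(db)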

Posted: Sat Oct 09, 2004 6:46 pm
by thyr0x1ne
I decided to make a procedure that removes dupe lines *after* the file has been created; here's working code, though of course there must be some limitations (the size of the text file, certainly ...)

Code: Select all

Procedure RemDupes()

 OpenFile(1,"range.dic") 

 While Eof(1) = 0
   text$=text$+Trim(ReadString())+Chr(42)
 Wend  
   
 textstring.l=CountString(text$,Chr(42))+1   ; Chr(42) = "*", the in-memory separator
 
 For i=1 To textstring.l
  a$=StringField(text$, i, Chr(42))
  ; keep a word only the first time it is seen; checking clean$ instead of
  ; shrinking text$ keeps the field positions stable while looping
  If Len(a$) > 0 And FindString(Chr(42)+clean$, Chr(42)+a$+Chr(42), 1) = 0
    clean$=clean$+a$+Chr(42)
  EndIf
 Next
 
 cleanstring.l=CountString(clean$,Chr(42))+1
 
 CloseFile(1)
 
 CopyFile("range.dic", "range.dic.BAK") 
 DeleteFile("range.dic")
 OpenFile(1,"range.dic")
 
 For j=1 To cleanstring.l
  If Len(StringField(clean$, j, Chr(42))) > 0
   WriteStringN(StringField(clean$, j, Chr(42)))
  EndIf 
 Next
 
 CloseFile(1)

EndProcedure

RemDupes()

End
By the way, ty for your hint; I saved it with the millions of other bits of source code I keep :)
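
For much bigger files, a single pass with a map would avoid the repeated string scans entirely. A sketch in newer PureBasic syntax (NewMap and per-file numbers did not exist in the 2004 release; the output file name is made up):

Code: Select all

NewMap seen.b()

If ReadFile(0, "range.dic")
  If CreateFile(1, "range.clean.dic")
    While Eof(0) = 0
      line$ = Trim(ReadString(0))
      ; write each line only the first time it is seen
      If line$ <> "" And FindMapElement(seen(), line$) = 0
        seen(line$) = #True
        WriteStringN(1, line$)
      EndIf
    Wend
    CloseFile(1)
  EndIf
  CloseFile(0)
EndIf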