Remove all duplicate lines from file

JaxMusic
User
Posts: 20
Joined: Sun Feb 14, 2021 2:55 am
Location: Max Meadows, Virginia, USA

Re: Remove all duplicate lines from file

Post by JaxMusic »

Why not drop it into a database table? You can easily create a temp table, insert the data, and retrieve the unique list, sorted if you wish:

Code: Select all

Select distinct name from tmpname order by name
With PureBasic's DB library, this should be far fewer lines of code, and it is much clearer what is being done. Of course, it never hurts to look at the tools PB provides, but databases are made to solve problems like these.
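A minimal sketch of that idea might look like the code below — assuming the names are in "data.txt", using PureBasic's built-in SQLite driver with an in-memory database, and keeping the tmpname table from above (those names are just my choice):

Code: Select all

UseSQLiteDatabase()

If OpenDatabase(0, ":memory:", "", "")
  DatabaseUpdate(0, "CREATE TABLE tmpname (name TEXT)")

  If ReadFile(1, "data.txt")
    While Not Eof(1)
      SetDatabaseString(0, 0, ReadString(1))   ; bound parameter, avoids quoting problems
      DatabaseUpdate(0, "INSERT INTO tmpname (name) VALUES (?)")
    Wend
    CloseFile(1)
  EndIf

  ; DISTINCT only removes the repeats; to drop duplicated names entirely,
  ; something like "GROUP BY name HAVING COUNT(*) = 1" could be used instead.
  If DatabaseQuery(0, "SELECT DISTINCT name FROM tmpname ORDER BY name")
    While NextDatabaseRow(0)
      Debug GetDatabaseString(0, 0)
    Wend
    FinishDatabaseQuery(0)
  EndIf

  CloseDatabase(0)
EndIf
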
Paul
PureBasic Expert
Posts: 1282
Joined: Fri Apr 25, 2003 4:34 pm
Location: Canada

Re: Remove all duplicate lines from file

Post by Paul »

Remember.... the original post wanted the results to show:
Tom
Barbara
Tim
Antonia
Many examples posted here have "Frederic" and "Antonio" in the results, which would be incorrect.
According to the original post, if a name is duplicated then it must not appear in the list at all.


Here's an example using Lists (assuming the data is in a file called data.txt)

Code: Select all

NewList dat.s()

hFile=ReadFile(#PB_Any,"data.txt")
If hFile
  While Eof(hFile)=0                  ; read every line of the file into the list
    AddElement(dat())
    dat()=ReadString(hFile)
  Wend
  CloseFile(hFile)

  ResetList(dat())
  While NextElement(dat())
    *old=@dat()                       ; remember the current element
    cur$=dat()
    found=0
    ForEach dat()                     ; count how many times this value occurs
      If dat()=cur$
        found+1
      EndIf
    Next
    If found>1                        ; duplicated: remove every occurrence
      ForEach dat()
        If dat()=cur$
          DeleteElement(dat())
        EndIf
      Next
    EndIf
    ChangeCurrentElement(dat(), *old) ; restore the position saved above
  Wend
EndIf

ForEach dat()
  Debug dat()
Next
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Remove all duplicate lines from file

Post by Demivec »

@Paul: Your code solution has an error in its implementation.
You record the address of the current list element in *old and its contents in cur$, then you search the entire list and count the elements whose contents match cur$. If there is more than one, you go through the entire list again and delete every element that matches cur$, including the original one whose address you saved in *old. You then change the current list element to the one pointed to by *old. If the element you just finished checking was duplicated, *old no longer points to a valid element, because you deleted it.
One possible correction is to keep track of the previous element instead of the current element.
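A rough sketch of that correction, reusing the list and names from your code (untested — the key point is that *prev always points at an element that survived its own check, so it can never be deleted later):

Code: Select all

ResetList(dat())
*prev = 0                               ; last element known to be unique (0 = none yet)
While NextElement(dat())
  *cur = @dat()
  cur$ = dat()
  found = 0
  ForEach dat()                         ; ForEach discards the current position, hence the pointers
    If dat() = cur$
      found + 1
    EndIf
  Next
  If found > 1
    ForEach dat()                       ; remove every occurrence, including *cur itself
      If dat() = cur$
        DeleteElement(dat())
      EndIf
    Next
    If *prev
      ChangeCurrentElement(dat(), *prev)  ; resume after the last surviving element
    Else
      ResetList(dat())                    ; the removed element was at the head of the list
    EndIf
  Else
    ChangeCurrentElement(dat(), *cur)     ; unique: restore it and remember it as *prev
    *prev = *cur
  EndIf
Wend
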
Paul
PureBasic Expert
Posts: 1282
Joined: Fri Apr 25, 2003 4:34 pm
Location: Canada

Re: Remove all duplicate lines from file

Post by Paul »

Demivec wrote: @Paul: Your code solution has an error in its implementation.
Ok, so provide a list of data which causes this to fail :)
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Remove all duplicate lines from file

Post by Demivec »

@Paul: Well, I have to admit I could not come up with a list of data that causes it to fail. It does not fail ... yet.

@Edit: removed documentation of flawed code
Last edited by Demivec on Tue Feb 23, 2021 6:27 am, edited 1 time in total.
Paul
PureBasic Expert
Posts: 1282
Joined: Fri Apr 25, 2003 4:34 pm
Location: Canada

Re: Remove all duplicate lines from file

Post by Paul »

So key takeaway...
@Paul: Well I have to admit I could not come up with a list of data that causes it to fail. It does not fail...
:P
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Remove all duplicate lines from file

Post by Keya »

Depending on how large the file was, I would store, say, n-bit hashes for each line. If it was a tiny file I'd use 8-bit hashes (although you don't really need them); if it was a large file I'd use 24- or 32-bit hashes.
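Just as a sketch of that idea — using PureBasic's CRC32 fingerprint as the 32-bit per-line hash, two passes over an assumed "data.txt", and ignoring the (unlikely) possibility of hash collisions, which a real version would have to double-check:

Code: Select all

UseCRC32Fingerprint()

NewMap seen.i()                     ; CRC32 of the line -> number of lines with that hash

If ReadFile(0, "data.txt")
  While Not Eof(0)                  ; first pass: count the hash of every line
    seen(StringFingerprint(ReadString(0), #PB_Cipher_CRC32)) + 1
  Wend

  FileSeek(0, 0)
  While Not Eof(0)                  ; second pass: keep only lines whose hash occurred once
    line$ = ReadString(0)
    If seen(StringFingerprint(line$, #PB_Cipher_CRC32)) = 1
      Debug line$
    EndIf
  Wend
  CloseFile(0)
EndIf
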
kenmo
Addict
Posts: 2033
Joined: Tue Dec 23, 2003 3:54 am

Re: Remove all duplicate lines from file

Post by kenmo »

Here's my contribution :lol:

If the names are in a file, just do two passes over the file:

Code: Select all

NewMap Count.i()
ReadFile(0, "names.txt")
  While Not Eof(0)
    Name.s = ReadString(0)
    Count(Name) + 1
  Wend
  
  FileSeek(0, 0)
  While Not Eof(0)
    Name.s = ReadString(0)
    If Count(Name) = 1
      Debug Name
    EndIf
  Wend
CloseFile(0)

Of course, this quick example doesn't handle file errors, blank lines, or case-insensitive matching. Easy to add.
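One way those additions might look — skipping blank lines with Trim(), comparing case-insensitively with LCase(), and checking the ReadFile() result (all of that is just one possible choice):

Code: Select all

NewMap Count.i()

If ReadFile(0, "names.txt")
  While Not Eof(0)
    Name.s = Trim(ReadString(0))
    If Name <> ""                   ; skip blank lines
      Count(LCase(Name)) + 1        ; count case-insensitively
    EndIf
  Wend

  FileSeek(0, 0)
  While Not Eof(0)
    Name.s = Trim(ReadString(0))
    If Name <> "" And Count(LCase(Name)) = 1
      Debug Name
    EndIf
  Wend
  CloseFile(0)
Else
  Debug "Could not open names.txt"  ; rudimentary error handling
EndIf
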