Remove all duplicate lines from file

JaxMusic
User
Posts: 20
Joined: Sun Feb 14, 2021 2:55 am
Location: Max Meadows, Virginia, USA

Re: Remove all duplicate lines from file

Post by JaxMusic »

Why not drop it into a database table? You can easily create a temp table, insert the data, and retrieve the unique list, sorted if you wish:

Code: Select all

Select distinct name from tmpname order by name
With PureBasic's DB library, this should be far fewer lines of code, and it is much clearer what is being done. Of course, it never hurts to look at the tools PB provides, but databases are made to solve problems like these.
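A minimal sketch of that idea might look like the code below — assuming the names are in "data.txt", using PureBasic's built-in SQLite driver with an in-memory database, and keeping the tmpname table from above (those names are just my choice):

Code: Select all

UseSQLiteDatabase()

If OpenDatabase(0, ":memory:", "", "")
  DatabaseUpdate(0, "CREATE TABLE tmpname (name TEXT)")

  If ReadFile(1, "data.txt")
    While Not Eof(1)
      SetDatabaseString(0, 0, ReadString(1))   ; bound parameter, avoids quoting problems
      DatabaseUpdate(0, "INSERT INTO tmpname (name) VALUES (?)")
    Wend
    CloseFile(1)
  EndIf

  ; DISTINCT only removes the repeats; to drop duplicated names entirely,
  ; something like "GROUP BY name HAVING COUNT(*) = 1" could be used instead.
  If DatabaseQuery(0, "SELECT DISTINCT name FROM tmpname ORDER BY name")
    While NextDatabaseRow(0)
      Debug GetDatabaseString(0, 0)
    Wend
    FinishDatabaseQuery(0)
  EndIf

  CloseDatabase(0)
EndIf
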
Paul
PureBasic Expert
Posts: 1282
Joined: Fri Apr 25, 2003 4:34 pm
Location: Canada

Re: Remove all duplicate lines from file

Post by Paul »

Remember.... the original post wanted the results to show:
Tom
Barbara
Tim
Antonia
Many examples posted here have "Frederic" and "Antonio" in the results, which would be incorrect.
According to the original post, if a name is duplicated then it must not appear in the list at all.


Here's an example using Lists (assuming the data is in a file called data.txt)

Code: Select all

NewList dat.s()

hFile=ReadFile(#PB_Any,"data.txt")
If hFile
  While Eof(hFile)=0                  ; read every line of the file into the list
    AddElement(dat())
    dat()=ReadString(hFile)
  Wend
  CloseFile(hFile)

  ResetList(dat())
  While NextElement(dat())
    *old=@dat()                       ; remember the current element
    cur$=dat()
    found=0
    ForEach dat()                     ; count how many times this value occurs
      If dat()=cur$
        found+1
      EndIf
    Next
    If found>1                        ; duplicated: remove every occurrence
      ForEach dat()
        If dat()=cur$
          DeleteElement(dat())
        EndIf
      Next
    EndIf
    ChangeCurrentElement(dat(), *old) ; restore the position saved above
  Wend
EndIf

ForEach dat()
  Debug dat()
Next
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Remove all duplicate lines from file

Post by Demivec »

@Paul: Your code solution has an error in its implementation.
You record the address of the current list element in *old and its contents in cur$, then you search the entire list and count the elements whose contents match cur$. If there is more than one, you go through the entire list again and delete every element that matches cur$, including the original one whose address you saved in *old. You then change the current list element to the one pointed to by *old. If the element you just finished checking was duplicated, *old no longer points to a valid element, because you deleted it.
One possible correction is to keep track of the previous element instead of the current element.
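A rough sketch of that correction, reusing the list and names from your code (untested — the key point is that *prev always points at an element that survived its own check, so it can never be deleted later):

Code: Select all

ResetList(dat())
*prev = 0                               ; last element known to be unique (0 = none yet)
While NextElement(dat())
  *cur = @dat()
  cur$ = dat()
  found = 0
  ForEach dat()                         ; ForEach discards the current position, hence the pointers
    If dat() = cur$
      found + 1
    EndIf
  Next
  If found > 1
    ForEach dat()                       ; remove every occurrence, including *cur itself
      If dat() = cur$
        DeleteElement(dat())
      EndIf
    Next
    If *prev
      ChangeCurrentElement(dat(), *prev)  ; resume after the last surviving element
    Else
      ResetList(dat())                    ; the removed element was at the head of the list
    EndIf
  Else
    ChangeCurrentElement(dat(), *cur)     ; unique: restore it and remember it as *prev
    *prev = *cur
  EndIf
Wend
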
Paul
PureBasic Expert
Posts: 1282
Joined: Fri Apr 25, 2003 4:34 pm
Location: Canada

Re: Remove all duplicate lines from file

Post by Paul »

Demivec wrote: @Paul: Your code solution has an error in its implementation.
Ok, so provide a list of data which causes this to fail :)
Demivec
Addict
Posts: 4260
Joined: Mon Jul 25, 2005 3:51 pm
Location: Utah, USA

Re: Remove all duplicate lines from file

Post by Demivec »

@Paul: Well, I have to admit I could not come up with a list of data that causes it to fail. It does not fail ... yet.

@Edit: removed documentation of flawed code
Last edited by Demivec on Tue Feb 23, 2021 6:27 am, edited 1 time in total.
Paul
PureBasic Expert
Posts: 1282
Joined: Fri Apr 25, 2003 4:34 pm
Location: Canada

Re: Remove all duplicate lines from file

Post by Paul »

So key takeaway...
@Paul: Well I have to admit I could not come up with a list of data that causes it to fail. It does not fail...
:P
Keya
Addict
Posts: 1890
Joined: Thu Jun 04, 2015 7:10 am

Re: Remove all duplicate lines from file

Post by Keya »

Depending on how large the file was, I would store, say, n-bit hashes for each line. If it was a tiny file I'd use 8-bit hashes (although you don't really need them); if it was a large file I'd use 24- or 32-bit hashes.
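Just as a sketch of that idea — using PureBasic's CRC32 fingerprint as the 32-bit per-line hash, two passes over an assumed "data.txt", and ignoring the (unlikely) possibility of hash collisions, which a real version would have to double-check:

Code: Select all

UseCRC32Fingerprint()

NewMap seen.i()                     ; CRC32 of the line -> number of lines with that hash

If ReadFile(0, "data.txt")
  While Not Eof(0)                  ; first pass: count the hash of every line
    seen(StringFingerprint(ReadString(0), #PB_Cipher_CRC32)) + 1
  Wend

  FileSeek(0, 0)
  While Not Eof(0)                  ; second pass: keep only lines whose hash occurred once
    line$ = ReadString(0)
    If seen(StringFingerprint(line$, #PB_Cipher_CRC32)) = 1
      Debug line$
    EndIf
  Wend
  CloseFile(0)
EndIf
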
kenmo
Addict
Posts: 2033
Joined: Tue Dec 23, 2003 3:54 am

Re: Remove all duplicate lines from file

Post by kenmo »

Here's my contribution :lol:

If the names are in a file, just do two passes over the file:

Code: Select all

NewMap Count.i()
ReadFile(0, "names.txt")
  While Not Eof(0)
    Name.s = ReadString(0)
    Count(Name) + 1
  Wend
  
  FileSeek(0, 0)
  While Not Eof(0)
    Name.s = ReadString(0)
    If Count(Name) = 1
      Debug Name
    EndIf
  Wend
CloseFile(0)

Of course, this quick example doesn't handle file errors, blank lines, or case-insensitive matching. Easy to add.
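One way those additions might look — skipping blank lines with Trim(), comparing case-insensitively with LCase(), and checking the ReadFile() result (all of that is just one possible choice):

Code: Select all

NewMap Count.i()

If ReadFile(0, "names.txt")
  While Not Eof(0)
    Name.s = Trim(ReadString(0))
    If Name <> ""                   ; skip blank lines
      Count(LCase(Name)) + 1        ; count case-insensitively
    EndIf
  Wend

  FileSeek(0, 0)
  While Not Eof(0)
    Name.s = Trim(ReadString(0))
    If Name <> "" And Count(LCase(Name)) = 1
      Debug Name
    EndIf
  Wend
  CloseFile(0)
Else
  Debug "Could not open names.txt"  ; rudimentary error handling
EndIf
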