
Re: Remove all duplicate lines from file

Posted: Sun Feb 21, 2021 4:09 pm
by JaxMusic
Why not drop it into a database table? You can easily create a temp table, insert the data, and retrieve the unique list, sorted if you wish:

Code: Select all

SELECT DISTINCT name FROM tmpname ORDER BY name

With PureBasic's database library this should take far fewer lines of code, and it is clearer what is being done. Of course, it never hurts to look at the tools PB provides, but databases are made to solve problems like these.
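
Untested, but with the SQLite plugin the whole idea might look something like this sketch (the table name "tmpname" and the file name "data.txt" are just placeholders):

Code: Select all

UseSQLiteDatabase()

db = OpenDatabase(#PB_Any, ":memory:", "", "")
If db
  DatabaseUpdate(db, "CREATE TABLE tmpname (name TEXT)")

  ; load every line of the file into the temp table
  If ReadFile(0, "data.txt")
    While Not Eof(0)
      SetDatabaseString(db, 0, ReadString(0))
      DatabaseUpdate(db, "INSERT INTO tmpname (name) VALUES (?)")
    Wend
    CloseFile(0)
  EndIf

  ; DISTINCT keeps one copy of each name; to drop duplicated names
  ; entirely (as the original post asked), query instead:
  ; "SELECT name FROM tmpname GROUP BY name HAVING COUNT(*) = 1"
  If DatabaseQuery(db, "SELECT DISTINCT name FROM tmpname ORDER BY name")
    While NextDatabaseRow(db)
      Debug GetDatabaseString(db, 0)
    Wend
    FinishDatabaseQuery(db)
  EndIf

  CloseDatabase(db)
EndIf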

Re: Remove all duplicate lines from file

Posted: Sun Feb 21, 2021 7:29 pm
by Paul
Remember.... the original post wanted the results to show:
Tom
Barbara
Tim
Antonia
Many examples posted here have "Frederic" and "Antonio" in the results, which would be incorrect.
According to the original post, if there is a duplicate name then that name must not appear in the list at all.


Here's an example using Lists (assuming the data is in a file called data.txt)

Code: Select all

NewList dat.s()

; read every line of data.txt into the list
hFile=ReadFile(#PB_Any,"data.txt")
If hFile
  While Eof(hFile)=0
    AddElement(dat())
    dat()=ReadString(hFile)
  Wend
  CloseFile(hFile)  
  
  ; for each name, count how often it occurs in the whole list and,
  ; if it occurs more than once, delete every copy
  ResetList(dat())
  While NextElement(dat())
    *old=@dat()  ; remember the current element so the position can be restored
    cur$=dat()
    found=0
    ForEach dat()
      If dat()=cur$
        found+1
      EndIf
    Next
    If found>1
      ForEach dat()
        If dat()=cur$
          DeleteElement(dat())
        EndIf
      Next
    EndIf
    ChangeCurrentElement(dat(), *old)  ; restore the position ForEach moved away from
  Wend
EndIf

ForEach dat()
  Debug dat()
Next

Re: Remove all duplicate lines from file

Posted: Mon Feb 22, 2021 1:14 am
by Demivec
@Paul: Your code solution has an error in its implementation.
You record the address of the current list element in *old and its contents in cur$, then you search the entire list to count the elements whose contents match cur$. If there is more than one, you go through the entire list again and delete all the elements that match cur$, including the original one whose address you saved in *old. You then change the current list element to be the one pointed to by *old. But if you have just finished checking an element that was duplicated, *old no longer points to a valid element, because you deleted it.
One possible correction is to keep track of the previous element instead of the current element.
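
For instance (untested), Paul's middle loop could save the predecessor before counting. An element earlier in the list can never match cur$ (all copies of an earlier duplicate would already have been removed), so that predecessor is guaranteed to survive the deletion pass:

Code: Select all

  ResetList(dat())
  While NextElement(dat())
    *old=@dat()
    cur$=dat()

    ; remember the element *before* the current one; it cannot match
    ; cur$, so it survives even if every copy of cur$ is deleted
    *prev=0
    If PreviousElement(dat())
      *prev=@dat()
    EndIf
    ChangeCurrentElement(dat(), *old)

    found=0
    ForEach dat()
      If dat()=cur$
        found+1
      EndIf
    Next

    If found>1
      ForEach dat()
        If dat()=cur$
          DeleteElement(dat())
        EndIf
      Next
      ; *old was just deleted; resume from the surviving predecessor
      ; (or from the head of the list if the first element matched)
      If *prev
        ChangeCurrentElement(dat(), *prev)
      Else
        ResetList(dat())
      EndIf
    Else
      ChangeCurrentElement(dat(), *old)  ; nothing deleted, *old is still valid
    EndIf
  Wend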

Re: Remove all duplicate lines from file

Posted: Mon Feb 22, 2021 4:37 am
by Paul
Demivec wrote:@Paul: Your code solution has an error in its implementation.
Ok, so provide a list of data which causes this to fail :)

Re: Remove all duplicate lines from file

Posted: Mon Feb 22, 2021 11:43 pm
by Demivec
@Paul: Well I have to admit I could not come up with a list of data that causes it to fail. It does not fail ... yet.

@Edit: removed documentation of flawed code

Re: Remove all duplicate lines from file

Posted: Tue Feb 23, 2021 2:37 am
by Paul
So key takeaway...
@Paul: Well I have to admit I could not come up with a list of data that causes it to fail. It does not fail...
:P

Re: Remove all duplicate lines from file

Posted: Wed Feb 24, 2021 6:26 am
by Keya
Depending on how large the file is, I would store, say, n-bit hashes for each line. For a tiny file I'd use 8-bit hashes (although at that size you don't really need them), and for a large file I'd use 24- or 32-bit hashes.
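
As a rough, untested sketch of the 32-bit case ("data.txt" is just a placeholder): a map keyed on a CRC32 fingerprint stores one small key per distinct line instead of the full text, at the cost that a hash collision could wrongly conflate two different lines. For the narrower 8- or 24-bit variants you would mask the CRC down.

Code: Select all

UseCRC32Fingerprint()

NewMap count.i()  ; key = CRC32 of the line (hex string), value = occurrence count

If ReadFile(0, "data.txt")
  ; first pass: count how often each line's hash occurs
  While Not Eof(0)
    line$ = ReadString(0)
    count(CRC32Fingerprint(@line$, StringByteLength(line$))) + 1
  Wend

  ; second pass: rewind and print the lines whose hash occurred exactly once
  FileSeek(0, 0)
  While Not Eof(0)
    line$ = ReadString(0)
    If count(CRC32Fingerprint(@line$, StringByteLength(line$))) = 1
      Debug line$
    EndIf
  Wend
  CloseFile(0)
EndIf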

Re: Remove all duplicate lines from file

Posted: Wed Feb 24, 2021 1:29 pm
by kenmo
Here's my contribution :lol:

If the names are in a file, just do two passes over the file:

Code: Select all

NewMap Count.i()

; first pass: count how many times each name occurs
ReadFile(0, "names.txt")
  While Not Eof(0)
    Name.s = ReadString(0)
    Count(Name) + 1
  Wend
  
  ; second pass: rewind and print only the names that occur exactly once
  FileSeek(0, 0)
  While Not Eof(0)
    Name.s = ReadString(0)
    If Count(Name) = 1
      Debug Name
    EndIf
  Wend
CloseFile(0)

Of course, this quick example doesn't handle file errors, blank lines, or case-insensitivity. Those are easy to add.