Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 7:42 pm
by StarWarsFan
I am trying to work out how to solve this. Let us assume I got a simple text file that shows

Code: Select all

Tom
Barbara
Tim
Antonio
Frederic
Frederic
Frederic
Antonia
Antonio
What would be the most elegant way to remove all duplicate lines so the result shows

Code: Select all

Tom
Barbara
Tim
Antonia
so 'Frederic' 3 times removed and 'Antonip' 2 times removed.

Or is there a module available that does that already?

Greetings to all here!

Re: Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 7:56 pm
by Saki
Hi, look for "Map" in the Handbook

Re: Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 7:59 pm
by chi
If you don't care about the order of the names, you could use a Map

Re: Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 8:02 pm
by Mr.L

Code: Select all

Global NewMap Name.b()

Procedure AddName(name.s)
	If FindMapElement(Name(), name) = 0
		Name(name) = 1
	EndIf
EndProcedure

AddName("Tom")
AddName("Barbara")
AddName("Tim")
AddName("Antonio")
AddName("Frederic")
AddName("Frederic")
AddName("Frederic")
AddName("Antonia")
AddName("Antonio")

ForEach Name()
	Debug MapKey(Name())
Next
As chi stated, Maps are unordered but there are ways to solve that.
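
One possible way to solve the ordering, as a minimal sketch only: pair the map with a list, so the map answers "seen before?" and the list remembers the order of first appearance. The names AddUnique, seen() and ordered() are made up for this illustration. Like the code above, it keeps one copy of each name.

```purebasic
EnableExplicit

; Hypothetical helper, not taken from the posts in this thread:
; the map is only used as a "seen before?" lookup, the list keeps the order.
Procedure AddUnique(Map seen.b(), List ordered.s(), name.s)
  If FindMapElement(seen(), name) = 0
    seen(name) = 1          ; mark the name as seen
    AddElement(ordered())   ; current element is always the last one, so this appends
    ordered() = name
  EndIf
EndProcedure

NewMap seen.b()
NewList ordered.s()

; Sample data from the first post
AddUnique(seen(), ordered(), "Tom")
AddUnique(seen(), ordered(), "Barbara")
AddUnique(seen(), ordered(), "Tim")
AddUnique(seen(), ordered(), "Antonio")
AddUnique(seen(), ordered(), "Frederic")
AddUnique(seen(), ordered(), "Frederic")
AddUnique(seen(), ordered(), "Frederic")
AddUnique(seen(), ordered(), "Antonia")
AddUnique(seen(), ordered(), "Antonio")

; Shows Tom, Barbara, Tim, Antonio, Frederic, Antonia (first-seen order)
ForEach ordered()
  Debug ordered()
Next
```

The map lookup avoids rescanning the list for every new name, at the cost of holding each name twice in memory.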

Re: Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 8:10 pm
by #NULL

Code: Select all

NewMap names.i()

names("Tom") + 1
names("Barbara") + 1
names("Tim") + 1
names("Antonio") + 1
names("Frederic") + 1
names("Frederic") + 1
names("Frederic") + 1
names("Antonia") + 1
names("Antonio") + 1

ForEach names()
  If names() > 1
    DeleteMapElement(names())
  EndIf
Next

ForEach names()
  Debug MapKey(names())
Next

;   Barbara
;   Tom
;   Tim
;   Antonia

Re: Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 8:20 pm
by Kiffi

Code: Select all

Global NewMap Name.s()

Procedure AddName(Name.s)
  Name(Name) = Name
EndProcedure

AddName("Tom")
AddName("Barbara")
AddName("Tim")
AddName("Antonio")
AddName("Frederic")
AddName("Frederic")
AddName("Frederic")
AddName("Antonia")
AddName("Antonio")

ForEach Name()
  Debug Name()
Next

Re: Remove all duplicate lines from file

Posted: Thu Feb 18, 2021 8:33 pm
by AZJIO

Code: Select all

EnableExplicit


Declare SplitL(String.s, List StringList.s(), Separator.s = " ")

Define Text$

Text$ = GetClipboardText()


Global NewList aText.s()
SplitL(Text$, aText(), #LF$)

; ForEach aText()
; 	Debug aText()
; Next

NewMap uni.s()

ForEach aText()
	AddMapElement(uni(), aText(),  #PB_Map_ElementCheck)
Next

ForEach uni()
    Debug MapKey(uni())
Next



; wilbert
; https://www.purebasic.fr/english/viewtopic.php?p=486382#p486382
Procedure SplitL(String.s, List StringList.s(), Separator.s = " ")
 
  Protected S.String, *S.Integer = @S
  Protected.i p, slen
  slen = Len(Separator)
  ClearList(StringList())
 
  *S\i = @String
  Repeat
    AddElement(StringList())
    p = FindString(S\s, Separator)
    StringList() = PeekS(*S\i, p - 1)
    *S\i + (p + slen - 1) << #PB_Compiler_Unicode
  Until p = 0
  *S\i = 0
 
EndProcedure

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 2:56 am
by BarryG
[Deleted, the request was a bit misleading in its wording]

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 5:31 am
by TI-994A
StarWarsFan wrote:...assume I got a simple text file that shows

Code: Select all

Tom
Barbara
Tim
Antonio
Frederic
Frederic
Frederic
Antonia
Antonio
...most elegant way to remove all duplicate lines so the result shows

Code: Select all

Tom
Barbara
Tim
Antonia
Here's one way:

Code: Select all

Define.s record, records, duplicate

; === create sample list from data ===

For i = 0 To 8
  Read.s record
  records + record + ","
Next i

DataSection
  Data.s "Tom", "Barbara", "Tim", "Antonio", "Frederic"
  Data.s "Frederic", "Frederic", "Antonia", "Antonio"
EndDataSection

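; === expunge every record that occurs more than once (no sorting is done) ===
; note: CountString() counts substring matches, so a record contained
; in a longer record (e.g. "Tim" in "Timothy") is counted as well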

For i = 1 To CountString(records, ",") + 1
  duplicate = StringField(records, i, ",")
  If CountString(records, duplicate) > 1
    records = RemoveString(records, duplicate + ",")      
  EndIf    
Next i  

; === display trimmed results ===

For i = 1 To CountString(records, ",") + 1
  Debug StringField(records, i, ",")
Next i
And the same approach, for data read from file:

Code: Select all

Define.s record, records, duplicate

; === read records from file ===

If ReadFile(0, "duplicates.txt")
  
  While Not Eof(0)
    records + ReadString(0) + ","
  Wend
  CloseFile(0)  
  
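  ; === expunge every record that occurs more than once (no sorting is done) ===
  ; note: CountString() counts substring matches, so a record contained
  ; in a longer record (e.g. "Tim" in "Timothy") is counted as well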
  
  For i = 1 To CountString(records, ",") + 1
    duplicate = StringField(records, i, ",")
    If CountString(records, duplicate) > 1
      records = RemoveString(records, duplicate + ",")      
    EndIf    
  Next i  
  
  ; === display trimmed results ===
    
  For i = 1 To CountString(records, ",") + 1
    Debug StringField(records, i, ",")
  Next i    
  
EndIf

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 5:36 am
by BarryG
TI-994A, yours has a bug: it doesn't keep Frederic in the list.

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 6:06 am
by TI-994A
BarryG wrote:TI-994A, yours has a bug: it doesn't keep Frederic in the list.
That's the intention of the OP, I believe.
StarWarsFan wrote:What would be the most elegant way to remove all duplicate lines so the result shows

Code: Select all

Tom
Barbara
Tim
Antonia
so 'Frederic' 3 times removed and 'Antonip' (should be Antonio) 2 times removed.

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 7:08 am
by BarryG
Oh, I didn't notice that - sorry! I thought the OP actually wanted 1 x Frederic in the list because he said "no duplicates" - having one of something isn't a duplicate.

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 8:16 am
by TI-994A
BarryG wrote:...I thought the OP actually wanted 1 x Frederic in the list because he said "no duplicates" - having one of something isn't a duplicate.
My sentiments exactly. Especially with the Antonia/Antonio confusion. :lol:
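
To make the two readings of the request concrete, here is a rough sketch that prints both results from the same sample data. It is not anyone's posted solution: the names lines(), count() and the SampleNames label are invented for the example, and an empty string is assumed as the end-of-data marker.

```purebasic
EnableExplicit

NewList lines.s()   ; all lines in their original order
NewMap count.i()    ; how often each line occurs
Define text.s

; Load the sample data and count every line
Restore SampleNames
Repeat
  Read.s text
  If text = "" : Break : EndIf   ; empty string = end marker (assumption of this sketch)
  AddElement(lines())
  lines() = text
  count(text) + 1
ForEver

; Reading A (what the OP's expected result shows):
; drop every line that occurs more than once.
; Shows Tom, Barbara, Tim, Antonia
Debug "--- lines that occur exactly once ---"
ForEach lines()
  text = lines()
  If count(text) = 1
    Debug text
  EndIf
Next

; Reading B: keep the first copy of each line.
; Shows Tom, Barbara, Tim, Antonio, Frederic, Antonia
Debug "--- first copy of each line ---"
ForEach lines()
  text = lines()
  If count(text) > 0
    Debug text
    count(text) = 0   ; mark as already shown
  EndIf
Next

DataSection
  SampleNames:
  Data.s "Tom", "Barbara", "Tim", "Antonio", "Frederic"
  Data.s "Frederic", "Frederic", "Antonia", "Antonio"
  Data.s ""
EndDataSection
```

Both passes compare whole lines through the map key, so "Antonia" and "Antonio" stay distinct.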

Re: Remove all duplicate lines from file

Posted: Fri Feb 19, 2021 10:42 am
by mk-soft
I think that Maps are not the right choice here, because you lose the order and the lines can be very long. Even if it may take a little longer, I find lists better here.

Code: Select all

Global NewList Rows.s()
Global text.s, found

; Load data
Restore Lines
Repeat
  Read.s text
  If text = #ETX$
    Break
  EndIf
  ForEach Rows()
    If Rows() = text
      found = #True
      Break
    EndIf
  Next
  If found
    found = #False
  Else
    AddElement(Rows())
    Rows() = text
  EndIf
ForEver

; Output data
ForEach Rows()
  Debug Rows()
Next

DataSection
  Lines:
  Data.s "Tom"
  Data.s "Barbara"
  Data.s "Tim"
  Data.s "Antonio"
  Data.s "Frederic"
  Data.s "Frederic"
  Data.s "Frederic"
  Data.s "Antonia"
  Data.s "Antonio"
  Data.s #ETX$
EndDataSection

Re: Remove all duplicate lines from file

Posted: Sun Feb 21, 2021 2:23 am
by Demivec
Here's mine, for fun. It uses a map to remove duplicates and a list to output the results in the same line order as the original file. It also allows saving the result to a new file if desired:

Code: Select all

EnableExplicit
CompilerIf #PB_Compiler_Debugger =  0
  MessageRequester("Info", "Needs to be compiled with Debugger to show debug results." + #LF$ + "Save to File still possible.")
CompilerEndIf

Define pattern$ = "All supported formats|*.*;*.txt;", initpath$ = GetHomeDirectory()

NewMap textLines()
Define file$, fileID, lineNumber, textLine$

file$ = OpenFileRequester("Choose text file to process", initpath$ + "*.*", pattern$, 0)
If file$ 
  fileID = 0
  If ReadFile(fileID, file$)
    lineNumber = 0
    While Not Eof(FileID)
      lineNumber + 1
      textLine$ = ReadString(fileID)
      If FindMapElement(textLines(), textLine$) = #False
        textLines(textLine$) = lineNumber
      Else
        textLines(textLine$) = -1 ;signal to remove from list
      EndIf
    Wend
    CloseFile(FileID)
    
    Structure outputLine
      text.s
      lineNumber.i
    EndStructure
    
    NewList outputLines.outputLine()
    
    ForEach textLines()
      If textLines() > 0
        AddElement(outputLines())
        outputLines()\text = MapKey(textLines())
        outputLines()\lineNumber = textLines()
      EndIf
    Next
    
    ClearMap(textLines()) ;just some tidying up to free this copy of the data
    SortStructuredList(outputLines(), #PB_Sort_Ascending, OffsetOf(outputLine\lineNumber), #PB_Integer)
    CompilerIf #PB_Compiler_Debugger
      ForEach outputLines()
        Debug outputLines()\text
      Next
    CompilerEndIf
    MessageRequester("Results", "Number of lines read from file : " + lineNumber + #LF$ +
                                "Number of non-duplicate lines: " + ListSize(outputLines()))
  EndIf
  
  Define answer$, sFile$, saveFileID
  answer$ = InputRequester("Results", "Save results to a new file (Y/N)?", "No")
  If UCase(Left(Trim(answer$), 1)) = "Y"
    sFile$ = SaveFileRequester("Please choose file to save to", initpath$ + "", pattern$, 0)
    sFile$ = Trim(sFile$) ;Trim() returns the trimmed string, so assign the result
    If sFile$
      If FileSize(sFile$) <> -1
        MessageRequester("Error", "Unable to create save file. File already exists or is a folder.")
        End ;more extensive recovery could be implemented here but for a demo this is enough
      EndIf
      
      saveFileID = OpenFile(#PB_Any, sFile$)
      If saveFileID
        ForEach outputLines()
          WriteStringN(saveFileID, outputLines()\text)
        Next
        CloseFile(saveFileID)
      EndIf
    EndIf
  EndIf
EndIf