Remove all duplicate lines from file

Post by StarWarsFan »

I am trying to work out how to solve this. Let us assume I have a simple text file that contains

Code:

Tom
Barbara
Tim
Antonio
Frederic
Frederic
Frederic
Antonia
Antonio

What would be the most elegant way to remove all duplicate lines so the result shows

Code:

Tom
Barbara
Tim
Antonia

so 'Frederic' 3 times removed and 'Antonip' 2 times removed.

Or is there a module available that does that already?

Greetings to all here!

Post by Saki »

Hi, look for "Map" in the Handbook

Post by chi »

If you don't care about the order of the names, you could use a Map

Post by Mr.L »

Code:

Global NewMap Name.b()

Procedure AddName(name.s)
	If FindMapElement(Name(), name) = 0
		Name(name) = 1
	EndIf
EndProcedure

AddName("Tom")
AddName("Barbara")
AddName("Tim")
AddName("Antonio")
AddName("Frederic")
AddName("Frederic")
AddName("Frederic")
AddName("Antonia")
AddName("Antonio")

ForEach Name()
	Debug MapKey(Name())
Next

As chi stated, maps are unordered, but there are ways to solve that.
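
One way around the ordering problem, as a minimal sketch: use the map only to remember which names were already seen, and collect the names in a list, which keeps insertion order. This assumes each distinct name should be kept once, in first-seen order. The names `Seen()`, `Ordered()` and `AddNameOrdered()` are made up for this illustration.

```purebasic
; Sketch: the map answers "seen before?", the list preserves input order.
Global NewMap Seen.b()
Global NewList Ordered.s()

Procedure AddNameOrdered(name.s)
  If FindMapElement(Seen(), name) = 0
    Seen(name) = #True        ; remember the name
    LastElement(Ordered())    ; make sure we append at the end
    AddElement(Ordered())
    Ordered() = name
  EndIf
EndProcedure

AddNameOrdered("Tom")
AddNameOrdered("Barbara")
AddNameOrdered("Tim")
AddNameOrdered("Antonio")
AddNameOrdered("Frederic")
AddNameOrdered("Frederic")
AddNameOrdered("Frederic")
AddNameOrdered("Antonia")
AddNameOrdered("Antonio")

; Prints each distinct name once, in the order it first appeared
ForEach Ordered()
  Debug Ordered()
Next
```

The list costs extra memory, but the "seen before?" lookup stays a map lookup instead of a scan of the whole list.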

Post by #NULL »

Code:

NewMap names.i()

names("Tom") + 1
names("Barbara") + 1
names("Tim") + 1
names("Antonio") + 1
names("Frederic") + 1
names("Frederic") + 1
names("Frederic") + 1
names("Antonia") + 1
names("Antonio") + 1

ForEach names()
  If names() > 1
    DeleteMapElement(names())
  EndIf
Next

ForEach names()
  Debug MapKey(names())
Next

; Output (map order is not guaranteed):
;   Barbara
;   Tom
;   Tim
;   Antonia

Post by Kiffi »

Code:

Global NewMap Name.s()

Procedure AddName(Name.s)
  Name(Name) = Name
EndProcedure

AddName("Tom")
AddName("Barbara")
AddName("Tim")
AddName("Antonio")
AddName("Frederic")
AddName("Frederic")
AddName("Frederic")
AddName("Antonia")
AddName("Antonio")

ForEach Name()
  Debug Name()
Next

Post by AZJIO »

Code:

EnableExplicit


Declare SplitL(String.s, List StringList.s(), Separator.s = " ")

Define Text$

Text$ = GetClipboardText()


Global NewList aText.s()
SplitL(Text$, aText(), #LF$)

; ForEach aText()
; 	Debug aText()
; Next

NewMap uni.s()

ForEach aText()
	AddMapElement(uni(), aText(),  #PB_Map_ElementCheck)
Next

ForEach uni()
    Debug MapKey(uni())
Next



; wilbert
; https://www.purebasic.fr/english/viewtopic.php?p=486382#p486382
Procedure SplitL(String.s, List StringList.s(), Separator.s = " ")
 
  Protected S.String, *S.Integer = @S
  Protected.i p, slen
  slen = Len(Separator)
  ClearList(StringList())
 
  *S\i = @String
  Repeat
    AddElement(StringList())
    p = FindString(S\s, Separator)
    StringList() = PeekS(*S\i, p - 1)
    *S\i + (p + slen - 1) << #PB_Compiler_Unicode
  Until p = 0
  *S\i = 0
 
EndProcedure

Post by BarryG »

[Deleted, the request was a bit misleading in its wording]
Last edited by BarryG on Tue Feb 23, 2021 3:39 am, edited 1 time in total.

Post by TI-994A »

StarWarsFan wrote: ...assume I got a simple text file that shows

Code:

Tom
Barbara
Tim
Antonio
Frederic
Frederic
Frederic
Antonia
Antonio

...most elegant way to remove all duplicate lines so the result shows

Code:

Tom
Barbara
Tim
Antonia

Here's one way:

Code:

Define.s record, records, duplicate

; === create sample list from data ===

For i = 0 To 8
  Read.s record
  records + record + ","
Next i

DataSection
  Data.s "Tom", "Barbara", "Tim", "Antonio", "Frederic"
  Data.s "Frederic", "Frederic", "Antonia", "Antonio"
EndDataSection

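; === expunge every value that occurs more than once ===
; (nothing is sorted; note that CountString() also matches substrings,
;  so "Tom" would be counted inside "Tomas")

i = 1
While i <= CountString(records, ",")
  duplicate = StringField(records, i, ",")
  If CountString(records, duplicate) > 1
    records = RemoveString(records, duplicate + ",")  ; field i now holds the next value
  Else
    i + 1  ; only advance when nothing was removed, so no value is skipped
  EndIf
Wend

; === display trimmed results ===

For i = 1 To CountString(records, ",")  ; the trailing comma adds no extra field
  Debug StringField(records, i, ",")
Next i

And the same approach, for data read from file: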

Code:

Define.s record, records, duplicate

; === read records from file ===

If ReadFile(0, "duplicates.txt")
  
  While Not Eof(0)
    records + ReadString(0) + ","
  Wend
  CloseFile(0)  
  
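  ; === expunge every value that occurs more than once ===
  ; (nothing is sorted; CountString() also matches substrings)
  
  i = 1
  While i <= CountString(records, ",")
    duplicate = StringField(records, i, ",")
    If CountString(records, duplicate) > 1
      records = RemoveString(records, duplicate + ",")  ; field i now holds the next value
    Else
      i + 1  ; only advance when nothing was removed, so no value is skipped
    EndIf
  Wend
  
  ; === display trimmed results ===
    
  For i = 1 To CountString(records, ",")  ; the trailing comma adds no extra field
    Debug StringField(records, i, ",")
  Next i    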
  
EndIf

Post by BarryG »

TI-994A, yours has a bug: it doesn't keep Frederic in the list.

Post by TI-994A »

BarryG wrote: TI-994A, yours has a bug: it doesn't keep Frederic in the list.

That's the intention of the OP, I believe.

StarWarsFan wrote: What would be the most elegant way to remove all duplicate lines so the result shows

Code:

Tom
Barbara
Tim
Antonia

so 'Frederic' 3 times removed and 'Antonip' (should be Antonio) 2 times removed.

Post by BarryG »

Oh, I didn't notice that - sorry! I thought the OP actually wanted 1 x Frederic in the list because he said "no duplicates" - having one of something isn't a duplicate.

Post by TI-994A »

BarryG wrote: ...I thought the OP actually wanted 1 x Frederic in the list because he said "no duplicates" - having one of something isn't a duplicate.

My sentiments exactly. Especially with the Antonia/Antonio confusion. :lol:
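
The stricter reading discussed above (a line that occurs more than once is dropped entirely, not reduced to one copy) can be sketched with a count per line plus a second pass. This is only an illustration using the sample names from the first post; `Occurrences()`, `Rows()`, `Kept()` and the empty-string end marker are assumptions of the sketch, not something from the thread.

```purebasic
; Sketch: drop every value that occurs more than once, keeping input order.
NewMap Occurrences.i()
NewList Rows.s()
NewList Kept.s()
Define text.s

Restore Names
Repeat
  Read.s text
  If text = ""              ; an empty string marks the end of the sample data
    Break
  EndIf
  Occurrences(text) + 1     ; first pass: count how often each value occurs
  AddElement(Rows())        ; and remember the input order
  Rows() = text
ForEver

ForEach Rows()
  If Occurrences(Rows()) = 1  ; second pass: keep values seen exactly once
    AddElement(Kept())
    Kept() = Rows()
  EndIf
Next

; Prints Tom, Barbara, Tim, Antonia
ForEach Kept()
  Debug Kept()
Next

DataSection
  Names:
  Data.s "Tom", "Barbara", "Tim", "Antonio", "Frederic"
  Data.s "Frederic", "Frederic", "Antonia", "Antonio"
  Data.s ""
EndDataSection
```

Changing the test to `Occurrences(Rows()) >= 1` and keeping only the first hit would give the other reading, where one copy of each line survives.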

Post by mk-soft »

I think that maps are not the right choice here, because you lose the order and the lines can be very long. Even if it may take a little longer, I find lists better here.

Code:

Global NewList Rows.s()
Global text.s, found

; Load data
Restore Lines
Repeat
  Read.s text
  If text = #ETX$
    Break
  EndIf
  ForEach Rows()
    If Rows() = text
      found = #True
      Break
    EndIf
  Next
  If found
    found = #False
  Else
    AddElement(Rows())
    Rows() = text
  EndIf
ForEver

; Output data
ForEach Rows()
  Debug Rows()
Next

DataSection
  Lines:
  Data.s "Tom"
  Data.s "Barbara"
  Data.s "Tim"
  Data.s "Antonio"
  Data.s "Frederic"
  Data.s "Frederic"
  Data.s "Frederic"
  Data.s "Antonia"
  Data.s "Antonio"
  Data.s #ETX$
EndDataSection

Post by Demivec »

Here's mine, for fun. It uses a map to remove duplicates and a list to output the results in the same line order as the original file. It also allows saving the results to a new file if desired:

Code:

EnableExplicit
CompilerIf #PB_Compiler_Debugger =  0
  MessageRequester("Info", "Needs to be compiled with Debugger to show debug results." + #LF$ + "Save to File still possible.")
CompilerEndIf

Define pattern$ = "All supported formats|*.*;*.txt;", initpath$ = GetHomeDirectory()

NewMap textLines()
Define file$, fileID, lineNumber, textLine$

file$ = OpenFileRequester("Choose text file to process", initpath$ + "*.*", pattern$, 0)
If file$ 
  fileID = 0
  If ReadFile(fileID, file$)
    lineNumber = 0
    While Not Eof(FileID)
      lineNumber + 1
      textLine$ = ReadString(fileID)
      If FindMapElement(textLines(), textLine$) = #False
        textLines(textLine$) = lineNumber
      Else
        textLines(textLine$) = -1 ;signal to remove from list
      EndIf
    Wend
    CloseFile(FileID)
    
    Structure outputLine
      text.s
      lineNumber.i
    EndStructure
    
    NewList outputLines.outputLine()
    
    ForEach textLines()
      If textLines() > 0
        AddElement(outputLines())
        outputLines()\text = MapKey(textLines())
        outputLines()\lineNumber = textLines()
      EndIf
    Next
    
    ClearMap(textLines()) ;just some tidying up to free this copy of the data
    SortStructuredList(outputLines(), #PB_Sort_Ascending, OffsetOf(outputLine\lineNumber), #PB_Integer)
    CompilerIf #PB_Compiler_Debugger
      ForEach outputLines()
        Debug outputLines()\text
      Next
    CompilerEndIf
    MessageRequester("Results", "Number of lines read from file : " + lineNumber + #LF$ +
                                "Number of non-duplicate lines: " + ListSize(outputLines()))
  EndIf
  
  Define answer$, sFile$, saveFileID
  answer$ = InputRequester("Results", "Save results to a new file (Y/N)?", "No")
  If UCase(Left(Trim(answer$), 1)) = "Y"
    sFile$ = SaveFileRequester("Please choose file to save to", initpath$ + "", pattern$, 0)
    sFile$ = Trim(sFile$) ;keep the trimmed result (it was discarded before)
    If sFile$
      If FileSize(sFile$) <> -1
        MessageRequester("Error", "Unable to create save file. File already exists or is a folder.")
        End ;more extensive recovery could be implemented here, but for a demo this is enough
      EndIf
      
      saveFileID = OpenFile(#PB_Any, sFile$)
      If saveFileID
        ForEach outputLines()
          WriteStringN(saveFileID, outputLines()\text)
        Next
        CloseFile(saveFileID)
      EndIf
    EndIf
  EndIf
EndIf