CSV and Quotes

Share your advanced PureBasic knowledge/code with the community.
User avatar
Kiffi
Addict
Addict
Posts: 1485
Joined: Tue Mar 02, 2004 1:20 pm
Location: Amphibios 9

Post by Kiffi »

pdwyer wrote:okay, here is my first port attempt
whow! The code is quite fast. :D

My first test shows 0.1 Seconds per 1000 lines. Your first code takes 2
Seconds per 1000 lines. (tested with a 280 MB CSV file)

Unfortunately multiple lines are not supported. But the code
is a good starting point... ;-)

Thanks for your help!

Greetings ... Kiffi
Hygge
User avatar
DoubleDutch
Addict
Addict
Posts: 3220
Joined: Thu Aug 07, 2003 7:01 pm
Location: United Kingdom
Contact:

Post by DoubleDutch »

Here is the one I wrote a while back to solve the quotes problem.

Code: Select all

Procedure.s X_StringField(string$,no,seperator$=",")
	done=#False
	count=1
	len=Len(string$)
	pos=0
	result$=""
	quotes=#False
	Repeat
		pos+1
		If pos>len
			done=#True
		Else
			ch$=Mid(string$,pos,1)
			If ch$=seperator$ And (Not quotes)
				If count=no
					done=#True
				Else
					result$=""
					count+1
				EndIf
			Else
				result$+ch$
				If ch$=Chr(34)
					quotes!#True
				EndIf
			EndIf
		EndIf
	Until done
	result$=Trim(result$)
	If count=no
		If Left(result$,1)=Chr(34)
			result$=Mid(result$,2)
		EndIf
		If Right(result$,1)=Chr(34)
			result$=Left(result$,Len(result$)-1)
		EndIf
	Else
		result$=""
	EndIf
	result$=Trim(result$)
	ProcedureReturn result$
EndProcedure
Seems to work ok for me. :D
https://deluxepixel.com <- My Business website
https://reportcomplete.com <- School end of term reports system
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

Kiffi wrote:whow! The code is quite fast. :D

My first test shows 0.1 Seconds per 1000 lines. Your first code takes 2
Seconds per 1000 lines. (tested with a 280 MB CSV file)
:shock: news to me too, I hadn't tested that far. not my proc though so I can't take the credit :P
Kiffi wrote: Unfortunately multiple lines are not supported. But the code
is a good starting point... ;-)
I'm not so sure I want to support that. I saw that in the spec too. I have a fast split function that I would use on the crlf first and it wouldn't work if CSV had that in it. Personally, if I thought I was going to work with fields with crlfs in there I wouldn't use CSV, I'd use a DB (eg sqlite) or a format that doesn't use crlf as a line delimeter (ie has no sense of a line as such).

Either that or I'd have some escape char for a in-field crlf and put it back in later

@DoubleDutch, haven't looked at this yet, but theres some example csv on that page I linked, it looks like it would make a good basic test for complience, I try some of these tonight when I get home
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

There are two bugs in the code still (that I know of) and one bug I fixed in the previous code (an empty line would not clear the array if there was data so no 6 failed.

On the table at the top here http://www.xbeat.net/vbspeed/c_ParseCSV.php it still fails to correctly format no 18 and 22

Here is the test csv sheet

Code: Select all

a,b,c
"a",b,c
'a',b,c
 a , b , c 
aa,bb;cc

a
,b,
,,c
,,
"",b
" ",b
"a,b"
"a,b",c
" a , b ", c 
a b,c
a"b,c
"a""b",c
a""b,c
a,b",c
a,b"",c
a,"B: ""Hi, I'm B""",c

I'm still not quite sure how easy this is to fix, the proc works differently in PB due to the original use of lenb and midb in VB with unicode
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Post by pdwyer »

I think I have the bugs out now, it seems to do what I want it to do, there's still some room for performance increase so I might have another look at that later.

The code at the very start of the thread has been updated with the latest version as it was starting to get confusing for people coming later as to what version I was talking about :lol:
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
peterb
User
User
Posts: 60
Joined: Sun Oct 02, 2005 8:55 am
Location: Czech Republic
Contact:

Post by peterb »

Enjoy,

peterb

Code: Select all


;- Author   : Petr Vavrin (peterb)
;- Location : Czech Republic
;- Email    : pb.pb (at) centrum (dot) cz 

Global characters = 0

#CSV_PARSER_COLUMNS = 200
 
Structure _CSVParseGlobals
  numberOfColumns.l
  Column$[#CSV_PARSER_COLUMNS]
EndStructure

Global CSVParse._CSVParseGlobals

Procedure ParseCSV ( csv_line.s, delimiter )

  numberOfColumns       = 0
  in_column             = #False
  column_string.s       = ""
  CSVParse\Column$[0]   = ""
 
  line_length = Len ( csv_line ) - 1
  
  For c = 0 To line_length
    char  = PeekB ( @csv_line + c )

    characters + 1 ; remove this line - for testing only

    If char = delimiter
      If c = 0
        numberOfColumns + 1
      
      ElseIf in_column And prev_char = '"'
        in_column     = #False
        CSVParse\Column$[ numberOfColumns ] = Left ( column_string, Len ( column_string ) - 1 )
        numberOfColumns + 1
        column_string = ""
      
      ElseIf in_column = #False
        CSVParse\Column$[ numberOfColumns ] = column_string
        numberOfColumns + 1
        column_string = ""
      Else
        column_string + Chr ( char )
      
      EndIf
    
    ElseIf char = '"'
      If c = 0
        in_column = #True
      
      ElseIf in_column And c = line_length
        in_column = #False
        
      ElseIf in_column = #False And prev_char = delimiter
        in_column = #True

      Else
        column_string + Chr ( char )

      EndIf
     
    Else
      column_string + Chr ( char )
    EndIf
    
    If char <> 32
      prev_char = char
    EndIf
  
  Next

  CSVParse\Column$[ numberOfColumns ] = column_string
  CSVParse\numberOfColumns            = numberOfColumns + 1
EndProcedure


; --- speed test ---

start = GetTickCount_()

For x = 1 To 10000
  pointer = ?start_data
  While pointer < ?end_data
    text.s = PeekS ( pointer )
    pointer + StringByteLength ( text ) + 1
    ParseCSV ( text, ',' )
  Wend
Next
time = GetTickCount_() - start

MessageRequester("", "time: " + Str ( time ) + " ms" + Chr( 10 ) + "characters: " + Str ( characters ) + Chr(10) + Str ( characters/time ) + " chr / ms ")

Debug "time: " + Str ( time ) + " ms"
Debug "characters: " + Str ( characters )
Debug Str ( characters/time ) + " chr / ms "
Debug ""

; --- show results ---

pointer = ?start_data
While pointer < ?end_data
  text.s = PeekS ( pointer )
  pointer + StringByteLength ( text ) + 1
  
  ParseCSV ( text, ',' )
  
  OUT.s = ""
  For i = 0 To CSVParse\numberOfColumns - 1
    OUT + CSVParse\Column$[i] + " | "
  Next 

  Debug text
  Debug OUT
  Debug ""
 
Wend

; --- source data ---

DataSection
  start_data:
    Data.s "a,b,c"
    Data.s Chr(34) + "a" + Chr(34) + ",b,c"
    Data.s "'a',b,c"
    Data.s "a , b , c"
    Data.s "aa,bb;cc"
    Data.s ""
    Data.s "a"
    Data.s ",b,"
    Data.s ",,c"
    Data.s ",,"
    Data.s Chr(34) + Chr(34) + ",b"
    Data.s Chr(34) + " " + Chr(34) + ",b"
    Data.s Chr(34) + "a,b" + Chr(34)
    Data.s Chr(34) + "a,b" + Chr(34) + ",c"
    Data.s Chr(34) + " a , b " + Chr(34) + ", c"
    Data.s "a b,c"
    Data.s "a" + Chr(34) + "b,c"
    Data.s Chr(34) + "a" + Chr(34) + Chr(34) + "b" + Chr(34) + ",c"
    Data.s "a" + Chr(34) + Chr(34) + "b,c"
    Data.s "a,b" + Chr(34) + ",c"
    Data.s "a,b"+ Chr(34) + Chr(34) + ",c"
    Data.s "a," + Chr(34) + "B: " + Chr(34) + Chr(34) + "Hi, I'm B" + Chr(34) + Chr(34) + Chr(34) +",c"
  end_data:
EndDataSection    

User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Re: CSV and Quotes

Post by pdwyer »

I'm looking for a way to speed this up (a lot)

One thing I did find which seems to be a bit faster and a lot less code is a regex version...
If anyone knows of anything significantly faster, please share. In these days of bigger data this need is getting more common.

The magic of the regex for this code comes from the same link in the original post

Code: Select all

Declare.l ParseCSV02(sExpr.s, Array CSVFieldVals.s(1)) 


If CreateRegularExpression(0,"(\s*"+Chr(34)+"[^"+Chr(34)+"]*"+Chr(34)+"\s*,)|(\s*[^,]*\s*,)")
    
    Dim Vals.s(0)
    OpenFile(1,"F:\Programming\PureBasicCode\csv.csv")  ;change this!
    While Not Eof(1)
    
        CSVString.s = ReadString(1)
        ValCount = ParseCSV02(CSVString,Vals())
        Debug "Column Count: " + Str(ValCount) + "      " + CSVString
        
        For i = 0 To valcount -1
            Debug vals(i)
        Next
        
    Wend    
    CloseFile(1)

Else
    Debug RegularExpressionError()
EndIf




Procedure.l ParseCSV02(sExpr.s, Array CSVFieldVals.s(1)) 
    mc.l = ExtractRegularExpression(0,sExpr + ",",CSVFieldVals())
    
    For i = 0 To mc - 1
        CSVFieldVals(i) = Left(CSVFieldVals(i),Len(CSVFieldVals(i))-1)    
    Next

    ProcedureReturn mc
EndProcedure
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
User avatar
mk-soft
Always Here
Always Here
Posts: 6205
Joined: Fri May 12, 2006 6:51 pm
Location: Germany

Re: CSV and Quotes

Post by mk-soft »

I don't know ist faster, but i use this

Link: viewtopic.php?f=12&t=69557
My Projects ThreadToGUI / OOP-BaseClass / EventDesigner V3
PB v3.30 / v5.75 - OS Mac Mini OSX 10.xx - VM Window Pro / Linux Ubuntu
Downloads on my Webspace / OneDrive
User avatar
pdwyer
Addict
Addict
Posts: 2813
Joined: Tue May 08, 2007 1:27 pm
Location: Chiba, Japan

Re: CSV and Quotes

Post by pdwyer »

it is faster, thanks
Paul Dwyer

“In nature, it’s not the strongest nor the most intelligent who survives. It’s the most adaptable to change” - Charles Darwin
“If you can't explain it to a six-year old you really don't understand it yourself.” - Albert Einstein
Post Reply