Page 2 of 2

Posted: Wed May 21, 2008 4:05 pm
by Kiffi
pdwyer wrote:okay, here is my first port attempt
whow! The code is quite fast. :D

My first test shows 0.1 Seconds per 1000 lines. Your first code takes 2
Seconds per 1000 lines. (tested with a 280 MB CSV file)

Unfortunately multiple lines are not supported. But the code
is a good starting point... ;-)

Thanks for your help!

Greetings ... Kiffi

Posted: Wed May 21, 2008 4:16 pm
by DoubleDutch
Here is the one I wrote a while back to solve the quotes problem.

Code: Select all

Procedure.s X_StringField(string$,no,seperator$=",")
	done=#False
	count=1
	len=Len(string$)
	pos=0
	result$=""
	quotes=#False
	Repeat
		pos+1
		If pos>len
			done=#True
		Else
			ch$=Mid(string$,pos,1)
			If ch$=seperator$ And (Not quotes)
				If count=no
					done=#True
				Else
					result$=""
					count+1
				EndIf
			Else
				result$+ch$
				If ch$=Chr(34)
					quotes!#True
				EndIf
			EndIf
		EndIf
	Until done
	result$=Trim(result$)
	If count=no
		If Left(result$,1)=Chr(34)
			result$=Mid(result$,2)
		EndIf
		If Right(result$,1)=Chr(34)
			result$=Left(result$,Len(result$)-1)
		EndIf
	Else
		result$=""
	EndIf
	result$=Trim(result$)
	ProcedureReturn result$
EndProcedure
Seems to work ok for me. :D

Posted: Thu May 22, 2008 12:17 am
by pdwyer
Kiffi wrote:whow! The code is quite fast. :D

My first test shows 0.1 Seconds per 1000 lines. Your first code takes 2
Seconds per 1000 lines. (tested with a 280 MB CSV file)
:shock: news to me too, I hadn't tested that far. not my proc though so I can't take the credit :P
Kiffi wrote: Unfortunately multiple lines are not supported. But the code
is a good starting point... ;-)
I'm not so sure I want to support that. I saw that in the spec too. I have a fast split function that I would use on the crlf first and it wouldn't work if CSV had that in it. Personally, if I thought I was going to work with fields with crlfs in there I wouldn't use CSV, I'd use a DB (eg sqlite) or a format that doesn't use crlf as a line delimeter (ie has no sense of a line as such).

Either that or I'd have some escape char for a in-field crlf and put it back in later

@DoubleDutch, haven't looked at this yet, but theres some example csv on that page I linked, it looks like it would make a good basic test for complience, I try some of these tonight when I get home

Posted: Thu May 22, 2008 10:55 am
by pdwyer
There are two bugs in the code still (that I know of) and one bug I fixed in the previous code (an empty line would not clear the array if there was data so no 6 failed.

On the table at the top here http://www.xbeat.net/vbspeed/c_ParseCSV.php it still fails to correctly format no 18 and 22

Here is the test csv sheet

Code: Select all

a,b,c
"a",b,c
'a',b,c
 a , b , c 
aa,bb;cc

a
,b,
,,c
,,
"",b
" ",b
"a,b"
"a,b",c
" a , b ", c 
a b,c
a"b,c
"a""b",c
a""b,c
a,b",c
a,b"",c
a,"B: ""Hi, I'm B""",c

I'm still not quite sure how easy this is to fix, the proc works differently in PB due to the original use of lenb and midb in VB with unicode

Posted: Fri May 23, 2008 2:16 pm
by pdwyer
I think I have the bugs out now, it seems to do what I want it to do, there's still some room for performance increase so I might have another look at that later.

The code at the very start of the thread has been updated with the latest version as it was starting to get confusing for people coming later as to what version I was talking about :lol:

Posted: Tue Jul 01, 2008 11:33 am
by peterb
Enjoy,

peterb

Code: Select all


;- Author   : Petr Vavrin (peterb)
;- Location : Czech Republic
;- Email    : pb.pb (at) centrum (dot) cz 

Global characters = 0

#CSV_PARSER_COLUMNS = 200
 
Structure _CSVParseGlobals
  numberOfColumns.l
  Column$[#CSV_PARSER_COLUMNS]
EndStructure

Global CSVParse._CSVParseGlobals

Procedure ParseCSV ( csv_line.s, delimiter )

  numberOfColumns       = 0
  in_column             = #False
  column_string.s       = ""
  CSVParse\Column$[0]   = ""
 
  line_length = Len ( csv_line ) - 1
  
  For c = 0 To line_length
    char  = PeekB ( @csv_line + c )

    characters + 1 ; remove this line - for testing only

    If char = delimiter
      If c = 0
        numberOfColumns + 1
      
      ElseIf in_column And prev_char = '"'
        in_column     = #False
        CSVParse\Column$[ numberOfColumns ] = Left ( column_string, Len ( column_string ) - 1 )
        numberOfColumns + 1
        column_string = ""
      
      ElseIf in_column = #False
        CSVParse\Column$[ numberOfColumns ] = column_string
        numberOfColumns + 1
        column_string = ""
      Else
        column_string + Chr ( char )
      
      EndIf
    
    ElseIf char = '"'
      If c = 0
        in_column = #True
      
      ElseIf in_column And c = line_length
        in_column = #False
        
      ElseIf in_column = #False And prev_char = delimiter
        in_column = #True

      Else
        column_string + Chr ( char )

      EndIf
     
    Else
      column_string + Chr ( char )
    EndIf
    
    If char <> 32
      prev_char = char
    EndIf
  
  Next

  CSVParse\Column$[ numberOfColumns ] = column_string
  CSVParse\numberOfColumns            = numberOfColumns + 1
EndProcedure


; --- speed test ---

start = GetTickCount_()

For x = 1 To 10000
  pointer = ?start_data
  While pointer < ?end_data
    text.s = PeekS ( pointer )
    pointer + StringByteLength ( text ) + 1
    ParseCSV ( text, ',' )
  Wend
Next
time = GetTickCount_() - start

MessageRequester("", "time: " + Str ( time ) + " ms" + Chr( 10 ) + "characters: " + Str ( characters ) + Chr(10) + Str ( characters/time ) + " chr / ms ")

Debug "time: " + Str ( time ) + " ms"
Debug "characters: " + Str ( characters )
Debug Str ( characters/time ) + " chr / ms "
Debug ""

; --- show results ---

pointer = ?start_data
While pointer < ?end_data
  text.s = PeekS ( pointer )
  pointer + StringByteLength ( text ) + 1
  
  ParseCSV ( text, ',' )
  
  OUT.s = ""
  For i = 0 To CSVParse\numberOfColumns - 1
    OUT + CSVParse\Column$[i] + " | "
  Next 

  Debug text
  Debug OUT
  Debug ""
 
Wend

; --- source data ---

DataSection
  start_data:
    Data.s "a,b,c"
    Data.s Chr(34) + "a" + Chr(34) + ",b,c"
    Data.s "'a',b,c"
    Data.s "a , b , c"
    Data.s "aa,bb;cc"
    Data.s ""
    Data.s "a"
    Data.s ",b,"
    Data.s ",,c"
    Data.s ",,"
    Data.s Chr(34) + Chr(34) + ",b"
    Data.s Chr(34) + " " + Chr(34) + ",b"
    Data.s Chr(34) + "a,b" + Chr(34)
    Data.s Chr(34) + "a,b" + Chr(34) + ",c"
    Data.s Chr(34) + " a , b " + Chr(34) + ", c"
    Data.s "a b,c"
    Data.s "a" + Chr(34) + "b,c"
    Data.s Chr(34) + "a" + Chr(34) + Chr(34) + "b" + Chr(34) + ",c"
    Data.s "a" + Chr(34) + Chr(34) + "b,c"
    Data.s "a,b" + Chr(34) + ",c"
    Data.s "a,b"+ Chr(34) + Chr(34) + ",c"
    Data.s "a," + Chr(34) + "B: " + Chr(34) + Chr(34) + "Hi, I'm B" + Chr(34) + Chr(34) + Chr(34) +",c"
  end_data:
EndDataSection    


Re: CSV and Quotes

Posted: Sat Aug 07, 2021 1:27 pm
by pdwyer
I'm looking for a way to speed this up (a lot)

One thing I did find which seems to be a bit faster and a lot less code is a regex version...
If anyone knows of anything significantly faster, please share. In these days of bigger data this need is getting more common.

The magic of the regex for this code comes from the same link in the original post

Code: Select all

Declare.l ParseCSV02(sExpr.s, Array CSVFieldVals.s(1)) 


If CreateRegularExpression(0,"(\s*"+Chr(34)+"[^"+Chr(34)+"]*"+Chr(34)+"\s*,)|(\s*[^,]*\s*,)")
    
    Dim Vals.s(0)
    OpenFile(1,"F:\Programming\PureBasicCode\csv.csv")  ;change this!
    While Not Eof(1)
    
        CSVString.s = ReadString(1)
        ValCount = ParseCSV02(CSVString,Vals())
        Debug "Column Count: " + Str(ValCount) + "      " + CSVString
        
        For i = 0 To valcount -1
            Debug vals(i)
        Next
        
    Wend    
    CloseFile(1)

Else
    Debug RegularExpressionError()
EndIf




Procedure.l ParseCSV02(sExpr.s, Array CSVFieldVals.s(1)) 
    mc.l = ExtractRegularExpression(0,sExpr + ",",CSVFieldVals())
    
    For i = 0 To mc - 1
        CSVFieldVals(i) = Left(CSVFieldVals(i),Len(CSVFieldVals(i))-1)    
    Next

    ProcedureReturn mc
EndProcedure

Re: CSV and Quotes

Posted: Sat Aug 07, 2021 2:36 pm
by mk-soft
I don't know ist faster, but i use this

Link: viewtopic.php?f=12&t=69557

Re: CSV and Quotes

Posted: Sun Aug 08, 2021 2:48 am
by pdwyer
it is faster, thanks