XML and big files

Posted: Fri Sep 27, 2013 10:28 am
by dige
Hi folks,

Is it possible to load XML as a stream? It seems that
LoadXML() always loads the whole file. I have a 1.8GB file with
data from openstreetmap and would like to parse the ways and nodes
of buildings. But LoadXML crashes after 6 hours.

Unfortunately, the amount of data cannot be reduced, unless I load
the file with ReadFile() and parse everything with FindString() etc. ...

What is the recommended approach for bigger XML files?

Re: XML and big files

Posted: Fri Sep 27, 2013 10:35 am
by STARGÅTE
You can open the file with ReadFile() and load it in blocks with ReadData(), for example only 10 MB at a time.
Then you can use CatchXML(#XML, *Address, Length [, Flags [, Encoding]]) with the flags:
#PB_XML_StreamStart
#PB_XML_StreamNext
#PB_XML_StreamEnd
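
A minimal sketch of that loop, assuming the first block gets #PB_XML_StreamStart, every following block #PB_XML_StreamNext and the last block #PB_XML_StreamEnd. The file name, the block size and the fallback for files that fit into a single block are assumptions for illustration:

```purebasic
#BlockSize = 10 * 1024 * 1024   ; 10 MB per block (assumed value)

If ReadFile(0, "c:\test.xml")   ; placeholder file name
  *Buffer = AllocateMemory(#BlockSize)
  If *Buffer And Lof(0) > 0
    FirstBlock = #True
    Repeat
      BytesRead = ReadData(0, *Buffer, #BlockSize)
      LastBlock = Eof(0)

      If FirstBlock And LastBlock
        Flags = 0                     ; whole file fits in one block: no streaming needed
      ElseIf FirstBlock
        Flags = #PB_XML_StreamStart
      ElseIf LastBlock
        Flags = #PB_XML_StreamEnd
      Else
        Flags = #PB_XML_StreamNext
      EndIf

      CatchXML(0, *Buffer, BytesRead, Flags)
      FirstBlock = #False
    Until LastBlock

    ; check the result once all blocks have been passed in
    If IsXML(0)
      If XMLStatus(0) <> #PB_XML_Success
        Debug "XML error (Line " + Str(XMLErrorLine(0)) + "): " + XMLError(0)
      EndIf
      FreeXML(0)
    EndIf
  EndIf

  If *Buffer
    FreeMemory(*Buffer)
  EndIf
  CloseFile(0)
EndIf
```

Only the read buffer is limited to the block size here; the parsed tree itself still grows with the file.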

Re: XML and big files

Posted: Fri Sep 27, 2013 11:37 am
by dige
Cooool!!! :-D Thank you very much, STARGÅTE!

Re: XML and big files

Posted: Fri Sep 27, 2013 4:02 pm
by freak
Note that while this reads the input in blocks, the previously scanned data still remains in memory, so at the end you have the entire XML file in memory. With such a large file, that is not practical.

You can use the expat parser directly: It is a "streaming parser", which means you register callbacks for the information that you need and then you can parse the file in blocks and don't need to keep the entire file around.

The expat functions are available directly in PB with a "pb_" prefix.
The documentation is available here: http://expat.cvs.sourceforge.net/viewvc ... rence.html

Here is a quick example:

; Expat returns UTF8-Strings in ascii mode and unicode strings in unicode mode
CompilerIf #PB_Compiler_Unicode
  Macro PeekExpat(ptr)
    PeekS(ptr)
  EndMacro
CompilerElse
  Macro PeekExpat(ptr)
    PeekS(ptr, -1, #PB_UTF8)
  EndMacro
CompilerEndIf


ProcedureC StartElementHandler(user_data, *name, *args)
  Debug "Start: " + PeekExpat(*name)
  
  ; Attribute values are an array of pointers with alternating name and value entries
  ; Terminated by null pointer
  *arg.INTEGER = *args
  While *arg\i <> 0
    Name$ = PeekExpat(*arg\i)
    *arg + SizeOf(Integer)
    Value$ = PeekExpat(*arg\i)
    *arg + SizeOf(Integer)
    Debug "             " + Name$ + "=" + Value$
  Wend  
EndProcedure

ProcedureC EndElementHandler(user_data, *name)
  Debug "End: " + PeekExpat(*name)
EndProcedure



If ReadFile(0, "c:\test.xml")

  ; initialize parser
  Parser = pb_XML_ParserCreate_(0)
  
  pb_XML_SetStartElementHandler_(Parser, @StartElementHandler())
  pb_XML_SetEndElementHandler_(Parser, @EndElementHandler())
  
  ; Block size for streaming. This is very small as an example. Use something larger, like 1 MB, for real files!
  BufferSize = 20
  *Buffer = AllocateMemory(BufferSize)

  While Not Eof(0)
    BytesRead = ReadData(0, *Buffer, BufferSize)
    If BytesRead > 0
      If pb_XML_Parse_(Parser, *Buffer, BytesRead, #False) = #XML_STATUS_ERROR
        ; parser error (message is in ascii)
        Debug "Parser Error (Line " + Str(pb_XML_GetCurrentLineNumber_(Parser)) + "): " + PeekS(pb_XML_ErrorString_(pb_XML_GetErrorCode_(Parser)), -1, #PB_Ascii)
        Break
      EndIf
    EndIf
  Wend
  
  ; important: finish the parsing process
  pb_XML_Parse_(Parser, *Buffer, 0, #True)
  pb_XML_ParserFree_(Parser)
  
  FreeMemory(*Buffer)  
  
  CloseFile(0)
Else
  Debug "Cannot open file"
EndIf
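
For the OpenStreetMap case from the first post, the handlers could be adapted roughly as follows. This is only a sketch: it assumes the usual OSM XML layout, where a `way` element contains `nd` children with a `ref` attribute and `tag` children with `k`/`v` attributes, and it treats every way that carries a `building` key as a building. GetAttribute(), OsmStartHandler() and OsmEndHandler() are hypothetical names, and the code relies on the PeekExpat() macro defined above.

```purebasic
; State shared between the two handlers (hypothetical names)
Global InWay, IsBuilding, WayId$, NodeRefs$

; Returns the value of attribute Key$, or "" if it is not present.
; *args is the null-terminated name/value pointer array passed by expat.
Procedure.s GetAttribute(*args, Key$)
  Protected *arg.INTEGER = *args
  While *arg\i <> 0
    If PeekExpat(*arg\i) = Key$
      *arg + SizeOf(Integer)
      ProcedureReturn PeekExpat(*arg\i)
    EndIf
    *arg + 2 * SizeOf(Integer)   ; skip this name/value pair
  Wend
  ProcedureReturn ""
EndProcedure

ProcedureC OsmStartHandler(user_data, *name, *args)
  Protected Name$ = PeekExpat(*name)
  If Name$ = "way"
    ; a new way starts: reset the collected data
    InWay = #True
    IsBuilding = #False
    WayId$ = GetAttribute(*args, "id")
    NodeRefs$ = ""
  ElseIf InWay And Name$ = "nd"
    ; remember the node references of the current way
    NodeRefs$ + GetAttribute(*args, "ref") + " "
  ElseIf InWay And Name$ = "tag"
    If GetAttribute(*args, "k") = "building"
      IsBuilding = #True
    EndIf
  EndIf
EndProcedure

ProcedureC OsmEndHandler(user_data, *name)
  If PeekExpat(*name) = "way"
    If IsBuilding
      Debug "Building way " + WayId$ + ": " + NodeRefs$
    EndIf
    InWay = #False
  EndIf
EndProcedure
```

To try it, pass @OsmStartHandler() and @OsmEndHandler() to pb_XML_SetStartElementHandler_() and pb_XML_SetEndElementHandler_() in place of the handlers from the example. Only the data of the way currently being parsed is kept, so memory use does not grow with the file size. Resolving the node references to coordinates would need an extra lookup of the `node` elements, which is not shown here.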

Re: XML and big files

Posted: Mon Sep 30, 2013 2:53 pm
by dige
Thank you freak, it fits my needs :-)

Re: XML and big files

Posted: Thu Oct 03, 2013 9:34 pm
by said
Thank you freak for this trick, very helpful (and very much needed when dealing with big XML files). Small question: is this cross platform?

Re: XML and big files

Posted: Fri Oct 04, 2013 1:36 am
by freak
> is this cross platform?

Yes.