XML and big files

Post by dige »

Hi folks,

Is it possible to load XML as a stream? It seems that LoadXML()
always loads the whole file. I have a 1.8 GB file with OpenStreetMap
data and would like to parse the ways and nodes of buildings,
but LoadXML() crashes after 6 hours.

Unfortunately, the amount of data cannot be reduced, unless I load
the file with ReadFile() and parse everything with FindString() etc.

What is recommended for larger XML files?

Post by STARGÅTE »

You can open the file with ReadFile() and load it in blocks with ReadData(), for example 10 MB at a time.
Then you can use CatchXML(#XML, *Address, Size [, Flags [, Encoding]]) with these flags:
#PB_XML_StreamStart
#PB_XML_StreamNext
#PB_XML_StreamEnd
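
A minimal sketch of how such a block-wise loop might look. The file name, the fixed XML number 0 and the block size are placeholders, and the handling of a file that fits into a single block (falling back to a plain CatchXML() call without stream flags) is my own assumption, so treat this as an illustration rather than tested code. The resulting XML tree is still built completely in memory.

```purebasic
; Sketch: feed a file to CatchXML() block by block using the stream flags.
; "c:\test.xml" and the 10 MB block size are placeholders.
#BlockSize = 10 * 1024 * 1024

If ReadFile(0, "c:\test.xml")
  *Block = AllocateMemory(#BlockSize)
  If *Block
    Flags = #PB_XML_StreamStart          ; the first block starts the stream

    While Not Eof(0)
      BytesRead = ReadData(0, *Block, #BlockSize)

      If Eof(0)                          ; this was the last block of the file
        If Flags = #PB_XML_StreamStart
          Flags = 0                      ; assumption: single block, catch it normally
        Else
          Flags = #PB_XML_StreamEnd
        EndIf
      EndIf

      CatchXML(0, *Block, BytesRead, Flags)
      Flags = #PB_XML_StreamNext         ; all following blocks continue the stream
    Wend

    FreeMemory(*Block)
  EndIf
  CloseFile(0)

  ; Check the result once all blocks have been passed in
  If IsXML(0)
    If XMLStatus(0) = #PB_XML_Success
      Debug "Root node: " + GetXMLNodeName(MainXMLNode(0))
    Else
      Debug "XML error: " + XMLError(0) + " (line " + Str(XMLErrorLine(0)) + ")"
    EndIf
    FreeXML(0)
  EndIf
Else
  Debug "Cannot open file"
EndIf
```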

Post by dige »

Cool!!! :-D Thank you very much, STARGÅTE!

Post by freak »

Note that while this reads the input in blocks, the previously scanned data still remains in memory, so at the end you have the entire XML file in memory. With such a large file, that is not practical.

You can use the expat parser directly: It is a "streaming parser", which means you register callbacks for the information that you need and then you can parse the file in blocks and don't need to keep the entire file around.

The expat functions are available directly in PB with a "pb_" prefix.
The documentation is available here: http://expat.cvs.sourceforge.net/viewvc ... rence.html

Here is a quick example:

Code:

; Expat returns UTF8-Strings in ascii mode and unicode strings in unicode mode
CompilerIf #PB_Compiler_Unicode
  Macro PeekExpat(ptr)
    PeekS(ptr)
  EndMacro
CompilerElse
  Macro PeekExpat(ptr)
    PeekS(ptr, -1, #PB_UTF8)
  EndMacro
CompilerEndIf


ProcedureC StartElementHandler(user_data, *name, *args)
  Debug "Start: " + PeekExpat(*name)
  
  ; Attribute values are an array of pointers with alternating name and value entries
  ; Terminated by null pointer
  *arg.INTEGER = *args
  While *arg\i <> 0
    Name$ = PeekExpat(*arg\i)
    *arg + SizeOf(Integer)
    Value$ = PeekExpat(*arg\i)
    *arg + SizeOf(Integer)
    Debug "             " + Name$ + "=" + Value$
  Wend  
EndProcedure

ProcedureC EndElementHandler(user_data, *name)
  Debug "End: " + PeekExpat(*name)
EndProcedure



If ReadFile(0, "c:\test.xml")

  ; initialize parser
  Parser = pb_XML_ParserCreate_(0)
  
  pb_XML_SetStartElementHandler_(Parser, @StartElementHandler())
  pb_XML_SetEndElementHandler_(Parser, @EndElementHandler())
  
  ; Block size for streaming. This is deliberately tiny for the example; use something larger, like 1 MB, for real files!
  BufferSize = 20
  *Buffer = AllocateMemory(BufferSize)

  While Not Eof(0)
    BytesRead = ReadData(0, *Buffer, BufferSize)
    If BytesRead > 0
      If pb_XML_Parse_(Parser, *Buffer, BytesRead, #False) = #XML_STATUS_ERROR
        ; parser error (message is in ascii)
        Debug "Parser Error (Line " + Str(pb_XML_GetCurrentLineNumber_(Parser)) + "): " + PeekS(pb_XML_ErrorString_(pb_XML_GetErrorCode_(Parser)), -1, #PB_Ascii)
        Break
      EndIf
    EndIf
  Wend
  
  ; important: finish the parsing process
  pb_XML_Parse_(Parser, *Buffer, 0, #True)
  pb_XML_ParserFree_(Parser)
  
  FreeMemory(*Buffer)  
  
  CloseFile(0)
Else
  Debug "Cannot open file"
EndIf
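
For the OpenStreetMap case from the first post, the same callbacks can filter while parsing instead of printing everything. The following is only a sketch under assumptions: it assumes the usual OSM layout where a `<way>` element contains `<tag k="building" .../>` children, the handler and variable names are made up for illustration, and it parses a small in-memory sample in 16-byte chunks. For a real file, the chunk loop would be replaced by the ReadData() loop shown above.

```purebasic
; Same helper as in the example above
CompilerIf #PB_Compiler_Unicode
  Macro PeekExpat(ptr)
    PeekS(ptr)
  EndMacro
CompilerElse
  Macro PeekExpat(ptr)
    PeekS(ptr, -1, #PB_UTF8)
  EndMacro
CompilerEndIf

; State shared between the callbacks (hypothetical names)
Global InWay, IsBuilding, BuildingWays

ProcedureC OsmStartHandler(user_data, *name, *args)
  Protected Name$ = PeekExpat(*name)
  Protected *arg.INTEGER = *args
  Protected Key$, Value$

  If Name$ = "way"
    InWay = #True
    IsBuilding = #False
  ElseIf Name$ = "tag" And InWay
    ; walk the name/value pairs and look for k="building"
    While *arg\i <> 0
      Key$ = PeekExpat(*arg\i)
      *arg + SizeOf(Integer)
      Value$ = PeekExpat(*arg\i)
      *arg + SizeOf(Integer)
      If Key$ = "k" And Value$ = "building"
        IsBuilding = #True
      EndIf
    Wend
  EndIf
EndProcedure

ProcedureC OsmEndHandler(user_data, *name)
  If PeekExpat(*name) = "way"
    If IsBuilding
      BuildingWays + 1   ; a complete building way has been seen
    EndIf
    InWay = #False
  EndIf
EndProcedure

; Small made-up sample instead of a real OSM file
Xml$ = "<osm><way id='1'><nd ref='10'/><tag k='building' v='yes'/></way>"
Xml$ + "<way id='2'><tag k='highway' v='residential'/></way></osm>"
*Xml = UTF8(Xml$)
Length = StringByteLength(Xml$, #PB_UTF8)

Parser = pb_XML_ParserCreate_(0)
pb_XML_SetStartElementHandler_(Parser, @OsmStartHandler())
pb_XML_SetEndElementHandler_(Parser, @OsmEndHandler())

; feed the data in small chunks to imitate block-wise reading
Offset = 0
While Offset < Length
  Size = 16
  If Offset + Size > Length
    Size = Length - Offset
  EndIf
  If pb_XML_Parse_(Parser, *Xml + Offset, Size, #False) = #XML_STATUS_ERROR
    Debug "Parser error in line " + Str(pb_XML_GetCurrentLineNumber_(Parser))
    Break
  EndIf
  Offset + Size
Wend

pb_XML_Parse_(Parser, *Xml, 0, #True)   ; finish the parsing process
pb_XML_ParserFree_(Parser)
FreeMemory(*Xml)

Debug "Building ways: " + Str(BuildingWays)
```

Only the counters stay in memory here; nothing from the file is kept after its callback returns, which is the point of using the streaming parser for a file of this size.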

Post by dige »

Thank you freak, it fits my needs :-)

Post by said »

Thank you, freak, for this trick. Very helpful (and much needed when dealing with big XML). Small question: is this cross-platform?

Post by freak »

> is this cross platform?

Yes.