Save formatted XML

Share your advanced PureBasic knowledge/code with the community.
Little John
Addict
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Save formatted XML

Post by Little John »

Works also with PB 5.20

Hi all,

in the current PB version 4.30, FormatXML() does not work like several people expect, see e.g. this thread.
So here is an alternative way of saving formatted XML to a file.
The code of the prodedure SaveFormattedXml() was modified and improved by me after code by Thorsten.

SaveFormattedXml() preserves leading and trailing whitespace in the text of the XML nodes.

Enjoy!
Little John

//edit 2009-03-18:
- Added: Support for empty elements of the form <item/>.
- Added: Support for node type #PB_XML_Instruction.
- Changed: Simplified the code.

//edit 2009-08-09:
- Fixed: Leading or trailing whitespace is not removed from the text of the XML nodes anymore.
- Fixed: Code changed, so that it now also works as part of Unicode executables.
- Added: Improved error handling.
- Added: Option for saving the respective BOM to the output file.
- Changed: Demo code slightly altered.

Code: Select all

Procedure.i SaveFormattedXml (xmlId, xmlFile$, flags=0, indentStep=3)
   Protected *buffer, encoding, size, ofn, Lpos, Rpos, indent=0
   Protected xml$, prevLeft$, prevRight$, txt$, curTag$

   ; Initialize
   If IsXML(xmlId) = 0
      ProcedureReturn 0                                                 ; error
   EndIf

   encoding = GetXMLEncoding(xmlId)
   size = ExportXMLSize(xmlId)
   *buffer = AllocateMemory(size)
   If *buffer = 0
      ProcedureReturn 0                                                 ; error
   EndIf

   If ExportXML(xmlId, *buffer, size) = 0
      FreeMemory(*buffer)
      ProcedureReturn 0                                                 ; error
   EndIf

   xml$ = PeekS(*buffer, -1, encoding)
   FreeMemory(*buffer)

   ofn = CreateFile(#PB_Any, xmlFile$)
   If ofn = 0
      ProcedureReturn 0                                                 ; error
   EndIf

   If flags & #PB_XML_StringFormat
      WriteStringFormat(ofn, encoding)
   EndIf

   ; Get and write XML declaration
   Lpos = FindString(xml$, "<", 1)
   Rpos = FindString(xml$, ">", Lpos) + 1
   curTag$ = Mid(xml$, Lpos, Rpos-Lpos)
   WriteString(ofn, curTag$, encoding)

   ; Get and write the other elements
   Lpos = FindString(xml$, "<", Rpos)
   While Lpos
      prevLeft$  = Left(curTag$, 2)
      prevRight$ = Right(curTag$, 2)

      txt$ = Mid(xml$, Rpos, Lpos-Rpos)

      If Mid(xml$, Lpos, 9) = "<![CDATA["
         Rpos = FindString(xml$, "]]>", Lpos) + 3
      Else
         Rpos = FindString(xml$, ">", Lpos) + 1
      EndIf
      curTag$ = Mid(xml$, Lpos, Rpos-Lpos)

      If FindString("</<!<?", prevLeft$, 1) = 0 And prevRight$ <> "/>"
         If Left(curTag$, 2) = "</"                                     ; <tag>text</tag>
            WriteString(ofn, txt$ + curTag$, encoding)
         Else                                                           ; <tag1><tag2>
            indent + indentStep
            WriteString(ofn, #LF$ + Space(indent) + curTag$, encoding)
         EndIf
      Else
         If Left(curTag$, 2) = "</"                                     ; </tag2>text</tag1>
            If Len(txt$)
               WriteString(ofn, #LF$ + Space(indent) + txt$, encoding)
            EndIf
            indent - indentStep
         EndIf
         WriteString(ofn, #LF$ + Space(indent) + curTag$, encoding)
      EndIf

      Lpos = FindString(xml$, "<", Rpos)
   Wend

   CloseFile(ofn)
   ProcedureReturn 1                                                    ; success
EndProcedure

Code: Select all

;-- Demo
EnableExplicit

; XIncludeFile "XML.pbi"
XIncludeFile "neu.pbi"

Define Xml, Main, Item, SubItem

Xml = CreateXML(#PB_Any, #PB_UTF8)
Main = CreateXMLNode(RootXMLNode(Xml))
SetXMLNodeName(Main, "root")
SetXMLNodeText(Main, "XML zoo")

Item = CreateXMLNode(Main, -1, #PB_XML_CData)
SetXMLNodeText(Item, "Did you know: 5 > 3 ?")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item, "item")
SetXMLAttribute(Item, "id", "1")
SetXMLNodeText(Item, "Cat")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item, "item")
SetXMLAttribute(Item, "id", "3")
SetXMLNodeText(Item, "Bird")

SubItem = CreateXMLNode(Item)
SetXMLNodeName(SubItem, "subitem")
SetXMLAttribute(SubItem, "id", "4")
SetXMLNodeText(SubItem, "Cuckoo")

SubItem = CreateXMLNode(Item, -1, #PB_XML_Comment)
SetXMLNodeText(SubItem, " Lives in clocks in the Black Forest ")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item, "item")
SetXMLAttribute(Item, "id", "5")

Item = CreateXMLNode(Main, -1, #PB_XML_Instruction)
SetXMLNodeName(Item, "php")
SetXMLNodeText(Item, "...")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item, "item")
SetXMLAttribute(Item, "id", "6")
SetXMLNodeText(Item, " ")

If SaveFormattedXml(Xml, "animals.xml") = 0
   MessageRequester("Error", "Can't save XML file.")
EndIf
FreeXML(Xml)
The output is:

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <![CDATA[Did you know: 5 > 3 ?]]>
   <item id="1">Cat</item>
   <item id="3">
      <subitem id="4">Cuckoo</subitem>
      <!-- Lives in clocks in the Black Forest -->
      Bird
   </item>
   <item id="5"/>
   <?php ...?>
   <item id="6"> </item>
   XML zoo
</root>
Last edited by Little John on Sun Aug 18, 2013 10:36 pm, edited 6 times in total.
User avatar
Fluid Byte
Addict
Addict
Posts: 2336
Joined: Fri Jul 21, 2006 4:41 am
Location: Berlin, Germany

Post by Fluid Byte »

I have to do same speed tests on large files but for now it's doing the job quite well. :)
Windows 10 Pro, 64-Bit / Whose Hoff is it anyway?
Little John
Addict
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John »

Fluid Byte wrote:I have to do same speed tests on large files but for now it's doing the job quite well. :)
You probably didn't try hard enough. ;-)

I extended the demo code, and that uncovered a bug. The bug is fixed now, and the code of SaveFormattedXml() also is simpler and probably faster than before ... so it's ready for your speed tests now. ;-)

Regards, Little John
Little John
Addict
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John »

It's a slightly different situation, when there already is a ( not good looking ;-) ) XML file, and we want to convert it to a nicely formatted style.
The code is similar to the one above. It does not use any of PB's built-in XML commands, though.

For testing, you need an XML input file. Save e.g. this XML stuff from Perkin to a file named "old.xml".

Caution: In the text of XML nodes, leading and trailing whitespace might be removed!

//edit 2009-07-09:
- fixed two small glitches
- changed the code, so that it also works for Unicode executables

Code: Select all

EnableExplicit

#WHITESPACE$ = " " + #TAB$ + #CRLF$

Procedure.s TrimChars (source$, charList$=#WHITESPACE$)
   ; in : string
   ; out: string without any leading or trailing characters,
   ;      that are contained in 'charList$'
   Protected left, right, length=Len(source$)

   ; Trim left
   left = 1
   While (left <= length) And FindString(charList$, Mid(source$,left,1), 1)
      left + 1
   Wend

   ; Trim right
   right = length
   While (left < right) And FindString(charList$, Mid(source$,right,1), 1)
      right - 1
   Wend

   ProcedureReturn Mid(source$, left, right-left+1)
EndProcedure

Procedure.i FormatXMLfile (inFile$, outFile$, indentStep=3)
   Protected ifn, ofn, encoding, Lpos, Rpos, indent=0
   Protected size.q, bytes.q, *buffer
   Protected xml$, txt$, prevLeft$, prevRight$, curTag$

   ; Initialize
   ifn = ReadFile(#PB_Any, inFile$)
   If ifn = 0
      ProcedureReturn 0                                                 ; error
   EndIf

   encoding = ReadStringFormat(ifn)
   size = Lof(ifn) - Loc(ifn)
   *buffer = AllocateMemory(size + SizeOf(Character))
   If *buffer = 0
      CloseFile(ifn)
      ProcedureReturn 0                                                 ; error
   EndIf

   bytes = ReadData(ifn, *buffer, size)
   PokeC(*buffer + size, 0)
   CloseFile(ifn)
   If bytes <> size
      FreeMemory(*buffer)
      ProcedureReturn 0                                                 ; error
   EndIf

   xml$ = PeekS(*buffer, -1, encoding)
   FreeMemory(*buffer)

   ofn = CreateFile(#PB_Any, outFile$)
   If ofn = 0
      ProcedureReturn 0                                                 ; error
   EndIf

   WriteStringFormat(ofn, encoding)

   ; Get and write XML declaration
   Lpos = FindString(xml$, "<", 1)
   Rpos = FindString(xml$, ">", Lpos)
   curTag$ = Mid(xml$, Lpos, Rpos-Lpos+1)
   WriteString(ofn, curTag$, encoding)

   ; Get and write the other elements
   Lpos = FindString(xml$, "<", Rpos)
   While Lpos
      prevLeft$  = Left(curTag$, 2)
      prevRight$ = Right(curTag$, 2)

      txt$ = TrimChars(Mid(xml$, Rpos+1, Lpos-Rpos-1))

      If Mid(xml$, Lpos, 9) = "<![CDATA["
         Rpos = FindString(xml$, "]]>", Lpos) + 2
      Else
         Rpos = FindString(xml$, ">", Lpos)
      EndIf
      curTag$ = Mid(xml$, Lpos, Rpos-Lpos+1)

      If FindString("</<!<?", prevLeft$, 1) = 0 And prevRight$ <> "/>"
         If Left(curTag$, 2) = "</"                                     ; <tag>text</tag>
            WriteString(ofn, txt$ + curTag$, encoding)
         Else                                                           ; <tag1><tag2>
            indent + indentStep
            WriteString(ofn, #LF$ + Space(indent) + curTag$, encoding)
         EndIf
      Else
         If Left(curTag$, 2) = "</"                                     ; </tag2>text</tag1>
            If Len(txt$)
               WriteString(ofn, #LF$ + Space(indent) + txt$, encoding)
            EndIf
            indent - indentStep
         EndIf
         WriteString(ofn, #LF$ + Space(indent) + curTag$, encoding)
      EndIf

      Lpos = FindString(xml$, "<", Rpos)
   Wend

   CloseFile(ofn)
   ProcedureReturn 1                                                    ; success
EndProcedure


;-- Demo
Debug FormatXMLfile("old.xml", "new.xml")
With huge files, this code might not work well because it stores the whole input file into one chunk in memory. A more elegant solution would be, to read a "stream" of data from the input file, i.e. one small piece at a time.
Since there was a request for it, for now I mainly want to provide working code. Maybe me or someone else will improve it later.

Regards, Little John
Last edited by Little John on Sun Aug 09, 2009 6:12 pm, edited 2 times in total.
Little John
Addict
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Post by Little John »

Little John wrote:A more elegant solution would be, to read a "stream" of data from the input file, i.e. one small piece at a time.
The following code does do so. It contains the function XmlToken(), which is a basic lexer for XML files, and can also be used in any other context when dealing with XML files.

Caution: In the text of XML nodes, leading and trailing whitespace might be removed!

Regards, Little John

Code: Select all

EnableExplicit

;-----------------------------------------------------------------------

#WHITESPACE$ = " " + #TAB$ + #CRLF$

Procedure.i Verify (source$, charList$, start=1)
   ; -- Utility function for XmlToken()
   ; Searches for the first character in 'source$', that is not contained
   ; in 'charList$', beginning at position 'start';
   ; returns the position of that character, or 0 if nothing found
   Protected sourceEnd, charListEnd, found, *s.Character, *c.Character

   #step = SizeOf(Character)
   sourceEnd = @source$ + (Len(source$)-1)*#step
   charListEnd = @charList$ + (Len(charList$)-1)*#step

   For *s = @source$ + (start-1)*#step To sourceEnd Step #step
      found = #True
      For *c = @charList$ To charListEnd Step #step
         If *s\c = *c\c
            found = #False
            Break
         EndIf
      Next
      If found
         ProcedureReturn (*s - @source$)/#step + 1
      EndIf
   Next

   ProcedureReturn 0
EndProcedure


#XML_ChunkSize = 4096

Macro FillXmlBuffer
   bytes = ReadData(fileNo, *buffer, #XML_ChunkSize)
   PokeC(*buffer + bytes, 0)
   If bytes < #XML_ChunkSize
      lastChunk = #True
   EndIf
EndMacro

Enumeration 1         ; token types
   #XML_StartTag
   #XML_EndTag
   #XML_EmptyElement
   #XML_Comment
   #XML_CData
   #XML_Instruction
   #XML_Text
EndEnumeration

Procedure.i XmlToken (ifn, encoding, *token.String)
   ; -- Read next XML token from file 'ifn',
   ;    can only be used for one file at a time
   ; in : ifn         : number of file opened for reading on first call,
   ;                    0 on subsequent calls
   ; out: *token\s    : current token
   ;      return value: type of current token, see constants above
   ;                    ( 0 = normal end of file,
   ;                     -1 = error)
   Static fileNo, lastChunk, ret, posn, *buffer
   Static xml$
   Protected bytes, tokenStart, tokenEnd, tokenType

   ; When reading the first token
   If ifn > 0
      fileNo = ifn
      lastChunk = #False
      ret = 0
      *buffer = ReAllocateMemory(*buffer, #XML_ChunkSize + SizeOf(Character))
      If *buffer = 0
         ProcedureReturn -1                           ; error
      EndIf

      FillXmlBuffer
      xml$ = PeekS(*buffer, -1, encoding)
      posn = 0
   EndIf

   ; Find start of token
   tokenStart = Verify(xml$, #WHITESPACE$, posn+1)
   While tokenStart = 0
      If lastChunk = #True
         *token\s = ""
         ProcedureReturn ret
      EndIf

      FillXmlBuffer
      xml$ = PeekS(*buffer, -1, encoding)
      tokenStart = Verify(xml$, #WHITESPACE$, 1)
   Wend

   ; Find end and type of token
   Repeat
      If Mid(xml$, tokenStart, 1) <> "<"
         tokenType = #XML_Text
         posn = FindString(xml$, "<", tokenStart+1) - 1
         If posn > -1
            tokenEnd = posn
            ; Trim right:
            While tokenStart <= tokenEnd And FindString(#WHITESPACE$, Mid(xml$, tokenEnd, 1), 1) <> 0
               tokenEnd - 1
            Wend
            Break
         EndIf
      ElseIf Mid(xml$, tokenStart, 4) = "<!--"
         tokenType = #XML_Comment
         posn = FindString(xml$, "-->", tokenStart+1) + 2
         If posn > 2
            tokenEnd = posn
            Break
         EndIf
      ElseIf Mid(xml$, tokenStart, 9) = "<![CDATA["
         tokenType = #XML_CData
         posn = FindString(xml$, "]]>", tokenStart+1) + 2
         If posn > 2
            tokenEnd = posn
            Break
         EndIf
      ElseIf Mid(xml$, tokenStart, 2) = "<?"
         tokenType = #XML_Instruction
         posn = FindString(xml$, "?>", tokenStart+1) + 1
         If posn > 1
            tokenEnd = posn
            Break
         EndIf
      Else
         If Mid(xml$, tokenStart+1, 1) = "/"
            tokenType = #XML_EndTag
         Else
            tokenType = #XML_StartTag
         EndIf
         posn = FindString(xml$, ">", tokenStart+1)
         If posn > 0
            If Mid(xml$, posn-1, 1) = "/"
               tokenType = #XML_EmptyElement
            EndIf
            tokenEnd = posn
            Break
         EndIf
      EndIf

      If lastChunk = #True
         ret = -1                                     ; error: end of token not found
         *token\s = Mid(xml$, tokenStart)
         ProcedureReturn tokenType
      EndIf

      FillXmlBuffer
      xml$ = Mid(xml$, tokenStart) + PeekS(*buffer, -1, encoding)
      tokenStart = 1
   ForEver

   *token\s = Mid(xml$, tokenStart, tokenEnd-tokenStart+1)
   ProcedureReturn tokenType
EndProcedure

;-----------------------------------------------------------------------

Procedure.i FormatXMLfile (infile$, outfile$, indentStep=3)
   ; in : infile$     : input file
   ;      outfile$    : output file
   ;      indentStep  : number of blanks used for one step of indenting
   ; out: return value:  0 on success,
   ;                    -1 on error
   Protected ifn, ofn, encoding, indent, prevTag, tokenType
   Protected token.String

   ; Initialize
   ifn = ReadFile(#PB_Any, infile$)
   If ifn = 0
      ProcedureReturn -1   ; error
   EndIf

   ofn = CreateFile(#PB_Any, outfile$)
   If ofn = 0
      CloseFile(ifn)
      ProcedureReturn -1   ; error
   EndIf

   encoding = ReadStringFormat(ifn)
   WriteStringFormat(ofn, encoding)

   ; Read and write XML tokens
   indent = 0
   prevTag = 0
   tokenType = XmlToken(ifn, encoding, token)    ; read first token
   While tokenType > 0
      If prevTag = #XML_StartTag
         If tokenType <> #XML_Text And tokenType <> #XML_EndTag
            indent + indentStep
            WriteString(ofn, #LF$ + Space(indent), encoding)
         EndIf
         WriteString(ofn, token\s, encoding)
      Else
         If tokenType = #XML_EndTag
            indent - indentStep
         EndIf
         If prevTag <> 0
            WriteString(ofn, #LF$, encoding)
         EndIf
         WriteString(ofn, Space(indent) + token\s, encoding)
      EndIf

      If tokenType <> #XML_Text
         prevTag = tokenType
      EndIf
      tokenType = XmlToken(0, encoding, token)   ; read next token
   Wend

   CloseFile(ifn)
   CloseFile(ofn)
   ProcedureReturn tokenType
EndProcedure


;-- Demo
Debug FormatXMLfile("old.xml", "new.xml")
User avatar
HwyStar
Enthusiast
Enthusiast
Posts: 101
Joined: Mon Apr 05, 2010 7:13 pm
Location: Reno, Nevada

Re: Save formatted XML

Post by HwyStar »

If you are still here Little John, Thanks so much for sharing this code!

I needed some XML files that are stored as one long string to be converted into CR+LF character lines and your procedure seems to have done the trick.

I will test it with huge XML files to make sure it is robust enough in a production environment and will report back here if I can break it.

Thanks again!
Little John
Addict
Addict
Posts: 4527
Joined: Thu Jun 07, 2007 3:25 pm
Location: Berlin, Germany

Re: Save formatted XML

Post by Little John »

Hi HwyStar,

thanks for your kind words!
You are welcome! I hope the code works well for you.

Best regards, LJ
Post Reply