Formatting XML files

Post by **Fluid Byte** » Wed Mar 11, 2009 11:54 pm

I want to bring XML files created in PB into a specific form, but the results are rather disappointing.

This is the form that PB creates by default:

```xml
<?xml version="1.0" encoding="UTF-8"?><root><item id="1">Wheelchair</item><item id="2">Skyscraper</item><item id="3">Computer</item></root>
```

This is the result after formatting, which is not good at all. The text in the <item> nodes includes spaces and line breaks.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<root>
  <item id="1">
    Wheelchair
  </item>
  <item id="2">
    Skyscraper
  </item>
  <item id="3">
    Computer
  </item>
</root>
```

And this is how it should be:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
	<item id="1">Wheelchair</item>
	<item id="2">Skyscraper</item>
	<item id="3">Computer</item>
</root>
```

Have a look at this snippet, specifically the output of the third text field:

```purebasic
CreateXML(0)
Main = CreateXMLNode(RootXMLNode(0))
SetXMLNodeName(Main,"root")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item,"item")
SetXMLNodeText(Item,"Wheelchair")
SetXMLAttribute(Item,"id","1")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item,"item")
SetXMLNodeText(Item,"Skyscraper")
SetXMLAttribute(Item,"id","2")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item,"item")
SetXMLNodeText(Item,"Computer")
SetXMLAttribute(Item,"id","3")

SaveXML(0,GetTemporaryDirectory() + "test1.xml")
LoadXML(1,GetTemporaryDirectory() + "test1.xml")
FormatXML(1,#PB_XML_ReFormat)
SaveXML(1,GetTemporaryDirectory() + "test2.xml")

ReadFile(0,GetTemporaryDirectory() + "test1.xml")
lpBuffer1 = AllocateMemory(Lof(0))
ReadData(0,lpBuffer1,Lof(0))
CloseFile(0)

ReadFile(0,GetTemporaryDirectory() + "test2.xml")
lpBuffer2 = AllocateMemory(Lof(0))
ReadData(0,lpBuffer2,Lof(0))
CloseFile(0)

OpenWindow(0,0,0,400,380,"void",#PB_Window_SystemMenu | #PB_Window_ScreenCentered)
TextGadget(0,5,5,200,20,"UNFORMATTED:")
EditorGadget(1,5,25,390,100)
TextGadget(2,5,130,200,20,"FORMATTED:")
EditorGadget(3,5,150,390,100)
TextGadget(4,5,255,205,20,"HOW IT SHOULD BE:")
EditorGadget(5,5,275,390,100)

SetGadgetText(1,PeekS(lpBuffer1))
SetGadgetText(3,PeekS(lpBuffer2))

Macro DQ : + Chr(34) + : EndMacro

XML$ = "<?xml version="DQ"1.0"DQ" encoding="DQ"UTF-8"DQ"?>" + #CRLF$
XML$ + "<root>" + #CRLF$
XML$ + #TAB$ + "<item id="DQ"1"DQ">Wheelchair</item>" + #CRLF$
XML$ + #TAB$ + "<item id="DQ"2"DQ">Skyscraper</item>" + #CRLF$
XML$ + #TAB$ + "<item id="DQ"3"DQ">Computer</item>" + #CRLF$
XML$ + "</root>"

SetGadgetText(5,XML$)

While WaitWindowEvent() ! #PB_Event_CloseWindow : Wend
```

Post by **Rescator** » Thu Mar 12, 2009 4:30 am

That is mostly a presentation issue. I'm not sure about XML, but as in HTML I believe that all leading and trailing whitespace is ignored or trimmed, so both of those results are actually valid per the standard.
Obviously it messes up binary or string compares unless you trim them first.
PS! If you need leading or trailing spaces, HTML has the non-breaking space and I suspect XML has something similar.

Post by **pdwyer** » Thu Mar 12, 2009 6:21 am

Agree, XML has quoting too if leading or trailing white space is needed.

Post by **Kiffi** » Thu Mar 12, 2009 8:40 am

> Rescator wrote: That is mostly a presentation issue

Yes, of course, but I think this is a bug (not sure if from PB or from Expat) because FormatXML changes the nodetext data:

```xml
<item id="1">Wheelchair</item>
```

```xml
<item id="1">
  Wheelchair
</item>
```

The nodetext of the second one contains an extra linebreak and several spaces.

And reformatting with #PB_XML_CutNewline, #PB_XML_ReduceNewline, #PB_XML_CutSpace and #PB_XML_ReduceSpace is not the solution.

IMO the nodetext may not be reformatted.

Greetings ... Kiffi

// Edit:

Here is the COMate-Way:

```purebasic
IncludePath #PB_Compiler_Home + "\srod\comate"
XIncludeFile "comate.pbi"

EnableExplicit

Procedure.s BeautifyXml(XML.s)

  Protected ReturnString.s
  Protected XSL.s

  XSL = "<?xml version='1.0'?>"
  XSL + "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
  XSL + "<xsl:output method='xml' omit-xml-declaration='no' indent='yes'/>"
  XSL + "</xsl:stylesheet>"

  Protected MsXmlObject.COMateObject
  Protected MsXslObject.COMateObject

  MsXmlObject = COMate_CreateObject("MSXML.DOMDocument")
  MsXslObject = COMate_CreateObject("MSXML.DOMDocument")

  If MsXmlObject And MsXslObject

    XML = ReplaceString(XML, "><", ">" + #CRLF$ + "<")
    XML = ReplaceString(XML, "'", "$0027")

    MsXmlObject\Invoke("LoadXml('" + XML + "')")

    MsXslObject\Invoke("LoadXml('" + XSL + "')")

    MsXmlObject\Invoke("transformNodeToObject(" + Str(MsXslObject) + " As COMateObject, " + Str(MsXmlObject) + " ByRef )")

    ReturnString = MsXmlObject\GetStringProperty("Xml")

    MsXslObject\Release()
    MsXmlObject\Release()

  EndIf

  ProcedureReturn ReturnString

EndProcedure

Define XML.s

XML = "<?xml version=" + Chr(34) + "1.0" + Chr(34) + " encoding=" + Chr(34) + "UTF-8" + Chr(34) + "?><root><item id=" + Chr(34) + "1" + Chr(34) + ">Wheelchair</item><item id=" + Chr(34) + "2" + Chr(34) + ">Skyscraper</item><item id=" + Chr(34) + "3" + Chr(34) + ">Computer</item></root>"

Define PbXmlObject

PbXmlObject = CatchXML(#PB_Any, @XML, Len(XML))
; FormatXML(PbXmlObject, #PB_XML_ReduceNewline)
XML = Space(ExportXMLSize(PbXmlObject))
ExportXML(PbXmlObject, @XML, Len(XML))
FreeXML(PbXmlObject)

MessageRequester("Before:", XML)

MessageRequester("After:", BeautifyXml(XML))
```

Note that this solution is not perfect. If a CDATA section contains '><', its nodetext will also be changed in a way that is not allowed. But with a clever ReplaceRegularExpression() (instead of the simple ReplaceString()) there is a chance to fix this.

Post by **freak** » Thu Mar 12, 2009 1:40 pm

Expat only does the initial document parsing. Everything else is the PB lib.

> IMO the nodetext may not be reformatted.

Strictly speaking, everything within the main node that is not in <> is node text, including whitespace and newlines. So if you do not want to modify any of this, you cannot reformat at all.

Whitespace is actually considered part of the node, which is why the XML lib does not cut/reduce any whitespace/newline when parsing a document. FormatXML() is the only function that changes the XML content like this.
http://www.w3.org/TR/REC-xml/#sec-white-space

You should also note that even if you format the XML the way you like, the user can always edit the file and add some newlines in unexpected places. Your program shouldn't fail if that happens. You either need to be prepared for that to happen, or make it clear to your users that newline is important in your XML documents (which it usually isn't).

Anyway, if you can give me a clear set of rules by which to format the document then I can try to implement that for the next version.
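
As a rough illustration of being prepared for stray whitespace, the sketch below trims leading and trailing spaces, tabs and line breaks from each node text after loading. `TrimXMLText()` is a hypothetical helper, not part of the PB XML library, and the file name test2.xml is simply the reformatted file from the first post.

```purebasic
; Hypothetical helper (not part of the PB XML library): strips leading and
; trailing spaces, tabs, CR and LF, leaving whitespace inside the text alone.
Procedure.s TrimXMLText(Text$)
  Protected WS$ = " " + #TAB$ + #CR$ + #LF$
  Protected First = 1, Last = Len(Text$)

  ; Skip whitespace at the start
  While First <= Last And FindString(WS$, Mid(Text$, First, 1), 1) > 0
    First + 1
  Wend

  ; Skip whitespace at the end
  While Last >= First And FindString(WS$, Mid(Text$, Last, 1), 1) > 0
    Last - 1
  Wend

  If Last < First
    ProcedureReturn ""   ; empty or whitespace-only text
  EndIf

  ProcedureReturn Mid(Text$, First, Last - First + 1)
EndProcedure

; Assumes test2.xml (the reformatted file from the first post) exists.
If LoadXML(1, GetTemporaryDirectory() + "test2.xml")
  Item = ChildXMLNode(MainXMLNode(1))
  While Item
    Debug GetXMLAttribute(Item, "id") + ": '" + TrimXMLText(GetXMLNodeText(Item)) + "'"
    Item = NextXMLNode(Item)
  Wend
  FreeXML(1)
EndIf
```

This only makes sense when surrounding whitespace carries no meaning in the document, which is the usual case described above.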

Post by **Fluid Byte** » Thu Mar 12, 2009 4:19 pm

Even if whitespace and newlines are valid as node text, why is the structure of my document mangled like that? Shouldn't it be preserved? My document almost doubles in size just because of redundant spaces and line breaks.

And it gets worse. Besides the bloated document you are actually parsing the document twice now because you have to remove linebreaks and spaces yourself. I'm working with very large databases so it would/does slow down the process of reading the XML significantly.

> freak wrote: Anyway, if you can give me a clear set of rules by which to format the document then I can try to implement that for the next version.

No clue what you expect us to say. I just want it to be optional that the reformatting injects unwanted linebreaks and spaces into my node text.

[edit:]
Maybe you can specify that only after end-tags a newline is forced.
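
One possible reading of such a rule, sketched as a small hand-written serializer rather than as a change to FormatXML(): every element goes on its own line, but the text of an element without child elements stays on the same line as its tags. `EscapeXML()` and `CompactXML()` are hypothetical names. The sketch handles only normal element nodes, drops any text that sits next to child elements, does not emit the XML declaration, and assumes test1.xml from the first post exists.

```purebasic
; Hypothetical helper: escapes characters that must not appear literally
; in XML text or in double-quoted attribute values.
Procedure.s EscapeXML(Text$)
  Text$ = ReplaceString(Text$, "&", "&amp;")   ; must be replaced first
  Text$ = ReplaceString(Text$, "<", "&lt;")
  Text$ = ReplaceString(Text$, ">", "&gt;")
  Text$ = ReplaceString(Text$, Chr(34), "&quot;")
  ProcedureReturn Text$
EndProcedure

; Hypothetical serializer: one element per line, tab indentation,
; leaf text kept inline so no whitespace is injected into it.
Procedure.s CompactXML(*Node, Level = 0)
  Protected Result$, *Child
  Protected Indent$ = RSet("", Level, #TAB$)
  Protected Tag$ = GetXMLNodeName(*Node)

  ; Start tag with attributes
  Result$ = Indent$ + "<" + Tag$
  If ExamineXMLAttributes(*Node)
    While NextXMLAttribute(*Node)
      Result$ + " " + XMLAttributeName(*Node) + "=" + Chr(34) + EscapeXML(XMLAttributeValue(*Node)) + Chr(34)
    Wend
  EndIf

  If XMLChildCount(*Node) = 0
    ; Leaf element: text stays on the same line as the tags
    Result$ + ">" + EscapeXML(GetXMLNodeText(*Node)) + "</" + Tag$ + ">" + #CRLF$
  Else
    ; Container element: children go on their own lines
    Result$ + ">" + #CRLF$
    *Child = ChildXMLNode(*Node)
    While *Child
      If XMLNodeType(*Child) = #PB_XML_Normal
        Result$ + CompactXML(*Child, Level + 1)
      EndIf
      *Child = NextXMLNode(*Child)
    Wend
    Result$ + Indent$ + "</" + Tag$ + ">" + #CRLF$
  EndIf

  ProcedureReturn Result$
EndProcedure

If LoadXML(0, GetTemporaryDirectory() + "test1.xml")
  Debug CompactXML(MainXMLNode(0))
  FreeXML(0)
EndIf
```

Building the whole document in one string is fine for small files; for very large documents it would be better to write each line to the file as it is produced.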

> freak wrote: You should also note that even if you format the XML the way you like, the user can always edit the file and add some newlines in unexpected places.

Generally you are right but ...

My program, my rules + reasons above (bloat & speed)

Post by **freak** » Thu Mar 12, 2009 5:42 pm

Doh! I was just trying to help :roll:

Nobody forces you to use the function. It's a convenience function to improve readability, that's all. The XML output is valid even without it.

Post by **Rescator** » Thu Mar 12, 2009 5:49 pm

Like a "compact" mode flag?

Post by **pdwyer** » Thu Mar 12, 2009 5:51 pm

FB, what if you change the way you use XML (not sure if you have this option).

When MS uses XML, the convention they tend to adhere to (and this is not an XML rule, nor even a de facto standard) is this: if you think of your XML file as a tree view, elements are folders and attributes are leaf nodes.

Think of the registry display in regedit: folders are elements, and in an element you can have further elements and attributes.

So, an element never contains just a value like you have; it contains further elements or attributes, and the data in the attributes is all quoted.

I don't mean to say it's wrong not to do it like this, but sometimes it helps to have a standard on how to use something like XML, just like you have in your head a standard on how you write code, name variables etc.

Maybe MS created their standard because they had the same issue that you have and wanted to lock up the leaf data better
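
To make the convention concrete, here is how the list from the first post might look with the values moved into attributes. The attribute name `name` is made up for this illustration and does not appear anywhere in the thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
	<item id="1" name="Wheelchair"/>
	<item id="2" name="Skyscraper"/>
	<item id="3" name="Computer"/>
</root>
```

Since re-indentation only adds whitespace between tags, the quoted attribute values should come through a formatter unchanged.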

Post by **Rescator** » Thu Mar 12, 2009 5:56 pm

You know, after looking at the test code I realized two things.

1. All three outputs are valid.
2. Your code makes the output really messed up with unicode enabled.
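
Point 2 presumably refers to the test code calling PeekS() without a format flag: in a unicode executable PeekS() would then interpret the UTF-8 bytes of the file as UTF-16. The buffers are also exactly as large as the files, so there is no terminating null for PeekS() to stop at. A minimal sketch of a reader that avoids both, with `ReadUTF8File()` being a hypothetical helper name:

```purebasic
; Hypothetical helper: reads a whole UTF-8 file into a string,
; independent of whether the program is compiled in unicode mode.
Procedure.s ReadUTF8File(FileName$)
  Protected Text$, Size, *Buffer
  Protected File = ReadFile(#PB_Any, FileName$)

  If File
    Size = Lof(File)
    ; One extra byte: AllocateMemory() zero-fills the buffer, so the
    ; data is guaranteed to be followed by a terminating null.
    *Buffer = AllocateMemory(Size + 1)
    If *Buffer
      ReadData(File, *Buffer, Size)
      Text$ = PeekS(*Buffer, -1, #PB_UTF8)   ; decode explicitly as UTF-8
      FreeMemory(*Buffer)
    EndIf
    CloseFile(File)
  EndIf

  ProcedureReturn Text$
EndProcedure

; Assumes test1.xml from the first post exists.
Debug ReadUTF8File(GetTemporaryDirectory() + "test1.xml")
```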

Post by **Little John** » Thu Mar 12, 2009 6:24 pm

> freak wrote: Anyway, if you can give me a clear set of rules by which to format the document then I can try to implement that for the next version.

How about the way Thorsten does it here?
(Some optimization is possible.)

Regards, Little John

Post by **Fluid Byte** » Thu Mar 12, 2009 6:25 pm

> freak wrote: Doh! I was just trying to help :roll:

Which I appreciate, no need for rolling eyes.

> pdwyer wrote: So, an element never contains just a value like you have; it contains further elements or attributes, and the data in the attributes is all quoted.

In my case it does. The document structure you see is the structure that will be used, exactly.

> pdwyer wrote: I don't mean to say it's wrong not to do it like this, but sometimes it helps to have a standard on how to use something like XML, just like you have in your head a standard on how you write code, name variables etc.

If this "standard" involves bloating the document and slowing down the process of parsing it, I'm not interested at all.

> Rescator wrote: 1. All three outputs are valid.

I know. Your point?

> Rescator wrote: 2. Your code makes the output really messed up with unicode enabled.

This has nothing to do with my problem but thanks for letting me know ...

Post by **Fluid Byte** » Thu Mar 12, 2009 6:27 pm

> Little John wrote: How about the way Thorsten does it here?
> (Some optimization is possible.)

Thanks for spotting this one! So I'm not alone with my problem.

Post by **pdwyer** » Fri Mar 13, 2009 1:52 am

> Fluid Byte wrote:
> > pdwyer wrote: So, an element never contains just a value like you have; it contains further elements or attributes, and the data in the attributes is all quoted.
>
> In my case it does. The document structure you see is the structure that will be used, exactly.

I was wondering whether it has to, though, for some compatibility purpose, because there are other ways of doing it that don't have the problem you are having.

> Fluid Byte wrote:
> > pdwyer wrote: I don't mean to say it's wrong not to do it like this, but sometimes it helps to have a standard on how to use something like XML, just like you have in your head a standard on how you write code, name variables etc.
>
> If this "standard" involves bloating the document and slowing down the process of parsing it, I'm not interested at all.

I don't understand where this comes from at all. It's just a way of using XML so that you never have the whitespace problem in formatting, and won't need to worry about a second pass to fix formatted code, nor about users creating compatibility problems by resaving the XML with some other editor. Where is the bloat above and beyond what you already have? Nor can I see how the speed would get worse.

And if you don't want bloat or slow, why use XML at all? It exists because it's all text and human readable and editable, not because it's fast or compact.

Post by **Little John** » Fri Mar 13, 2009 7:07 am

> freak wrote: You should also note that even if you format the XML the way you like, the user can always edit the file and add some newlines in unexpected places. Your program shouldn't fail if that happens. You either need to be prepared for that to happen, or make it clear to your users that newline is important in your XML documents (which it usually isn't).

It's a different situation. We can tell the user something like: "This is a machine-generated file. Do not change it manually.".

But when we just write XML to a file, and then re-read that file, the contents of the file should be the same, no? With the following code, this is not the case, and that's the problem IMHO:

```purebasic
;-- write XML to file
CreateXML(0)
Main = CreateXMLNode(RootXMLNode(0))
SetXMLNodeName(Main, "root")

Item = CreateXMLNode(Main)
SetXMLNodeName(Item, "item")
SetXMLNodeText(Item, "Wheelchair")
SetXMLAttribute(Item, "id", "1")

FormatXML(0, #PB_XML_ReFormat)   ; ** This command changes the text of the
                                 ;    nodes, which should not happen! **

SaveXML(0, GetTemporaryDirectory() + "test1.xml")
FreeXML(0)

;-- re-read XML from the file
LoadXML(1, GetTemporaryDirectory() + "test1.xml")
Item = ChildXMLNode(MainXMLNode(1))
While Item
   Debug GetXMLAttribute(Item, "id") + ": '" + GetXMLNodeText(Item) + "'"
   Item = NextXMLNode(Item)
Wend
FreeXML(1)
```

I agree with Kiffi: FormatXML() should only change the format, but must not change the content! So this is IMHO a bug in FormatXML().

Regards, Little John