PureBasic Forums - English

Posted: **Fri Mar 13, 2009 11:59 am**

Little John wrote:

FormatXML(0, #PB_XML_ReFormat)   ; ** This command changes the text of the
                                 ;    nodes, which should not happen! **

The whitespace is part of the node text. How do you expect to change the document layout without adding whitespace !?

The XML lib preserves the node text (even whitespace) when parsing and when saving, so you can save your document back exactly the way it was. The purpose of the FormatXML() function is to explicitly change the layout by adding whitespace. The function does exactly what it was designed to do. There is no bug at all.
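The same point can be shown outside PureBasic. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`, which is an assumption chosen for illustration and not the PureBasic XML library (`ET.indent` needs Python 3.9 or later). It suggests that parsing keeps the leading space, and that indenting has to store its new whitespace somewhere in the node text.

```python
import xml.etree.ElementTree as ET

SOURCE = '<root><item id="1"> Wheelchair</item><item id="2">Skyscraper</item></root>'

root = ET.fromstring(SOURCE)

# Parsing keeps whitespace that is part of the node text.
text_before = root[0].text        # ' Wheelchair'
root_text_before = root.text      # None: <root> has no text of its own yet

# Serializing without reformatting gives the document back unchanged.
round_trip = ET.tostring(root, encoding="unicode")

# Indenting inserts newlines and spaces, and they have to live somewhere:
# they become text of <root> and tail text after each <item>.
ET.indent(root, space="  ")
root_text_after = root.text       # '\n  '
formatted = ET.tostring(root, encoding="unicode")
print(formatted)
```

Note that `ET.indent` only rewrites whitespace-only text, so the `<item>` text itself stays as it was. FormatXML(), by the rules described in this thread, also puts a newline around every tag, so it goes further than this sketch.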

> I agree with Kiffi: FormatXML() only should change the format, but must not change the content! So this is IMHO a bug in FormatXML().

Then show me your magic way of changing the format without adding a single new character to the document. I'm really curious.

Posted: **Fri Mar 13, 2009 12:24 pm**

Little John wrote: FormatXML() only should change the format, but must not change the content! So this is IMHO a bug in FormatXML().

I disagree, it's no bug.
You are basically telling FormatXML to reformat it.
What Freak could do is add another flag for a compact mode. The issue, however, is that it would be very complicated to provide flags that match all the ways people might want it to "compact" the output.
1st-level nodes only, or 2nd and 3rd as well, etc.?
As was mentioned, FormatXML is just a convenience feature. If you do not like it, roll your own or use somebody else's if you have specific needs beyond the scope of FormatXML.

Also, if you read the link freak posted, you'll see that it's possible to specify either "default" or "preserve" with the xml:space attribute.
But if the attribute is missing, it's the same as default.

In other words what you want is preserve, but you also want preserve only on the <item> content, but not on the <root> content, and you'd probably want some other layout on <something> inside <item> nodes etc.
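To make the per-node idea concrete, here is a small sketch in Python (standard `xml.etree.ElementTree`, used only for illustration; the helper name `effective_space` is hypothetical). It resolves which whitespace mode applies to each element, assuming that xml:space is inherited by child elements until overridden and that a missing attribute at the top means "default".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root):
    """Map each element to 'default' or 'preserve', inheriting from its parent."""
    result = {}

    def visit(elem, inherited):
        mode = elem.get(XML_SPACE, inherited)
        result[elem] = mode
        for child in elem:
            visit(child, mode)

    visit(root, "default")   # a missing attribute at the top means "default"
    return result

doc = ET.fromstring(
    '<root>'
    '<item xml:space="preserve"> Wheelchair<something>x</something></item>'
    '<item>Skyscraper</item>'
    '</root>'
)
modes = effective_space(doc)
summary = [(e.tag, modes[e]) for e in doc.iter()]
print(summary)
```

Here the first `<item>` and its `<something>` child resolve to "preserve", while `<root>` and the second `<item>` stay on "default". A different layout per element type, as described above, would still need rules beyond what xml:space can express.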

You should be aware that even with the unformatted PureBasic output of:

Code: Select all

<?xml version="1.0" encoding="UTF-8"?><root><item id="1"> Wheelchair</item><item id="2">Skyscraper</item><item id="3">Computer</item></root>

(note the space in front of Wheelchair)
Any apps that read that in will most likely strip away that space; there is no guarantee that leading and trailing spaces are retained across apps.

Despite XML's intentions (compared to HTML), you still need to be liberal in what you read in. (XML was created to be strict in that regard, but sadly it isn't.)

You say you do not want the content of a node to be changed (leading/trailing space etc.), but you do want it to add a TAB in front of the item nodes. That in itself DOES change the content of the <root>, by adding a TAB that is not really part of the content.
It's basically a catch-22 dilemma.

I'm sure that freak could add a check for xml:space in FormatXML so it keeps nodes with the declaration untouched but re-formats the rest. Obviously this would bloat your XML a lot more, as the declaration would have to stay in the formatted output as well (to ensure other apps/tools do the same).

If you plan to let "humans" read and edit the output, then you should be prepared to read back in a lot of weird things (extra spaces, tabs, or even missing ones).
If it's machine-only, then keeping it tight and unformatted is really the best.

I'm sorry if this isn't helping, but the flaw is in XML and in how everyone treats/interprets the specification. W3C can't change this since it's "how things are done", which is probably the reason they added the xml:space extension to solve this issue.

You could always use CDATA though, I guess... Not an ideal solution either.

Posted: **Fri Mar 13, 2009 1:23 pm**

freak wrote: Then show me your magic way of changing the format without adding a single new character to the document. I'm really curious.

Well, FluidByte put an example in his very first post. The whitespace was removed in his example of how he thought it should look.

Sandwiching the data in between the tags like he showed would be better, wouldn't it?

Posted: **Fri Mar 13, 2009 7:43 pm**

> Well, FluidByte put an example in his very first post. The whitespace was removed in his example of how he thought it should look.

His example adds whitespace to the main node too. So it is "buggy" as well.

My offer still stands: if you can clearly define what you want, then I will try to implement it.

The current rules are:
- cut multiple consecutive newlines or spaces
- newline in front of and after every tag (opening and closing)
- add spaces for proper indentation

Keep in mind though that the function works both ways: It does not only add whitespace, it also removes it if there is more than is needed, and this is sometimes the harder part.
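One possible reading of the three rules above, sketched in Python for illustration. This is an interpretation and not the actual FormatXML() implementation: the function name `reformat`, collapsing whitespace runs to a single space, and trimming the ends of each text run are all assumptions. Namespaces, comments, and empty-element shorthand are ignored.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def reformat(elem, indent_step=2, level=0):
    """Return a list of output lines for elem, one tag or text run per line."""
    pad = " " * (indent_step * level)
    inner = " " * (indent_step * (level + 1))
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    lines = [f"{pad}<{elem.tag}{attrs}>"]            # rule 2: opening tag on its own line

    def add_text(text):
        # Rule 1: collapse runs of spaces/newlines; whitespace-only text is dropped.
        collapsed = re.sub(r"\s+", " ", text or "").strip()
        if collapsed:
            lines.append(inner + escape(collapsed))  # rule 3: indentation

    add_text(elem.text)
    for child in elem:
        lines.extend(reformat(child, indent_step, level + 1))
        add_text(child.tail)
    lines.append(f"{pad}</{elem.tag}>")              # rule 2: closing tag on its own line
    return lines

doc = ET.fromstring('<root><item id="1"> Wheelchair</item><item id="2">Skyscraper</item></root>')
output = "\n".join(reformat(doc))
print(output)
```

Because existing whitespace is collapsed before new whitespace is added, the result has the same shape whether the input was one long line or already messily indented, which is the consistency goal described in this post.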

Take this example:

Code: Select all

<a><b>Some Text
</b></a>

How do you reformat that ? Like this ?

Code: Select all

<a>
  <b>Some Text</b>
</a>

But how do you know if that newline in node b was important ? So do we only add/remove newline in nodes without non-whitespace text ? Will that generate the structure you want, or does this lead to more problematic cases ? Will the output not look very weird if the document is full of nodes like the above example ? Since the XML spec defines that only the application and not the parser/library can decide which whitespace is important, this is not a trivial thing to solve.
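For comparison, here is what the "only touch whitespace-only text" policy does to that example. The sketch uses Python's standard `ET.indent` (Python 3.9+), which happens to follow that policy; it is used here purely as an illustration and says nothing about how FormatXML() works internally.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b>Some Text\n</b></a>")
b_text = doc[0].text              # 'Some Text\n': the newline is part of the text

# ET.indent only rewrites text and tails that are empty or whitespace-only,
# so the newline inside <b> is left alone.
ET.indent(doc, space="  ")
conservative = ET.tostring(doc, encoding="unicode")
print(conservative)
```

The text of `<b>` survives untouched, but the closing `</b>` ends up at the start of a line with no indentation, which is the kind of odd-looking output the questions above are worried about.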

The rules I implemented add a bit too much whitespace, I know that. But at least the end result looks consistent no matter what the document looked like before it was reformatted, and that was my primary goal.

So the offer stands: if you can define a clear set of rules (not just "this is how it should look") that work well both for documents with very little whitespace before the call and for documents that looked pretty messed up before the call (because the function must handle these cases too), then I will do my best to implement them.

It's not as trivial as it looks, believe me.

Posted: **Sat Mar 14, 2009 1:37 am**

Code: Select all

If Element Contains Sub Element
    CRLF after open and close tag
Else  
    CRLF after close tag only
EndIf

Once this is done, tabs/spaces for indenting can be added using the same nesting rules as now, without landing in the data.
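The rule above could be sketched like this in Python (for illustration only; the function name `layout` is hypothetical, and the sketch assumes element-only content, so it refuses mixed text and child elements in the same node):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def layout(elem, level=0, step="\t"):
    """Serialize elem: break after both tags of a parent, after the close tag of a leaf."""
    pad = step * level
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    if len(elem) == 0:
        # Leaf: the text stays sandwiched between the tags, unchanged.
        return f"{pad}<{elem.tag}{attrs}>{escape(elem.text or '')}</{elem.tag}>\r\n"
    # Parent: this sketch only handles element-only content.
    texts = [elem.text] + [child.tail for child in elem]
    if any(t and t.strip() for t in texts):
        raise ValueError(f"mixed content in <{elem.tag}> is not covered by the rule")
    body = "".join(layout(child, level + 1, step) for child in elem)
    return f"{pad}<{elem.tag}{attrs}>\r\n{body}{pad}</{elem.tag}>\r\n"

doc = ET.fromstring('<root><item id="1"> Wheelchair</item><item id="2">Skyscraper</item></root>')
result = layout(doc)
print(result)
```

With this layout the leading space of " Wheelchair" survives a reparse, because the indentation only ever lands in the parent's text. What to do when a parent also carries real text is left open here.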

Posted: **Sat Mar 14, 2009 2:29 am**

And what about this then?

Code: Select all

<a>Some
Text<b>Some
More Text
</b>Even More
Text</a>

That would become:

Code: Select all

<a>
  Some
  Text
  <b>Some More Text
  </b>
  Even More
  Text
</a>

When it really should be:

Code: Select all

<a>Some
Text<b>Some More Text</b>Even More
Text</a>

The XML standard's "default" behavior currently used is correct as far as I see it.
As I said earlier, only the xml:space declaration, used to mark which nodes must have their content left untouched, can guarantee that they are. Otherwise it is totally up to the apps reading/writing the XML how they treat spaces, linebreaks and tabs.

Hey freak, does FormatXML respect the xml:space="preserve" at all?
That is the only way to handle this "mess" in a way that matches the XML standard as far as I can tell. It's also less likely to have issues when exchanging xml with other apps or human edited xml as well.

Obviously this puts the responsibility for content preservation and intent into the hands of the author of the XML, which may not go down well with lazy coders.

But nevertheless, it is the proper way to do it.
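As a rough idea of what respecting xml:space="preserve" could mean for a formatter, here is a Python sketch (illustration only; `indent_respecting_preserve` is a hypothetical helper and not something FormatXML() is known to do). It indents in place but skips any subtree marked "preserve". A nested xml:space="default" inside a preserved subtree is not re-enabled in this sketch.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def indent_respecting_preserve(elem, level=0, step="  "):
    """Indent in place, leaving xml:space='preserve' subtrees and leaf text alone."""
    if elem.get(XML_SPACE) == "preserve" or len(elem) == 0:
        return                                # never touch this node's content
    child_pad = "\n" + step * (level + 1)
    if not (elem.text or "").strip():         # only replace whitespace-only text
        elem.text = child_pad
    for child in elem:
        indent_respecting_preserve(child, level + 1, step)
        if not (child.tail or "").strip():
            child.tail = child_pad
    if not (elem[-1].tail or "").strip():     # dedent before the closing tag
        elem[-1].tail = "\n" + step * level

doc = ET.fromstring(
    '<root><item xml:space="preserve"> Wheel<b>chair</b> </item>'
    '<group><item>Skyscraper</item></group></root>'
)
indent_respecting_preserve(doc)
formatted = ET.tostring(doc, encoding="unicode")
print(formatted)
```

The preserved `<item>` comes out character for character as it went in, attribute included, while `<root>` and `<group>` are indented normally.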

Posted: **Sat Mar 14, 2009 2:37 am**

Found this while Googling, check it out:
http://www.oracle.com/technology/pub/ar ... space.html

freak, I guess I can assume that FormatXML is not an XSLT formatter?

So the only advice I can give to the original poster is that FormatXML is not what you want. You need an XSLT formatter to do what you want, as it's way beyond the scope of FormatXML.

Posted: **Sat Mar 14, 2009 3:02 am**

@Rescator (I'm showing my ignorance here, I'm sure): is that valid XML? Raw data and elements in the same element? Isn't avoiding that the whole reason for attributes?

Looking through Wikipedia etc., I can't find any examples that show XML like that.

Posted: **Sat Mar 14, 2009 7:11 am**

Sure, it's valid, though obviously rare. An example would be XHTML.
And I'm sure there are plenty of applications and document formats out there that use XML in unusual ways like this.

Also, take a look at GetXMLNodeText() in the manual; it even supports retrieving that content (minus the child nodes and their content).
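To show that such mixed content parses fine, here is a short Python sketch using the standard `xml.etree.ElementTree` (an illustration, not PureBasic; `own_text` is a hypothetical helper that only roughly mirrors the "text minus child nodes" idea and is not claimed to match GetXMLNodeText() exactly):

```python
import xml.etree.ElementTree as ET

# Text and a child element inside the same parent: well-formed XML.
doc = ET.fromstring("<a>Some\nText<b>Some\nMore Text\n</b>Even More\nText</a>")

def own_text(elem):
    """Text directly inside elem, skipping child elements and their content."""
    return (elem.text or "") + "".join(child.tail or "" for child in elem)

a_text = own_text(doc)       # text of <a> without anything from <b>
b_text = own_text(doc[0])    # text of <b>
print(repr(a_text), repr(b_text))
```

ElementTree stores the text before the first child in `.text` and the text following each child in that child's `.tail`, so joining them gives the parent's own text.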

To be honest though, I find XML rather flawed and ugly (which is why I love HTML5 so much compared to XHTML), and I usually end up using ini files or custom formats for importing/exporting data from my apps. I avoid the XML parsing overhead as well.

Posted: **Sat Mar 14, 2009 3:30 pm**

Rescator wrote:..., I avoid the XML parsing overhead as well

That's all I wanted to hear!

Posted: **Sun Mar 15, 2009 1:43 pm**

freak wrote:The function does exactly what it was designed to do. There is no bug at all.

You are right. I'm sorry for having written that there is a bug in FormatXML().

freak wrote:My offer still stands, if you can clearly define what you want then i will try to implement it.

So here is my proposal.

Regards, Little John

Formating XML files