More XML #1 – Mixed text documents – Improved whitespace handling

kbmMW’s native XML parser has been with us for many years now, and it has done a really great job of quickly parsing and producing XML documents.

Over the years, it has been improved and enhanced both feature-wise and performance-wise, and I’m sure it will continue to evolve.

This blog post covers some improvements that were prompted by a customer who reported that the formatted output of the XML generator contained unneeded extra line separators. It was a fairly harmless problem, since the XML was still correct, but it led me to look closer. In doing so I discovered that kbmMW’s XML parser/generator was missing a feature, namely the ability to correctly parse and generate mixed text XML files.

Contents

1 Mixed text documents
2 Whitespace
3 Compact documents

Mixed text documents

A mixed text XML document could look like this:

<a>Word 1<b>Word 2</b>Word 3</a>

At first glance it looks like a typical XML document, but on closer inspection it is not, because node a contains both text and a child node.

Previously, kbmMW’s XML parser/generator could read this type of document, but the text “Word 3” would be lost due to the way it operated.
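To make the problem concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not kbmMW code and says nothing about how kbmMW stores mixed text internally. It only shows how one DOM-like API keeps the three text runs apart, and how a hypothetical model with a single text slot per node (the function single_text_model is made up for illustration) has no place to store “Word 3”.

```python
import xml.etree.ElementTree as ET

# The mixed text example from above.
doc = "<a>Word 1<b>Word 2</b>Word 3</a>"
a = ET.fromstring(doc)
b = a[0]

# ElementTree keeps text that follows a child element in that child's "tail".
leading_text = a.text    # text before <b>
child_text = b.text      # text inside <b>
trailing_text = b.tail   # text after </b>, still inside <a>


def single_text_model(elem):
    """Hypothetical lossy model: one text value per node, then its children."""
    return {
        "name": elem.tag,
        "text": elem.text,
        "children": [single_text_model(child) for child in elem],
    }


# The trailing text has no slot in this model, so it silently disappears.
lossy = single_text_model(a)

# Serializing the ElementTree structure keeps all three text runs.
round_trip = ET.tostring(a, encoding="unicode")
```

The point of the sketch is only that a mixed text document needs a place for text that appears between and after child nodes.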

The next release of kbmMW contains improvements that allow reading and writing mixed text documents. The above XML will result in a DOM built like this:

[Figure: DOM tree built from the example above]

As usual, kbmMW continues to support converting this to (and from) different object notation formats like JSON, YAML, BSON and MessagePack.

Whitespace

As mentioned, the deeper dive into kbmMW’s already mature XML parser/generator started with a customer letting me know that additional line breaks were being added to the output. The line breaks were not dangerous as such, since the XML remained valid for nearly all uses, but they did indicate a problem, which I investigated.

As a result, kbmMW’s XML parser/generator’s use of the PreserveWhitespace boolean property (default false) has been improved so that it faithfully reproduces the whitespace of a loaded document when it is saved. The drawback is that if you add new nodes to a loaded document with PreserveWhitespace set to true, kbmMW will not automatically add line feeds or pretty styling for them. It is your own responsibility to do that via the Data property of each node.
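The trade-off described above can be illustrated with a small Python sketch based on xml.etree.ElementTree, which always keeps whitespace between elements. This is an analogy only, not kbmMW's PreserveWhitespace implementation. The text and tail attributes used here are ElementTree's own and merely play a role similar to the Data property mentioned above.

```python
import xml.etree.ElementTree as ET

source = "<root>\n  <item>one</item>\n</root>"
root = ET.fromstring(source)

# Whitespace between elements survives as text/tail, so the round trip is exact.
unchanged = ET.tostring(root, encoding="unicode")

# Appending a node adds no line break or indentation by itself.
new_item = ET.SubElement(root, "item")
new_item.text = "two"
unformatted = ET.tostring(root, encoding="unicode")

# Formatting the new node is done by hand through the surrounding whitespace.
root[0].tail = "\n  "  # line break and indent before the new node
new_item.tail = "\n"   # line break before </root>
formatted = ET.tostring(root, encoding="unicode")
```

Here unformatted places the new node directly before the closing root tag with no indentation, while formatted lines it up with the existing node.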

In addition, TkbmMWCustomDOMXMLParser now has a boolean property named CollapseNil (default false). When set, nodes of the form <node></node> are automatically collapsed to <node/>, since the two are equivalent according to the XML standard. Previously, such nodes were automatically collapsed when saving a previously loaded document.
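The same choice exists in other XML libraries. As an illustration only (this is Python's xml.etree.ElementTree, not kbmMW, and its short_empty_elements parameter is not the CollapseNil property), the following sketch serializes the same empty node both ways:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><node></node></doc>")

# Collapse empty elements into the self-closing form.
# Note that ElementTree writes a space before the slash.
collapsed = ET.tostring(root, encoding="unicode", short_empty_elements=True)

# Keep separate start and end tags for empty elements.
expanded = ET.tostring(root, encoding="unicode", short_empty_elements=False)
```

Both results describe the same document. The difference only matters when the exact bytes of the output have to match the input.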

Finally, the integer property AutoIndentCount has been introduced. It has a default value of 2 and controls the number of spaces used for each indentation level when PreserveWhitespace is false. Previously, a hard-coded value of 2 was used.
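The effect of a configurable indentation width can be sketched in Python with xml.etree.ElementTree.indent (available from Python 3.9). This is not kbmMW code. The helper pretty and its indent_count parameter are hypothetical names chosen to mirror the idea behind AutoIndentCount.

```python
import xml.etree.ElementTree as ET


def pretty(xml_text, indent_count=2):
    """Re-indent a document using indent_count spaces per nesting level."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space=" " * indent_count)
    return ET.tostring(root, encoding="unicode")


two = pretty("<a><b><c/></b></a>")
four = pretty("<a><b><c/></b></a>", indent_count=4)

print(two)
print(four)
```

With the default of 2 the node b is indented by two spaces and c by four. With indent_count=4 they are indented by four and eight spaces respectively.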

Compact documents

kbmMW XML has for a long time contained a probably lesser-known feature that lets you save a lot of memory when loading documents that contain many repeated strings. The strings can be the name of an element, attribute, namespace, reference, ID, type and/or data.

Instead of storing the same string again in every node that uses it, compact mode stores all textual data in a fast lookup list shared by the whole document, which is much more compact and still quite fast.

That means that each node will save a significant amount of data, and for large, complex but repetitive documents, a quite substantial reduction in memory use will be the norm.

In addition, the upcoming release shaves some extra bytes off each node for flag handling. This happens regardless of whether compact mode is used.

To use the compact mode XML parser/generator, make sure to add {$DEFINE KBMMW_COMPACT_XML} to kbmMWConfig.inc, then recompile kbmMW and the projects that use kbmMW’s XML handling.
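The general idea of a document-wide lookup list can be sketched as follows. This is a generic string table written in Python for illustration. The names StringTable and CompactNode are hypothetical, and the sketch makes no claim about how kbmMW's compact mode is actually implemented.

```python
class StringTable:
    """Stores each distinct string once and hands out integer ids."""

    def __init__(self):
        self._ids = {}      # string -> id
        self._strings = []  # id -> string

    def intern(self, value):
        """Return the id for value, adding it to the table on first use."""
        if value not in self._ids:
            self._ids[value] = len(self._strings)
            self._strings.append(value)
        return self._ids[value]

    def lookup(self, string_id):
        return self._strings[string_id]

    def __len__(self):
        return len(self._strings)


class CompactNode:
    """Node that holds ids into the shared table instead of its own strings."""

    __slots__ = ("table", "name_id", "data_id")

    def __init__(self, table, name, data):
        self.table = table
        self.name_id = table.intern(name)
        self.data_id = table.intern(data)

    @property
    def name(self):
        return self.table.lookup(self.name_id)

    @property
    def data(self):
        return self.table.lookup(self.data_id)


# One thousand nodes that share only three distinct strings.
table = StringTable()
nodes = [
    CompactNode(table, "item", "yes" if i % 2 == 0 else "no")
    for i in range(1000)
]
```

Each node carries two small integers, and the table holds just three strings no matter how many nodes repeat them. The cost is one extra lookup whenever a name or data value is read.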

By kimbomadsen