Did you know? #2 – SAX, a 4th way to parse XML

kbmMW Pro, Enterprise and Community Edition also contain a 4th way to parse XML, namely SAX parsing.
It is, in fact, used automatically by the 3 previous methods shown in "Did you know? #1 – 3 ways to parse XML".

A SAX parser is extremely fast at parsing XML and can handle huge XML files. However, it is more cumbersome to use, since it does not automatically build an XML DOM for you, which means you need to write code that targets the specific XML file.

Still, I think some may find it interesting to see how you can derive your own specific parser from the kbmMW SAX XML parser.

We are again starting out with some XML to be parsed:

And we still want it parsed into a list of TDef objects:

Method 4 – SAX based parsing

First we need to make a SAX parser that matches the XML file we want to parse. It can be coded in different ways, with more or less sloppy syntax checking and error handling, since it is up to you to determine when there is a logic or syntax error in the XML data.

Since a SAX parser is basically a tokenizer, all you get from it are tokens. How they relate to each other is up to you to figure out.
Typically it makes sense to have one or more state machines within your derived SAX parser to keep track of the structure, so you know where each token you receive is supposed to be handled.

The following is a sample SAX parser that both performs a rudimentary syntax check and parses XML files structured like the XML example above:

As you notice, all the grunt work happens within a Parse method. The code basically asks for the next token, figures out what to do with it, and updates a simple state machine to indicate how far down the example XML tree the parser is. It also checks whether the token is an opening tag or not, and whether the token is a tag or a symbol. A tag is what defines the name of each node in the XML file, e.g. <tag></tag>. The latter tag is a closing tag.
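The token loop and state machine described above can be sketched outside kbmMW as well. The following is a minimal illustration in Python using the standard library's xml.dom.pulldom event stream, not the kbmMW API. The XML layout (defs/def/name/value), the Def class and the State names are hypothetical and were chosen only for this sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto
from xml.dom import pulldom

# Hypothetical layout, assumed only for this sketch.
SAMPLE = (
    "<defs>"
    "<def><name>first</name><value>1</value></def>"
    "<def><name>second</name><value>2</value></def>"
    "</defs>"
)

@dataclass
class Def:
    name: str = ""
    value: str = ""

class State(Enum):
    ROOT = auto()      # outside the root element
    IN_DEFS = auto()   # inside <defs>
    IN_DEF = auto()    # inside <def>
    IN_FIELD = auto()  # inside <name> or <value>

def parse_defs(xml_text):
    """Pull events one by one and track the nesting with a state machine."""
    state = State.ROOT
    result, current, field = [], None, None
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            tag = node.tagName
            if state is State.ROOT and tag == "defs":
                state = State.IN_DEFS
            elif state is State.IN_DEFS and tag == "def":
                current, state = Def(), State.IN_DEF
            elif state is State.IN_DEF and tag in ("name", "value"):
                field, state = tag, State.IN_FIELD
            else:
                raise ValueError(f"unexpected <{tag}> in state {state.name}")
        elif event == pulldom.CHARACTERS:
            # Text only matters while a field element is open.
            if state is State.IN_FIELD:
                setattr(current, field, getattr(current, field) + node.data)
        elif event == pulldom.END_ELEMENT:
            tag = node.tagName
            if state is State.IN_FIELD and tag == field:
                state = State.IN_DEF
            elif state is State.IN_DEF and tag == "def":
                result.append(current)
                current, state = None, State.IN_DEFS
            elif state is State.IN_DEFS and tag == "defs":
                state = State.ROOT
            else:
                raise ValueError(f"unexpected </{tag}> in state {state.name}")
    return result
```

Calling parse_defs(SAMPLE) returns the two Def objects in document order. The design mirrors the description above: every event is handled according to the current state, and anything the state does not expect is reported as an error.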

Similarly, a tag like <tag/>, which basically combines the opening tag and the closing tag in one and usually indicates a null value, can be checked with IsEndTag. You can also test whether the value of the tag is null using the IsNilTag property. The example above does not contain any tags of those types, so we do not check for them.
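For comparison, an event stream that does not flag self-closing tags needs a little bookkeeping to find elements without content. The sketch below uses Python's xml.dom.pulldom rather than kbmMW, and the function name empty_elements is made up for the illustration. Note that this stream reports <a/> and <b></b> identically, so the two forms cannot be told apart here.

```python
from xml.dom import pulldom

def empty_elements(xml_text):
    """Return the tag names of elements that had neither text nor children."""
    empty = []
    has_content = []  # one flag per currently open element
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            if has_content:
                has_content[-1] = True  # the parent now has a child element
            has_content.append(False)
        elif event == pulldom.CHARACTERS:
            # Whitespace-only text is treated as "no content" in this sketch.
            if has_content and node.data.strip():
                has_content[-1] = True
        elif event == pulldom.END_ELEMENT:
            if not has_content.pop():
                empty.append(node.tagName)
    return empty
```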

If a token is a tag, it can also have attributes, e.g. <tag attr1="1"></tag>. Those attributes are available within the parser in the Attribs property, or via the IndexOfAttrib or AttribValue methods.
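To illustrate the general idea of reading attributes from a start-tag event, here is a small sketch using Python's xml.dom.pulldom. It does not use the kbmMW properties named above, and collect_attributes is a hypothetical helper name.

```python
from xml.dom import pulldom

def collect_attributes(xml_text):
    """Return (tag name, attribute dict) for every opening tag, in order."""
    found = []
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            # node.attributes maps attribute names to their string values.
            found.append((node.tagName, dict(node.attributes.items())))
    return found
```

Attribute values always arrive as strings, so any conversion to numbers or dates is left to the calling code.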

A tag's namespace can be accessed via the NameSpace property, and if a data type is defined on the tag, it can be accessed via the DataType property.

The SAX parser also detects declaration type tags, e.g. <?xml version="1.0"?>, which you can test for using IsDeclarationTag, and markup type tags, e.g. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, which can be checked for with IsMarkupDeclarationTag.

Back to basics… the actual call that makes the parsing happen:

Author: kimbomadsen
