Did you know? #2 – SAX, a 4th way to parse XML

kbmMW Pro, Enterprise and Community Edition also contain a 4th way to parse XML, namely SAX parsing.
It is, in fact, used automatically by the 3 previous ways shown in "Did you know? #1 – 3 ways to parse XML".

A SAX parser is extremely fast at parsing XML and can handle huge XML files. However, it is more cumbersome to use: it does not automatically build an XML DOM for you, so you need to write code that targets the specific XML file.

Still, I think some will find it interesting to see how you can derive your own specific parser from the kbmMW SAX XML parser.

We are again starting out with some XML to be parsed:

const
  XML: string =
              '<?xml version="1.0"?>'+
              '<DefaultBody>'+
              '<Default>'+
              '  <Defaultcode>XML_BOLIMPORT_MAP</Defaultcode>'+
              '  <Omschrijving>Folder waar de xml bestanden staan, die gemaakt zijn door bolmate vanuit de inkoop</Omschrijving>'+
              '  <Waarde>Z:\tmp</Waarde>'+
              '</Default>'+
              '<Default>'+
              '  <Defaultcode>XML_DESTUSED_MAP</Defaultcode>'+
              '  <Omschrijving>XMLBestanden die verplaatst worden als ze klaar zijn</Omschrijving>'+
              '  <Waarde>Z:\tmp\xmlexport</Waarde>'+
              '</Default>'+
              '</DefaultBody>';

And we still want it parsed into a list of TDef objects:

type
   TDef = class
   public
      Defaultcode:string;
      Omschrijving:string;
      Waarde:string;
   end;

   TDefs = class(TObjectList<TDef>);

Method 4 – SAX based parsing

First we need to make a SAX parser that matches the XML file we want to parse. It can be coded in different ways, with more or less sloppy syntax checking and error handling, since it is up to you to determine when there is a logic or syntax error in the XML data.

Since a SAX parser is basically a tokenizer, all you get from it are tokens. How they relate to each other is up to you to figure out.
Typically it makes sense to have one or more state machines within your derived SAX parser to keep track of the structure, so you know where each token you get is supposed to be handled.
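
To illustrate the state machine idea on its own, here is a minimal sketch that does not use kbmMW at all. The token kinds, the Tok helper and the hard-coded token list are made up for this illustration; they only stand in for whatever a real tokenizer would hand you.

```delphi
program StateMachineSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical token kinds, for illustration only (not the kbmMW token types).
  TTokenKind = (tkOpenTag, tkCloseTag, tkText);

  TToken = record
    Kind: TTokenKind;
    Value: string;
  end;

  // Where in the document structure we currently are.
  TState = (stOutside, stInItem, stInValue);

// Hypothetical helper that builds one token.
function Tok(AKind: TTokenKind; const AValue: string): TToken;
begin
  Result.Kind := AKind;
  Result.Value := AValue;
end;

var
  Tokens: array of TToken;
  State: TState;
  T: TToken;
begin
  // Flat token stream equivalent to <item><name>abc</name></item>
  Tokens := [Tok(tkOpenTag, 'item'), Tok(tkOpenTag, 'name'),
             Tok(tkText, 'abc'),
             Tok(tkCloseTag, 'name'), Tok(tkCloseTag, 'item')];

  State := stOutside;
  for T in Tokens do
    case State of
      stOutside:
        if (T.Kind = tkOpenTag) and (T.Value = 'item') then
          State := stInItem
        else
          raise Exception.Create('Unexpected token: ' + T.Value);
      stInItem:
        if (T.Kind = tkOpenTag) and (T.Value = 'name') then
          State := stInValue
        else if (T.Kind = tkCloseTag) and (T.Value = 'item') then
          State := stOutside
        else
          raise Exception.Create('Unexpected token: ' + T.Value);
      stInValue:
        if T.Kind = tkText then
          Writeln('name = ', T.Value)   // the state tells us what this text means
        else if (T.Kind = tkCloseTag) and (T.Value = 'name') then
          State := stInItem
        else
          raise Exception.Create('Unexpected token: ' + T.Value);
    end;
end.
```

The same token means different things depending on the current state, and any token that does not fit the current state is reported as a structure error.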

The following is a sample SAX parser that both performs a rudimentary syntax check and parses XML files structured like the XML example above:

interface

uses
  System.SysUtils, System.Classes, System.Generics.Collections,
  kbmMWXML;

...
type
   TSAXState = (ssNone,ssDefaultBody,ssDefault,ssSymbol);

   TSAXParser = class(TkbmMWCustomSAXXMLParser)
   private
      FState:TSAXState;
      FDefs:TDefs;
   public
      constructor Create(const AString:string; const ADefs:TDefs);
      procedure Parse; override;
   end;

implementation

constructor TSAXParser.Create(const AString:string; const ADefs:TDefs);
var
   ss:TStringStream;
begin
     inherited Create;
     FDefs:=ADefs;

     ss:=TStringStream.Create(AString);
     try
        SetStream(ss);
        // Parse while the stream is still alive, in case SetStream
        // only keeps a reference to it.
        Parse;
     finally
        ss.Free;
     end;
end;

procedure TSAXParser.Parse;
  procedure Error(const AText:string; const AToken:string);
  begin
       raise Exception.Create('Error parsing XML - '+AText+' - '+AToken);
  end;
var
   def:TDef;
   symbol:string;
begin
     FState:=ssNone;
     def:=nil;


     while true do
     begin
          NextToken(FState=ssSymbol);

          case TokenType of
           mwxml_tEnd:
              begin
                   break;
              end;

           mwxml_tSymbol:
              begin
                   // The text between an opening and a closing tag.
                   symbol:=TokenString;
                   dec(FState);   // step back from ssSymbol to ssDefault
                   continue;
              end;

           mwxml_tLineEnd:
              begin
                   continue;
              end;

           mwxml_tXMLTag:
              begin
                   if IsClosingTag then
                   begin
                        if (FState=ssNone) and (TokenName='xml') then
                           continue
                        else if (FState=ssDefault) then
                        begin
                             if TokenName='Default' then
                             begin
                                  if def<>nil then
                                  begin
                                       if FDefs<>nil then
                                          FDefs.Add(def)
                                       else
                                           def.Free;
                                  end;
                                  def:=nil;
                                  FState:=ssDefaultBody;
                             end
                             else if TokenName='Defaultcode' then
                                  def.Defaultcode:=symbol
                             else if TokenName='Omschrijving' then
                                  def.Omschrijving:=symbol
                             else if TokenName='Waarde' then
                                  def.Waarde:=symbol
                             else
                                 Error('Invalid closing tag',TokenName);
                        end;
                   end
                   else
                   begin
                        if (FState=ssNone) and (TokenName='xml') then
                           continue
                        else if (FState=ssNone) and (TokenName='DefaultBody') then
                           FState:=ssDefaultBody
                        else if (FState=ssDefaultBody) and (TokenName='Default') then
                        begin
                             FState:=ssDefault;
                             def:=TDef.Create;
                        end
                        else if (FState=ssDefault) then
                        begin
                             if TokenName='Defaultcode' then
                                  FState:=ssSymbol
                             else if TokenName='Omschrijving' then
                                  FState:=ssSymbol
                             else if TokenName='Waarde' then
                                  FState:=ssSymbol
                             else
                                 Error('Unknown value',TokenName);
                        end
                        else
                            Error('Invalid structure',TokenName);
                   end;
              end;

           mwxml_tXMLComment:
              begin
                   continue;
              end;
           mwxml_tXMLCDATA:
              begin
                   continue;
              end;
          end;
     end;
end;

As you notice, all the grunt work happens within the Parse method. The code basically asks for the next token, figures out what to do with it, and updates a simple state machine to indicate how far down the example XML tree the parser is. It also checks whether the token is a tag or a symbol, and whether a tag is an opening or a closing tag. A tag is what defines the name of each node in the XML file, e.g. <tag></tag>. The latter tag is a closing tag.

Similarly, a tag like <tag/>, which combines the opening and the closing tag in one and usually indicates a null value, can be checked for with IsEndTag. You can also test whether the value of the tag is null using the IsNilTag property. The example above has no tags of those types, so we do not check for them.

If a token is a tag, it can also have attributes, e.g. <tag attr1="1"></tag>. Those attributes are available within the parser in the Attribs property, or via the IndexOfAttrib or AttribValue methods.

A tag's namespace can be accessed via the NameSpace property, and if a datatype is defined on the tag, it can be accessed via the DataType property.

The SAX parser also detects declaration type tags, e.g. <?xml version="1.0"?>, which you can test for using IsDeclarationTag, and markup type tags, e.g. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, which can be checked for with IsMarkupDeclarationTag.

Back to basics… the actual call that makes the parsing happen:

function ParseXML4(const AString:string):TDefs;
begin
     Result:=TDefs.Create();
     try
        TSAXParser.Create(AString,Result).Free;
     except
        Result.Free;   // do not leak the list if parsing fails
        raise;
     end;
end;
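
A possible way to consume the result, assuming the declarations from this article (XML, TDef, TDefs and ParseXML4) are in scope. The procedure name DumpDefs is made up for the illustration. Since TDefs descends from TObjectList<TDef>, which owns its objects by default, freeing the list also frees the TDef instances in it.

```delphi
// Hypothetical helper, not part of kbmMW or the sample parser above.
procedure DumpDefs;
var
   defs:TDefs;
   def:TDef;
begin
     defs:=ParseXML4(XML);
     try
        // One line per <Default> node found in the XML.
        for def in defs do
            Writeln(def.Defaultcode,' = ',def.Waarde);
     finally
        defs.Free;   // also frees the contained TDef objects
     end;
end;
```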

By kimbomadsen
