XTRAN Example — Normalize Tags and Vocabulary in XML

Scenario — you are receiving XML from multiple sources, and although the structure is the same, each source has its own tags and vocabulary.  You need to normalize terms across those multiple sources.

XTRAN to the rescue!

The following example uses an XTRAN rules file comprising 167 non-comment lines of XTRAN's rules language ("meta-code") to process XML and normalize vocabulary in it, substituting preferred tags and terms for synonyms.

The rules took 1½ hours to write and one hour to debug.  (That's right, only 2½ hours total!)

The XTRAN rules automate the normalization of both tags and terms in XML.  For each tag or tag value, you specify a preferred tag or term and a set of synonyms for it.  The rules then replace every synonym found in the specified XML tags or fields with the corresponding preferred tag or term.

You specify preferred tags and terms, and the synonyms for them, in a "synonyms" file, using the following formats.  Empty lines and lines starting with ; are ignored.

To normalize an XML tag:

TAG,<prftag>,<syntag>[[,...]]

where:

      <prftag>  is the preferred tag name
      <syntag>  are one or more synonyms for <prftag>

Note that a TAG specification automatically accommodates both tags with values and empty tags.

To normalize an XML field:

TEXT,<tagnam>,<prfval>,<synval>[[,...]]

where:

      <tagnam>  is the name of an XML tag whose text value is to be normalized
      <prfval>  is the preferred value for <tagnam>
      <synval>  are one or more synonyms for <prfval>

Note that the rules automatically accommodate XML with missing tags.
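
To make the two specification formats concrete, here is a minimal Python sketch of a parser for them.  It is not XTRAN meta-code and not the rules file described on this page; the function name parse_synonyms and the two dictionaries it returns are hypothetical, chosen only for illustration.  It assumes that tag names may be written with surrounding angle brackets, as in the sample specifications on this page, and that values contain no commas.

```python
def parse_synonyms(lines):
    """Build lookup tables from synonym specification lines (illustrative sketch)."""
    tag_map = {}   # synonym tag name -> preferred tag name
    text_map = {}  # (tag name, synonym value) -> preferred value
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip empty lines and ';' comment lines
        # Assumption: fields are comma-separated and contain no embedded commas.
        fields = [field.strip() for field in line.split(",")]
        kind = fields[0]
        if kind == "TAG" and len(fields) >= 3:
            preferred = fields[1].strip("<>")
            for synonym in fields[2:]:
                tag_map[synonym.strip("<>")] = preferred
        elif kind == "TEXT" and len(fields) >= 4:
            tag = fields[1].strip("<>")
            preferred = fields[2]
            for synonym in fields[3:]:
                text_map[(tag, synonym)] = preferred
        else:
            raise ValueError(f"unrecognized synonym specification: {raw!r}")
    return tag_map, text_map
```

Storing each synonym as a dictionary key gives the same kind of direct lookup that the paraphrase of the rules describes: one probe per tag name or tag value, with no scan through the list of synonyms.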

Here is an English paraphrase of the XTRAN rules:

    Open synonym specifications file read-only
    For each input line
        Parse TAG or TEXT specification, store in data base
    Close synonym specifications file
    For each XML "statement"
        If it's a tag statement
            If tag name is a synonym for a preferred tag name (direct DB lookup)
                Change the tag's name
            If tag's value is text
                If value is a synonym for a preferred value (direct DB lookup)
                    Change the tag's value to the preferred value
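
The second half of that paraphrase can be sketched in Python as follows.  Again, this is not XTRAN meta-code; it is a rough regular-expression approximation, and the names normalize, tag_map, and text_map are hypothetical.  It assumes simple XML without attributes on the value-bearing tags, CDATA, or nested markup inside values.  It also assumes that a value is looked up under the tag name as written in the input, before any renaming; that choice is a guess, made because it reproduces the sample output on this page, where <location>USA</location> becomes <country>USA</country>.

```python
import re

# Any start, end, or empty tag: optional '/', the tag name, then the rest.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*)>")
# An attribute-free element whose content is plain text only.
ELEM_TEXT_RE = re.compile(r"<([A-Za-z_][\w.\-]*)>([^<>]+)</\1>")

def normalize(xml, tag_map, text_map):
    """Replace synonym values and tag names with their preferred forms."""
    def fix_text(match):
        name, value = match.group(1), match.group(2)
        # Assumption: the lookup key is the tag name as it appears in the input.
        preferred = text_map.get((name, value.strip()), value)
        return f"<{name}>{preferred}</{name}>"

    def fix_tag(match):
        slash, name, rest = match.groups()
        # Unknown tag names pass through unchanged.
        return f"<{slash}{tag_map.get(name, name)}{rest}>"

    xml = ELEM_TEXT_RE.sub(fix_text, xml)  # pass 1: normalize text values
    return TAG_RE.sub(fix_tag, xml)        # pass 2: rename synonym tags

# Hypothetical tables, written out by hand for this example.
tag_map = {"years": "age", "remote": "virtual"}
text_map = {("status", "working"): "employed"}
print(normalize("<years>19</years><status>working</status><remote />",
                tag_map, text_map))
# prints: <age>19</age><status>employed</status><virtual />
```

Because the tag pattern matches start tags, end tags, and empty tags alike, one substitution covers both tags with values and empty tags, and a tag that never appears in the input simply produces no match.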

How can such powerful and generalized XML processing be automated in only 2½ hours and 167 lines of XTRAN rules?  Because there is so much capability already available as part of XTRAN's rules language.  These rules take advantage of the following functionality:

The input to and output from XTRAN are untouched.



Process Flowchart

Here is a flowchart for this process, in which the elements are color coded:

[Figure: data flowchart]

Input to XTRAN — synonym specifications:

TAG,<country>,<location>,<venue>
TAG,<age>,<years>
TAG,<virtual>,<remote>,<offsite>
TEXT,<country>,United States,USA
TEXT,<status>,employed,active,working



Input to XTRAN — XML to process:

<employee>
    <firstname>Arthur</firstname>
    <lastname>Jones</lastname>
    <location>United States</location>
    <years>25</years>
    <status>active</status>
    <virtual />
</employee>
<employee>
    <firstname>Gertrude</firstname>
    <lastname>Murgatroyd</lastname>
    <venue>France</venue>
    <age>40</age>
    <status>employed</status>
    <offsite />
</employee>
<employee>
    <firstname>Agnes</firstname>
    <lastname>Baldwin</lastname>
    <location>USA</location>
    <status>retired</status>
</employee>
<employee>
    <firstname>Fred</firstname>
    <lastname>Smith</lastname>
    <country>Germany</country>
    <years>19</years>
    <status>working</status>
    <remote />
</employee>



Output from XTRAN — normalized XML:

<employee>
    <firstname>Arthur</firstname>
    <lastname>Jones</lastname>
    <country>United States</country>
    <age>25</age>
    <status>employed</status>
    <virtual />
</employee>
<employee>
    <firstname>Gertrude</firstname>
    <lastname>Murgatroyd</lastname>
    <country>France</country>
    <age>40</age>
    <status>employed</status>
    <virtual />
</employee>
<employee>
    <firstname>Agnes</firstname>
    <lastname>Baldwin</lastname>
    <country>USA</country>
    <status>retired</status>
</employee>
<employee>
    <firstname>Fred</firstname>
    <lastname>Smith</lastname>
    <country>Germany</country>
    <age>19</age>
    <status>employed</status>
    <virtual />
</employee>