XTRAN Example — Cross Link & Index HTML Documents


Anatomy of a Complex HTML Document with Variants

A document (or Web site) written in HTML may comprise many "chapters" (HTML source files), typically front-ended by a title page / Table of Contents that links to all of the chapter files and an index.  An example is the XTRAN User's Manual.

In some situations, we may have variants of the document, in which not all chapters participate in every variant.  For instance, in the case of the XTRAN User's Manual, each XTRAN licensee receives a variant that is tailored for the licensed activity.  So someone licensing XTRAN for, say, translation of PL/I to C++ will get a variant of the XTRAN User's Manual that contains only those chapters relevant to that activity.  Someone else licensing XTRAN for, say, analysis of mainframe assembler will receive a different variant, containing a different selection of chapters.

Each variant must have its own title page and a table of contents that links to the chapters participating in that variant.  The variant must also have an index showing only those index entries that are in the participating chapters.


Objective:  Cross Link and Index HTML Documents

For maximum convenience of use, we would like for each occurrence of a significant term anywhere in a variant of the document to be cross linked to that term's definition (a practice widely seen in Wikis).  That definition may well be in a different chapter than the reference.  However, we don't want to cross link the occurrence of a term whose definition is in a chapter not included in our variant.  We also don't want to cross link a term occurrence that's in close proximity to the term's definition.
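
The three conditions above can be sketched as a single decision function.  This is only an illustration in Python, not XTRAN rules language; the names (Definition, should_cross_link, min_distance) are hypothetical, and measuring "close proximity" as a line distance within the same chapter is an assumption, since the text does not say how proximity is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    chapter: str  # chapter file that holds the term's bookmark
    line: int     # line of the bookmark within that chapter


def should_cross_link(term, chapter, line, definitions, variant_chapters,
                      min_distance=20):
    """Decide whether one occurrence of `term` should be cross linked.

    `min_distance` is an assumed proximity threshold, in lines.
    """
    target = definitions.get(term)
    if target is None:
        return False  # no known definition for this term
    if target.chapter not in variant_chapters:
        return False  # definition's chapter is not part of this variant
    if target.chapter == chapter and abs(target.line - line) < min_distance:
        return False  # occurrence is too close to the definition itself
    return True
```

An occurrence in one chapter is linked to a definition in another chapter whenever that other chapter participates in the variant; the proximity test applies only when both are in the same chapter.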

In addition, we want to have a thorough alphabetical index at the end of our document.  Obviously, it should include only entries whose bookmarks are in chapters included in our variant.  For the document reader's convenience, we would also like for each index entry to show the section and chapter in which it occurs; this is commonly known as a Key Word in Context (KWIC) index.  And we would like to have, at the index's start, a letter-by-letter index to the index.
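
As a rough sketch of these index requirements, the Python below filters entries by variant, groups them by first letter, and puts a letter-by-letter index at the start.  It is not how XTRAN does it; Entry and build_index are hypothetical names, and the output is plain text lines rather than HTML to keep the example short.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    term: str
    chapter: str
    section: str


def build_index(entries, variant_chapters):
    """Return index lines: a letter index first, then entries by letter."""
    by_letter = defaultdict(list)
    for entry in entries:
        if entry.chapter not in variant_chapters:
            continue  # bookmark is in a chapter excluded from this variant
        first = entry.term[:1].upper()
        # Terms that do not start with a letter are grouped under "#".
        by_letter[first if first.isalpha() else "#"].append(entry)

    letters = sorted(by_letter)
    lines = [" ".join(letters)]  # the letter-by-letter index to the index
    for letter in letters:
        lines.append(letter)
        ordered = sorted(by_letter[letter],
                         key=lambda e: (e.term.lower(), e.chapter, e.section))
        for e in ordered:
            # Each entry shows the section and chapter in which it occurs.
            lines.append(f"  {e.term} ({e.section}, {e.chapter})")
    return lines
```

Sorting on the lowercased term keeps "anchor" and "Attribute" in dictionary order regardless of capitalization.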

As any professional indexer will tell you, the hard work is deciding which terms in the document to index.  Although many attempts have been made to automate this process, with some success (including with XTRAN), the only way to guarantee a really good job is for a knowledgeable human being to carefully control the indexing process and "fine tune" the final product.

However, once we have inserted the index entries into the document's chapters, we can automate the rest of the work — cross referencing item occurrences in a document variant's chapters, and generating the document variant's index.  We can also automate generating each variant's title page / Table of Contents.


How XTRAN and TemplaGen Can Help

HTML is one of many computer languages XTRAN manipulates. XTRAN represents each tag, segment of nonmarkup text, and end tag as an HTML "statement", and represents each attribute of a tag as a "statement attribute", possibly with a value (following =).
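
To make the "statement" idea concrete, here is a minimal Python sketch that flattens HTML into a list of statements, each with its attributes.  It uses Python's standard html.parser module and is only an analogy: XTRAN's actual internal representation is not shown in this document, and the tuple layout and the names StatementCollector and to_statements are assumptions made for illustration.

```python
from html.parser import HTMLParser


class StatementCollector(HTMLParser):
    """Collect tags, nonmarkup text, and end tags as flat 'statements'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.statements = []

    def handle_starttag(self, tag, attrs):
        # An attribute without "=value" is reported with the value None.
        self.statements.append(("tag", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.statements.append(("end", tag, {}))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.statements.append(("text", data, {}))


def to_statements(html):
    collector = StatementCollector()
    collector.feed(html)
    collector.close()
    return collector.statements
```

For example, to_statements('<a name="kwic" noindex>KWIC</a>') yields a "tag" statement for a with the attributes name (value "kwic") and noindex (no value), then a "text" statement, then an "end" statement.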

XTRAN's Internal Representation (XIR) of HTML is essentially the same as for all other computer languages XTRAN manipulates, including assemblers, 3GLs such as COBOL and C++, 4GLs such as RPG and Natural, meta-data languages such as XML, scripting languages, Web languages, data base languages such as SQL, and special-purpose languages.  This means that the full power of XTRAN's rules language is available to manipulate HTML.

This example employs 1,068 code lines of XTRAN's rules language (which we also call "meta-code").  It shows how we can use several versions of XTRAN to automatically cross link and index all occurrences of specified text items.  The example assumes that the target to which each item is to be cross linked and indexed is marked with a bookmark that has an appropriate name.  Such a bookmark may or may not enclose related text.

We use two (illegal) HTML <a> attributes in the bookmarks, nolink and noindex, to control, in the original HTML files, whether each bookmark is cross linked and/or indexed.  We remove these attributes in the process of cross linking, so they don't show up in the final document variant.
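
The handling of these two control attributes might look like the following Python sketch.  It assumes that the mere presence of nolink or noindex on a bookmark suppresses cross linking or indexing respectively, and that a bookmark's attributes are available as a dictionary; split_control_attrs and render_start_tag are hypothetical helper names, not part of XTRAN.

```python
import html

CONTROL_ATTRS = ("nolink", "noindex")


def split_control_attrs(attrs):
    """Return (cleaned_attrs, link, index) for one bookmark's attributes.

    Assumption: presence of an attribute means "do not link / index".
    """
    cleaned = {k: v for k, v in attrs.items() if k not in CONTROL_ATTRS}
    return cleaned, "nolink" not in attrs, "noindex" not in attrs


def render_start_tag(tag, attrs):
    """Re-emit a start tag from its (already cleaned) attributes."""
    parts = [tag]
    for name, value in attrs.items():
        if value is None:
            parts.append(name)  # attribute with no value
        else:
            parts.append(f'{name}="{html.escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"
```

Re-emitting the tag from the cleaned attributes is what keeps the nonstandard attributes out of the final document variant.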

We also use a set of XTRAN styling rules for rendering HTML that specifies, for each tag, whether it and its end tag (if any) are to get preceding and/or following line breaks in the output.
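
One way to picture such styling rules is as a per-tag table of four line-break flags.  The Python below is a hedged sketch only: the Breaks fields, the STYLE table, and its three sample entries are invented for illustration and are not XTRAN's actual styling rules.

```python
from typing import NamedTuple


class Breaks(NamedTuple):
    before_start: bool = False  # line break before the tag
    after_start: bool = False   # line break after the tag
    before_end: bool = False    # line break before the end tag
    after_end: bool = False     # line break after the end tag


# Hypothetical sample rules; tags not listed get no line breaks.
STYLE = {
    "ul": Breaks(True, True, True, True),
    "li": Breaks(before_start=True),
    "p": Breaks(before_start=True, after_end=True),
}


def render(statements):
    """Render (kind, value) statements; kind is 'tag', 'end', or 'text'."""
    out = []

    def newline():
        # Avoid leading and doubled line breaks.
        if out and not out[-1].endswith("\n"):
            out.append("\n")

    for kind, value in statements:
        if kind == "text":
            out.append(value)
            continue
        rule = STYLE.get(value, Breaks())
        before, after = ((rule.before_start, rule.after_start)
                         if kind == "tag"
                         else (rule.before_end, rule.after_end))
        if before:
            newline()
        out.append(f"<{value}>" if kind == "tag" else f"</{value}>")
        if after:
            newline()
    return "".join(out)
```

Keeping the rules in a table separates layout decisions from the rendering loop, so changing how a tag is laid out means editing one entry.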

How can such powerful and generalized HTML manipulation be automated in only 1,068 code lines of XTRAN rules?  Because there is so much capability already available as part of XTRAN's rules language.  These rules take advantage of the following functionality:

TemplaGen, our no-code / low-code template-driven and data-driven artifact generator, provides the means to automate the generation of the artifacts we need, both HTML files and other text artifacts.  TemplaGen is itself written in XTRAN's rules language.

Strategy

Our strategy is as follows: