XTRAN Example — Analyze Straight-line Code Length by Nesting Depth

In structured computer languages, a sequence of statements at the same nesting level is known as a "straight-line code" (SLC) segment or run.

Code quality

One property of SLC runs is that, if one gets very long, the code becomes harder to understand, because:

So an analysis of SLC run lengths can be a valuable code quality measure.

Forensic Analysis

Another interesting property of SLC runs is that, if someone copies a body of code and attempts to conceal the theft by making minor modifications to the code or otherwise obfuscating it, the thief almost never disturbs the code's structure, since that would require re-engineering and debugging.  As a result, the frequency of SLC run lengths at given nesting depths is unlikely to change appreciably.

Similarly, if a nefarious hacker copies existing malware and changes it to avoid detection by anti-malware facilities, the chances are that the copy's distribution of SLC run lengths by nesting depth will not change significantly.

So an analysis of two bodies of code that results in an identical or highly similar distribution of such occurrences is very likely to indicate code copying.  It is extremely unlikely that independently developed code would be that close.  This means that such analysis can be a valuable forensic analysis tool when looking for code theft or common code authorship.

Example

The following example uses a set of XTRAN rules comprising 59 non-comment lines of XTRAN's rules language ("meta-code").  The rules took ½ hour to write and ½ hour to debug.  (That's right, only 1 hour total!)

The XTRAN rules used for this example are totally language-independent; they can be used unchanged for any language XTRAN accommodates.

These XTRAN rules analyze code (C in this example) and tally the occurrences of SLC run lengths by code nesting depth.  They then output the results as delimiter-separated values (DSV), which is a format suitable for input to a spreadsheet, data base, or statistics software, in the following format:

<dpth><dlm><lgth><dlm><tally>

where:

    <dlm>  is the DSV delimiter character; defaults to comma
     <dpth>  is the code nesting depth, in numeric order
     <lgth>  is the SLC run length, in numeric order within <dpth>
     <tally>   is the tally of <lgth> runs at <dpth>

The following is an English paraphrase of the rules:

For each statement
    If it is at the end of an SLC
        Get SLC's run length (statements)
        Remember that this length occurred at this nesting depth
        Increment tally for this length at this depth
Create or append to output file
For each code nesting depth we saw, in numeric order
    For each SLC run length we saw at this depth, in numeric order
        Output this run length's tally for this depth
Close output file

The actual analysis work is done by the initial "For" statement, which literally comprises 5 statements in XTRAN's rules language.

The code mining rules for this example can easily be enhanced to produce DSV output that can be interactively queried using existing XTRAN rules.

How can such powerful and generalized code quality analysis be automated in just 1 hour and only 59 code lines of XTRAN rules?  Because there is so much capability already available as part of XTRAN's rules language.  These rules take advantage of the following functionality:

The input to and output from XTRAN are untouched.



Process Flowchart

Here is a flowchart for this process, in which the elements are color coded:

data flowchart

Input to XTRAN:

void proc1(void)
        {
        short s1, s2;
        long l1, l2;

        for (s1 = 0; s1 < 5; ++s1)
            {
            s2 = 1;
            s2 = 2;
            if (l1 > 3)
                {
                l2 = 1;
                l2 = 2;
                l2 = 3;
                }
            s2 = 3;
            s2 = 4;
            s2 = 5;
            if (l1 > 3)
                {
                l2 = 1;
                l2 = 2;
                l2 = 3;
                }
            }
        return;
        }
void proc2(void)
        {
        short s1, s2;
        long l1, l2;

        for (s1 = 0; s1 < 5; ++s1)
            {
            s2 = 1;
            s2 = 2;
            if (l1 > 3)
                {
                l2 = 1;
                l2 = 2;
                }
            s2 = 3;
            s2 = 4;
            s2 = 5;
            if (l1 > 3)
                {
                l2 = 1;
                l2 = 2;
                l2 = 3;
                }
            }
        return;
        }


Output from XTRAN:

0,2,1
1,6,2
2,9,2
3,4,1
3,5,3

Importing those results into OpenOffice Calc, we quickly created the following bubble chart:

bubble chart