[unrev-II] Source Code in XML: Data and Editing Requirements

From: Eric Armstrong (eric.armstrong@eng.sun.com)
Date: Sat Jan 29 2000 - 22:21:01 PST



Here is a modified version of the requirements I originally
produced for the source in XML project. The gist of it is that
it can be handled without using any special node types. (This
is a good thing. Otherwise, you wind up defining a language
that is equivalent to the original -- and which must stay
in step with it.) As a result, the "smarts" in the system
migrate to the XML-to-plain-text filter, until such time
as an XML-aware compiler is constructed.

Apologies in advance for the fact that this should probably
go to an as-yet uncreated DKR working group sublist, instead
of the main list. (I suspect tonight's traffic will help
motivate such a subgroup...)

-------Original Message-------
Subject: Double Containment

For a literate style of programming in a hierarchical
system, comments need to "double contain" both other comments
and code. You want the ability to collapse the hierarchy
so you see only comments, and expand it at certain points
to see the code tucked under them. That style produces
a more readable, better documented program, because code
that can't be collapsed into a comment sticks out like a sore
thumb.

At the same time, some comments are large, so they need to
take advantage of the tree structure and the ability to
collapse things, as well. So a comment needs the ability
to "contain" both code and comment elements.

One way to do that is to create a meta-node that contains a
comment substructure and a code substructure. But that makes
the editing task very difficult -- the meta-node must be
invisible, and the comment node must appear to the user
to be the parent of the code. That requirement in turn
implies the need for an "Editor Stylesheet" to tell the
editor what to do with the nodes in the system.

But it makes a lot more sense to use existing standards,
and avoid creating new ones. So it would be more desirable
to use a more natural mapping of source text to an
XML source tree.

At the same time, we want to avoid creating different nodes
for every language construct. That ties the editor to a
particular language and creates an editor that is harder to
learn and harder to use. It is therefore relatively clear
that the right tree structure for storing source code will be
basically language-independent, so that any XML editor the
developer is familiar with can be used.

Single Node Type
----------------
The bottom line is that the XML source tree really needs only
*one* kind of node. Call it "line", or "code" or something.
The smarts are all embedded in the XML-to-text processor,
which means that different processors can handle different
languages.

At this point, the smarts should handle end-of-comment marks,
braces, and semi-colons. Parentheses should be ignored, at least
for now.

Semi-colons are easiest. One is supplied at the end of an
entry, if it doesn't already exist. (Checking for comments embedded
in the line is the only tricky part.)

For if statements and the like, the filter should expect
to supply braces, if they don't already exist. Again, working
around comments is the tricky bit. But the structuring information
makes it possible to identify endpoints.
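
A minimal sketch of the semi-colon and brace rules, assuming a
hypothetical single node type called Entry (one line of text plus
nested entries). The class and method names are illustrative only.
It treats a leaf entry as a statement, a parent entry as a block, and
a // entry as a heading whose subentries stay at the same level. It
does not yet look for comments embedded at the end of a line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single node type: one line of text plus nested entries. */
final class Entry {
    final String text;
    final List<Entry> children = new ArrayList<>();

    Entry(String text) { this.text = text; }

    Entry add(Entry child) { children.add(child); return this; }
}

final class TerminatorFilter {
    /** Renders an entry and its subtree as plain text, supplying ; { } as needed. */
    static String render(Entry e, String indent) {
        StringBuilder out = new StringBuilder();
        String text = e.text.trim();
        if (text.startsWith("//")) {
            // A comment "contains" its subentries, but they are emitted at the same level.
            out.append(indent).append(text).append('\n');
            for (Entry c : e.children) {
                out.append(render(c, indent));
            }
        } else if (e.children.isEmpty()) {
            // Leaf entry: supply the semi-colon unless the user already typed one.
            out.append(indent).append(text);
            if (!text.endsWith(";")) {
                out.append(';');
            }
            out.append('\n');
        } else {
            // Parent entry: the sublist marks where the block starts and ends.
            out.append(indent).append(text);
            if (!text.endsWith("{")) {
                out.append(" {");
            }
            out.append('\n');
            for (Entry c : e.children) {
                out.append(render(c, indent + "    "));
            }
            out.append(indent).append("}\n");
        }
        return out.toString();
    }
}
```

Rendering an "if (n < 2)" entry that holds a "// too small" comment
with "return 2" tucked under it yields a braced if-block with the
comment and the terminated statement inside it.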

An entry that starts with // is terminated at the end of line
automatically. If the output is wrapped, multiple //'s are supplied.
What becomes an entire paragraph of explanation is therefore
easily marked as a comment with two characters.
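
One way the wrapping rule might look, as a sketch only. The class
name, the method name and the width parameter are assumptions, and
the entry is assumed to start with //.

```java
import java.util.ArrayList;
import java.util.List;

final class LineCommentWrapper {
    /** Splits one "//" entry into output lines of at most width characters, each starting with "//". */
    static List<String> wrap(String entry, int width) {
        String body = entry.trim().substring(2).trim();   // drop the leading "//"
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder("//");
        for (String word : body.split("\\s+")) {
            // Start a new "//" line when the next word would overflow the width.
            if (line.length() > 2 && line.length() + 1 + word.length() > width) {
                lines.add(line.toString());
                line = new StringBuilder("//");
            }
            line.append(' ').append(word);
        }
        lines.add(line.toString());
        return lines;
    }
}
```

A word longer than the width is left on a line of its own rather
than being split.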

Since we know where // comments end, we can terminate an
entry that starts with /* at the end of its sublist.
Subentries with /* should be converted to // when generating
output for the compiler, making it trivial to comment out entire
blocks of code.
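
A sketch of one reading of that rule, in which every line of a /*
sublist is emitted as a // line for the compiler, so nested /*
entries cause no trouble. It works on an outline held as indented
lines, where leading spaces give the depth. That representation and
the names used are assumptions made for illustration. /** entries
are left alone.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockCommentFilter {
    /** Counts leading spaces, used here as the depth of an outline line. */
    static int depth(String line) {
        int i = 0;
        while (i < line.length() && line.charAt(i) == ' ') {
            i++;
        }
        return i;
    }

    /** Rewrites each "/*" entry and its whole sublist as "//" lines. */
    static List<String> forCompiler(List<String> outline) {
        List<String> out = new ArrayList<>();
        int commentDepth = -1;                 // -1 means "not inside a /* sublist"
        for (String line : outline) {
            int d = depth(line);
            if (commentDepth >= 0 && d <= commentDepth) {
                commentDepth = -1;             // back at or above the /* entry: sublist ended
            }
            String pad = line.substring(0, d);
            String text = line.substring(d);
            if (commentDepth < 0 && text.startsWith("/*") && !text.startsWith("/**")) {
                commentDepth = d;
                out.add(pad + "//" + text.substring(2));
            } else if (commentDepth >= 0) {
                out.add(pad + "// " + text);   // subentry: commented out line by line
            } else {
                out.add(line);
            }
        }
        return out;
    }
}
```

Because nothing is terminated with */, the end of the sublist is the
only thing that ends the commented-out region.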

When generating output for putting back into a source control
system, though, it's not clear that such a "destructive"
implementation is ideal. We'd have to convert the semantics we
recognize easily ("/*" == "comment-out-entire-block") into
something which has the equivalent effect when read by a person,
and which can be deconstructed into the original form when
filtered back into the editor.

The really interesting comments are the Java /** comments.
Since they can be up to a page or more in length, they too should
terminate at the end of the sublist so that the advantages of the
hierarchy accrue to them.

On input, each /** entry needs to become a CDATA section
so that any embedded HTML tags are ignored. All subentries must
become CDATA sections as well. It will take some fancy HTML
parsing to identify the right subelements to create. (Personally,
I would prefer source code to have links to documents. But Java
was designed in a world ruled by flat-text editors, rather than
HTML or XML, so keeping the comments in place seems like the
right idea.)
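
A sketch of the CDATA step using the standard org.w3c.dom API. The
element name "line" and the helper names are assumptions. The caller
is assumed to know whether an entry sits inside a /** sublist, and
the fancy HTML parsing is not attempted here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DocCommentImporter {
    /** Creates an empty DOM document to hold the source tree. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException("no XML parser available", ex);
        }
    }

    /**
     * Builds one "line" element. Text that starts a doc comment, or that sits
     * inside one, goes into a CDATA section so embedded HTML tags are not parsed.
     */
    static Element toEntry(Document doc, String text, boolean inDocComment) {
        Element line = doc.createElement("line");   // hypothetical single node type
        if (inDocComment || text.trim().startsWith("/**")) {
            line.appendChild(doc.createCDATASection(text));
        } else {
            line.appendChild(doc.createTextNode(text));
        }
        return line;
    }
}
```

Text that itself contains "]]>" cannot sit in a single CDATA section,
so a real importer would need to split it first.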

This strategy eases the development of the editor, though it complicates
the input and output processing. Most of all, though, it allows most
any editor to be used, for most any language, given the appropriate
filters. (To find errors, though, the output processor will have to
record line numbers in the structure, and the editor will need the
ability to go to them.)
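
A sketch of the line-number bookkeeping. For simplicity it keeps a
side table rather than writing the numbers into the structure itself,
and the entry ids are assumed to be whatever the editor uses to
locate a node.

```java
import java.util.ArrayList;
import java.util.List;

final class LineMap {
    // Index i holds the id of the entry that produced output line i + 1.
    private final List<String> entryIds = new ArrayList<>();

    /** Records that the next output line came from the given entry; returns its line number. */
    int record(String entryId) {
        entryIds.add(entryId);
        return entryIds.size();
    }

    /** Looks up the entry behind a compiler-reported line number, or null if out of range. */
    String entryAt(int lineNumber) {
        if (lineNumber < 1 || lineNumber > entryIds.size()) {
            return null;
        }
        return entryIds.get(lineNumber - 1);
    }
}
```

The output processor would call record once per emitted line, so a
wrapped comment or a supplied brace maps back to the entry that
caused it.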

Structure Disconnect Issues
---------------------------
The system won't be "bulletproof". Two "structure-disconnect"
issues remain. The advantage of hierarchies is automatically
keeping things together that belong together. There are a couple
of cases, though, where a separation can occur -- always by
user action, of course, but an undesirable separation nonetheless.

The first issue is that an else-clause can be disconnected from
the if clause, if it is coded as:
    + if ...
        ...code...
    + else
        ...code...

You could then move the if-clause and leave the else-clause behind.
Still, that's an easy error to diagnose and fix, and it's one you could
make with a plain text editor, as well. Then, too, the whole issue
can be sidestepped by virtue of coding style:
   + if (expression)
      + //then <explanation>
         ...code...
      + else // <explanation>
         ...code...

Both clauses now tuck under the if-statement nicely, and each is
explained as well, for a bit more literacy in the program.

The other issue is the separation of /** comments and the code
they attach to. The strategy outlined above produces code
like:
    + /** Big Method
          - This method returns an integer
          - that is the next prime number
          - starting from its input argument...
    + int nextPrime(int n)
          - // Implement Eratosthenes' Sieve
          - for (int i=0; ...)
             - // Is i prime?
             - ...

When compacted, two elements that clearly belong together
sit side by side:
    + /** Big Method
    + int nextPrime(int n)

That makes it possible to move one and leave the other
behind. But I suspect we can live with that. (In a hierarchical
system, you quickly learn to collapse the view before making
selections, to avoid exactly this kind of problem.)

Again, coding style might also step to the rescue:
    + // Big Method
       + /** This method returns an integer....
       + int nextPrime(int n)

Now the documentation and code are side by side, but both are
contained under a common heading.

Compensating for Manual Terminators
-----------------------------------
The one hairy part of the XML to plain text translator is compensating
for manually-entered terminators -- especially */ and ending-brace.

An existing language parser can be modified to generate SAX events, and
then plugged into Sun's DOM parser to generate a DOM from plain
source text. However the terminators are handled, that choice can be
reflected in the data structure that results. It is a simple matter to
drop the semi-colons, end-comment marks, and braces while parsing. The
structure already implies where they are needed, and they can be
automatically supplied on output.
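
A sketch of dropping terminators on input. It uses a naive line
scanner as a stand-in for a real language parser generating SAX
events, and it assumes one statement per line, braces at the start
or end of a line, and no braces or semi-colons hidden in strings or
comments. The names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class TerminatorStripper {
    /** Converts brace-delimited source lines into an indented outline without ; { }. */
    static List<String> toOutline(List<String> source) {
        List<String> outline = new ArrayList<>();
        int depth = 0;
        for (String raw : source) {
            String text = raw.trim();
            if (text.startsWith("}")) {
                // Closing brace: the outline depth already carries this information.
                depth = Math.max(0, depth - 1);
                text = text.substring(1).trim();
            }
            boolean opens = text.endsWith("{");
            if (opens || text.endsWith(";")) {
                text = text.substring(0, text.length() - 1).trim();
            }
            if (!text.isEmpty()) {
                outline.add("  ".repeat(depth) + text);
            }
            if (opens) {
                depth++;               // following lines belong to this entry's sublist
            }
        }
        return outline;
    }
}
```

Note that "} else {" comes out as a sibling of the if entry, which is
the first of the structure-disconnect cases described above.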

The problems begin with the editing. There is nothing to *prevent* a
user from entering an ending-brace, for example. At least that is
true in the normal XML editor. But to disallow an ending-brace in
source code and yet allow it in a comment, for example, is to require a
grammar-sensitive editor. Experience shows that such beasts are too
limited to become widespread, often unwieldy, and often terribly
constraining.

So the user can't reasonably be prevented from entering terminators any
more than they can be prevented from typing bad code. That means the
XML-to-plain-text translator has to be alert for the problem -- which
greatly increases the complexity it has to deal with. (I'm open to other
solutions. But that's the only one I see at the moment.)
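
A sketch of the kind of check the translator might run on each entry
before supplying terminators of its own. The enum and method names
are assumptions, and the check is deliberately shallow: it looks only
at the ends of the entry, so it would misread an array initializer
that ends in a brace, and it cannot tell that a subentry sits inside
a /* sublist.

```java
final class ManualTerminatorCheck {
    enum Kind { NONE, SEMICOLON, CLOSE_BRACE, END_COMMENT }

    /** Reports which terminator, if any, the user typed at the end of an entry. */
    static Kind trailingTerminator(String entry) {
        String text = entry.trim();
        if (text.startsWith("//")) {
            return Kind.NONE;              // anything is allowed inside a line comment
        }
        if (text.endsWith("*/")) {
            return Kind.END_COMMENT;       // the filter must not add a second one
        }
        if (text.startsWith("/*")) {
            return Kind.NONE;              // unterminated comment entry: nothing typed
        }
        if (text.endsWith("}")) {
            return Kind.CLOSE_BRACE;
        }
        if (text.endsWith(";")) {
            return Kind.SEMICOLON;
        }
        return Kind.NONE;
    }
}
```

The translator could then skip the terminator it would otherwise
supply, or flag the entry for the user to look at.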




