Eric Armstrong wrote:
> Eugene Eric Kim wrote:
> > On Fri, 13 Apr 2001, Murray Altheim wrote:
> > > If an editor adds a third paragraph to section four, what was,
> > > say, "D3 (12)" now becomes "D4 (13)". Links made to this
> > > paragraph in earlier versions of the document now point to
> > > the newly added paragraph.
> > Nope. The proper behavior should be that "D3 (12)" becomes "D4 (12)".
> The existence of static, unique IDs makes the linking problem
> as easy as "possible" to solve in the face of evolving documents.
> There are a number of corollaries:
> 1) The ID should ideally be globally unique.
> That requirement allows me to modify a document, while you
> do likewise, and merge our additions with a minimum of
> overlap problems. (The fact that we both created nodes in
> the same place creates a decidable problem: Should your
> change come before or after mine. But if we both create a
> node with the same ID, the issue is realistically
> undecidable: Who gets the original ID, who gets a different
> ID, and what happens as a consequence of changing the ID
> you *thought* you had created, if you are the loser.)
I'm not sure how to understand the approach you've taken regarding
node identifiers. By definition, at least in SGML and XML, IDs are
always unique within a document. There are no "winners" or "losers";
that concept frightens me. IDs are simply syntactical devices that
should never be considered *part* of a document's content. As such,
the idea that they are valuable outside the confines of a single
document is incorrect. An ID is valid and functional for a specific
document. If we go back to the concept of FPIs, nobody would ever
(unless they didn't understand the concept) republish a different
document using the same FPI. If two documents are merged, the idea
that either author "wins" seems to me strange; reprocessing of such
content will likely recreate the entire ID namespace (and perhaps
should, just to remove any confusion).
Following on from what I consider some very erroneous concepts
derived within the W3C on names and addresses, we now have a real
muddle: documents "identified" by their URLs that may (and often
do) change fundamentally. There's no guarantee whatsoever that a
link to an ID on the AT&T home page (for example) will be stable,
with server-side processing becoming more common, and with it even
the ability to turn off access to an entire site.
But enough of the rant. The point here is that link integrity is
something that can't be guaranteed at a syntax level. I do believe
there are some fundamental rules -- that if everyone were to play
well with others and follow those rules we might increase link
integrity. But global uniqueness should come not from the ID
alone but from the ID in combination with the document identifier.
The document identifier should *not* be its URL, but some kind of
identifier that is a combination of its *name* and *version*. This
metadata can be stored either within the document or external to
it.
With the advent of XPath and XPointer, IDs will become only one
of a number of access methods. The above concept of publication
with *name* and *version* will become even more important.
The guarantee of ID integrity is only provided for a specific
document identifier, although perhaps some indication of ID
integrity could be provided as well. But given the history of
the Web and the lack of any real guarantees, it's better to
devise a system that *inherently* treats alterations of documents
as *new* documents. This could, for example, be a combination
of a document's URL or URN and its CVS identifier, or some such.
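As a minimal sketch of the name-plus-version idea above: the ID is
only meaningful together with the identifier of the document version
it belongs to. The class names, the URN, the revision labels, and the
string layout below are all hypothetical, chosen only for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentId:
    name: str      # stable name, e.g. a URN (the value used below is made up)
    version: str   # revision label, e.g. a CVS revision number


@dataclass(frozen=True)
class NodeRef:
    document: DocumentId
    node_id: str   # unique only within that one document version

    def __str__(self) -> str:
        # Assumed serialization; not a standard URI or XPointer syntax.
        return f"{self.document.name};v={self.document.version}#{self.node_id}"


# The same local ID in two versions yields two distinct references.
old = NodeRef(DocumentId("urn:example:spec", "1.3"), "p12")
new = NodeRef(DocumentId("urn:example:spec", "1.4"), "p12")
print(old, new, old == new)
```

Because an altered document gets a new version, a link made against
version 1.3 never silently points into version 1.4.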
> 2) A hierarchical link may be *displayed* as A.1.B.4, but it
> can be internally *stored* as a root document (to define
> the context) and a unique ID (for the node). When that link
> is transmitted, it is the root and ID which is transferred.
> At display time, the hierarchical path is *constructed* from
> the root to the uniquely-specified node, and that path is
> displayed.
I think it's dangerous to create "hierarchical" links that are
in some way different from what they *mean*, and they do have
meaning: "A.1.B.4" means "Section A, subsection 1, sub-sub-section
B, sub-sub-sub-section 4." Given that the Web offers very poor
guarantees of formatting or structure, and that different
stylesheets may be applied in different contexts, it's often
difficult for people to ascertain such structures.
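For reference, the scheme under discussion (a static ID is stored and
the hierarchical label is derived at display time) might look roughly
like the sketch below. The Node class, the numeric IDs, and the
labelling rule (letters for sections, numbers for paragraphs) are
assumptions made for illustration, not anyone's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    node_id: int                      # static ID, never reassigned
    children: List["Node"] = field(default_factory=list)


def find_path(node: Node, target_id: int, path: List[int]) -> Optional[List[int]]:
    """Return the child indexes leading from node to target_id, or None."""
    if node.node_id == target_id:
        return path
    for index, child in enumerate(node.children):
        found = find_path(child, target_id, path + [index])
        if found is not None:
            return found
    return None


def label_for(root: Node, target_id: int) -> Optional[str]:
    """Build a display label such as 'D3' from the node's current position."""
    path = find_path(root, target_id, [])
    if path is None:
        return None
    parts = []
    for depth, index in enumerate(path):
        # Assumed scheme: letters at even depths, 1-based numbers at odd depths.
        parts.append(chr(ord("A") + index) if depth % 2 == 0 else str(index + 1))
    return "".join(parts)


# Sections A-D have IDs 1-4; section D holds paragraphs with IDs 10, 11, 12.
section_d = Node(4, [Node(10), Node(11), Node(12)])
doc = Node(0, [Node(1), Node(2), Node(3), section_d])

before = label_for(doc, 12)             # label of node 12 before the edit
section_d.children.insert(2, Node(13))  # an editor adds a new third paragraph
after = label_for(doc, 12)              # same ID, new label
print(before, after, label_for(doc, 13))
```

The label moves from "D3" to "D4" while the ID stays 12, which is
exactly why the displayed label no longer tells a reader which
paragraph an older link was made against.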
> 3) The alternative is nasty, but possible: The hierarchical
> link is stored, along with the unique ID. Every change to
> the organizational structure causes a ripple of:
> a) All nodes affected by the change in a given document
> b) All links that reference those nodes
> The hierarchical paths are then modified in all of the
> links which reference affected nodes. Ugly, ugly, ugly.
> Especially in a distributed environment. But possible.
Perhaps a less ugly alternative exists: with the publication of
any document update (which would be published under a different
identifier), the mapping between old and new IDs could be provided
along with the status of the nodes being identified (which may
include textual changes, etc. within the ID-identified node). A
*fairly* simple tool that could process such metadata isn't out
of the realm of reason.
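A rough sketch of such a tool, assuming the publisher ships a simple
old-ID to new-ID table with each update. The Status values, the Entry
record, and the sample IDs are hypothetical; no existing metadata
format is implied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    UNCHANGED = "unchanged"   # same node, same text
    MODIFIED = "modified"     # same node, text edited
    REMOVED = "removed"       # node no longer exists


@dataclass(frozen=True)
class Entry:
    new_id: Optional[str]     # None when the node was removed
    status: Status


def migrate(old_id: str, id_map: Dict[str, Entry]) -> Entry:
    """Look up where a link into the old version should point in the new one."""
    try:
        return id_map[old_id]
    except KeyError:
        raise KeyError(f"no mapping published for ID {old_id!r}") from None


# Hypothetical map published alongside a document update.
id_map = {
    "p11": Entry("p11", Status.UNCHANGED),
    "p12": Entry("p13", Status.MODIFIED),
    "p14": Entry(None, Status.REMOVED),
}

for old_id in ("p11", "p12", "p14"):
    entry = migrate(old_id, id_map)
    print(old_id, "->", entry.new_id, entry.status.value)
```

A link checker could use the status to decide whether to rewrite a
link silently, flag it for review, or report it as broken.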
Sorry if this is a bit of a brain dump, and that my response is
so Web-centric. Like many of us who have watched the evolution of
the Web, I've become somewhat cynical about the integrity of the
system, and my feeling is that it's better to design a document
management system around the Web's known (and likely unfixable)
flaws.
Murray Altheim <mailto:firstname.lastname@example.org>
XML Technology Center
Sun Microsystems, Inc., MS MPK17-102, 1601 Willow Rd., Menlo Park, CA 94025
In the evening
The rice leaves in the garden
Rustle in the autumn wind
That blows through my reed hut. -- Minamoto no Tsunenobu
This archive was generated by hypermail 2.0.0 : Tue Aug 21 2001 - 17:58:05 PDT