
[ba-unrev-talk] Instant Ontologies: The Strength of Weak Links


A few ideas rubbed together the other day, and it occurred
to me that a web crawler capable of parsing HTML pages to
find links already has enough intelligence to begin constructing
a first-cut ontology.    (01)

  Note:
  The mechanism described here may be something like the idea
  behind the Teoma search engine (http://www.teoma.com), although
  they may well have other mechanisms, in addition to this one.    (02)

The first thought was that "weak links" predict similarity much
better than "strong links". ("Strong links" describes clustered
material: pages in close proximity, with many individual links
among them, as well as links to other pages, all of which link
to each other.)    (03)

In this context, it makes sense to think of a directory hierarchy as
"linked". So it's clear that a collection of pages at a company or a
college has something in common, but generally such a collection
of pages embodies *many* ontological concepts. So strongly
linked pages are not that good for identifying concepts.    (04)

But if two separate clusters have a single connection
between them -- a weak link -- then that link implies *some* kind
of similarity. That recognition then entails two further problems:
   a. Giving a name to the concept that identifies the similarity.
   b. Separating reference-type links (and other "non-similar" links)
      from links that indicate similarity ("other things of this kind").    (05)
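
To make the "weak link" idea concrete, here is a minimal sketch in
Python. It assumes a cluster can be approximated by a host name, and
that a "weak" link is one that is the only connection between two
hosts. Both are simplifications for illustration; the function name
and the max_count parameter are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def weak_links(links, max_count=1):
    """Return links that are (nearly) the only tie between two clusters.

    `links` is an iterable of (source_url, target_url) pairs. A cluster
    is approximated by the host name, which is an assumption.
    """
    links = list(links)

    def cluster(url):
        return urlparse(url).netloc.lower()

    # Count links between each unordered pair of distinct clusters.
    pair_counts = Counter()
    for src, dst in links:
        a, b = cluster(src), cluster(dst)
        if a != b:
            pair_counts[frozenset((a, b))] += 1

    # Keep cross-cluster links whose cluster pair is sparsely connected.
    return [
        (src, dst)
        for src, dst in links
        if cluster(src) != cluster(dst)
        and pair_counts[frozenset((cluster(src), cluster(dst)))] <= max_count
    ]
```

A crawler would feed this the (page, link target) pairs it has
already extracted. Links inside one host are ignored, since those
are the "strong" links.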

For example, on a page describing exercises, there could be
references to anatomy descriptions, and links to equipment
manufacturers, as well as links to similar exercises. Each would
be a weak link, but any similarities would be non-obvious.    (06)

The problem is to identify which links indicate "similarity". But
it occurs to me that HTML formatting may well provide enough
clues to make some good guesses.    (07)

Basically, a "weak link" page that gives a list of links is more likely
than not to be identifying an ontological concept.    (08)

The format for such concept references would be:    (09)

  1. A heading with one or two major words. For example:
      --Equipment
      --Exercise Equipment
      --Exercises
      --Authors of Note
      --Signs of the Times    (010)

  2. A short paragraph of introductory text.    (011)

  3. List items containing short paragraphs, each with one link    (012)
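
The three-part format above could be detected with something like
the following sketch, which uses Python's standard html.parser
module. The class name, the SMALL_WORDS set, and the choice to look
only at h1-h6 headings and ul/ol lists are assumptions. It does not
check for the introductory paragraph, and it does not handle nested
lists.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def major_words(heading):
    """Words in a heading that are not small filler words."""
    return [w for w in heading.lower().split() if w not in SMALL_WORDS]


class ConceptListParser(HTMLParser):
    """Collect (heading, [hrefs]) for lists whose items each hold one link."""

    def __init__(self):
        super().__init__()
        self.concepts = []       # finished (heading, hrefs) pairs
        self._heading = None     # text of the most recent heading
        self._in_heading = False
        self._buf = []
        self._items = None       # one list of hrefs per <li> in the open list

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self._buf = []
        elif tag in ("ul", "ol"):
            self._items = []
        elif tag == "li" and self._items is not None:
            self._items.append([])
        elif tag == "a" and self._items:
            href = dict(attrs).get("href")
            if href:
                self._items[-1].append(href)

    def handle_data(self, data):
        if self._in_heading:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
            self._heading = " ".join("".join(self._buf).split())
        elif tag in ("ul", "ol") and self._items is not None:
            items, self._items = self._items, None
            heading, self._heading = self._heading, None
            if (
                heading
                and 1 <= len(major_words(heading)) <= 2
                and items
                and all(len(hrefs) == 1 for hrefs in items)
            ):
                self.concepts.append((heading, [h[0] for h in items]))
```

After calling feed() with a page's HTML, the concepts attribute
holds candidate concept names paired with the links listed under
them.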

Of course, there are some lists that would not be useful. For
example, JavaWorld articles always end with a "Resources"
section. The concept is obviously not "resources", but is
rather the subject matter covered in the article.    (013)

Still, it would be possible to filter out the limited number of
such headings ("for more information", "further reading",
and the like), the same way that small words like "of" and
"the" would be filtered out. What's left, in the context of
the web, would be a collection of named ontological
concepts that could be reviewed and edited.    (014)
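
The filtering step might look something like this sketch. The two
word lists are illustrative assumptions, not a complete stop list,
and concept_name is a hypothetical helper.

```python
# Headings that name a kind of list rather than a concept (assumed set).
GENERIC_HEADINGS = {"resources", "for more information", "further reading"}
# Small words to drop from a heading (assumed set).
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def concept_name(heading):
    """Return a normalized concept name, or None for a generic heading."""
    normalized = " ".join(heading.lower().split()).strip(" :.")
    if normalized in GENERIC_HEADINGS:
        return None
    words = [w for w in normalized.split() if w not in SMALL_WORDS]
    return " ".join(words) or None
```

Under these assumptions, "Further Reading" is discarded and
"Signs of the Times" is reduced to "signs times", which a human
reviewer could then rename or merge with a duplicate.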

Of course, at this point the "ontology" would look like a
simple list of concepts, with no ordering or structuring.
And duplicate concepts with different names would have
to be linked, somehow.    (015)

But it could be a start. Further examination of structural
relationships might well lead to connections within the
ontology. For example, if the concept of "bicycles" is
identified, and a "parts list" on several pages contains
a "derailleur" entry, then perhaps it would be possible to
identify the "derailleur is part of a bicycle" relationship.    (016)

Similarly, a book that showed up in the "resources" section
of a few pages could lead to "book x is a resource for
bicycles".    (017)
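
As a rough sketch of how such relationships might be proposed,
suppose the crawler has already reduced each list entry to a
(page, concept, list heading, entry) record. That record shape, the
heading-to-relation table, and the min_pages threshold are all
assumptions made for illustration.

```python
from collections import defaultdict

# Which relation a list heading suggests (assumed mapping).
RELATION_FOR_HEADING = {
    "parts list": "is part of",
    "resources": "is a resource for",
}


def infer_relations(observations, min_pages=2):
    """Propose (entry, relation, concept) triples seen on several pages.

    `observations` is an iterable of
    (page_url, concept, list_heading, entry) tuples.
    """
    pages = defaultdict(set)
    for page, concept, heading, entry in observations:
        relation = RELATION_FOR_HEADING.get(heading.lower())
        if relation:
            pages[(entry.lower(), relation, concept.lower())].add(page)
    # Only keep triples supported by enough distinct pages.
    return sorted(t for t, seen in pages.items() if len(seen) >= min_pages)
```

An entry that appears on only one page is dropped, on the theory
that a relationship is more believable when several independent
pages agree on it. The results are candidates for review, not
established facts.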

I dunno. It's an interesting possibility -- that with a modicum
of semantic knowledge, it might be possible to construct a
very sizable ontology from the contents of the web.    (018)