Re: Great? idea for improving this list (was Re: [ba-ohs-talk] Freezope learning environments)
I was thinking of something simpler.
Maybe I used the wrong terminology. I'll try again.
What I was suggesting was a system that: (01)
a) Reads an email and sucks out each word in turn.
b) Each new word has a database record created, and the
locations of occurrence of the term in another related table.
Leaving aside the issue of polysemy for a moment, the
record structure would be something like
PK_ID, word_string <--relation--> FK_ID, location(s).
c) To improve the scanning process, have a subroutine that
discards the stop-words chosen, and clean the database of
d) Repeat for each mail.
e) If a word is re-encountered then only the new location for
the word is inserted in the database in the appropriate new tuple. (02)
What you then get is an index for every mail in the archive that
contains all the interesting words in all the mails in the archive and
the locations in the mails of all those words. (03)
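
A minimal sketch of steps a) to e) in Python, under some assumptions: a tiny hand-picked stop-word list, a simple regex tokenizer, and in-memory dictionaries standing in for the two related tables. The names (STOP_WORDS, index_mail, lookup, lexicon, locations) are made up for illustration, and a location here is just a (mail id, word position) pair.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "of", "in", "to", "is"}

lexicon = {}     # word_string -> PK_ID (stands in for the word table)
locations = {}   # FK_ID -> list of (mail_id, word_position) (the related table)

def index_mail(mail_id, text):
    """Add every non-stop-word in one mail to the index."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for position, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        # A new word gets a new record; a known word reuses its key,
        # so only the new location is stored.
        word_id = lexicon.setdefault(word, len(lexicon) + 1)
        locations.setdefault(word_id, []).append((mail_id, position))

def lookup(word):
    """Return all recorded locations for a word, or an empty list."""
    word_id = lexicon.get(word.lower())
    return locations.get(word_id, [])

index_mail(1, "The index maps words to locations.")
index_mail(2, "A topic map links the words in the index.")
print(lookup("index"))
```

In a real system the two dictionaries would presumably be database tables joined on the key, and stop-words would be filtered against a list that is reviewed from time to time.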
Sophistication could be added in the read-in phase.
For example, polysemy might be attacked by some algorithm that
makes guesses about the word type based on a grammar.
Locations might be narrowed to paragraphs by chunking them beforehand.
And so on. (04)
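
As a rough illustration of narrowing locations to paragraphs, the following Python sketch splits a mail body on blank lines and records a paragraph number per word. It assumes paragraphs are separated by blank lines, which will not hold for every mail; the function names and the sample text are invented for the example.

```python
import re

def chunk_paragraphs(text):
    """Split a mail body into paragraphs on blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p for p in parts if p.strip()]

def paragraph_locations(mail_id, text):
    """Map each word to the (mail_id, paragraph_number) pairs it occurs in."""
    found = {}
    for number, paragraph in enumerate(chunk_paragraphs(text), start=1):
        # Use a set so a word is recorded once per paragraph.
        for word in set(re.findall(r"[a-z0-9']+", paragraph.lower())):
            found.setdefault(word, []).append((mail_id, number))
    return found

body = "Topic maps are useful.\n\nAn index helps too.\n\nTopic maps again."
print(paragraph_locations(7, body)["topic"])
```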
Then if one wants, one could take all the words gathered in the lexicon
created and put in useful associations as in a topic map. (05)
----- Original Message -----
Sent: Friday, April 26, 2002 12:43 AM
Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk]
Freezope learning environments) (07)
> On Thu, 25 Apr 2002, Peter Jones wrote:
> > The other way to do things parallels (I think) some of the
> > stuff that Chris Dent has done.
> Please note that the project that Kathryn La Barre and I are
> working on was started by Kathryn and really comes out of her
> brain. I joined in as a technical resource but then found it
> so interesting I wanted to be more involved. In case it wandered
> into obscurity I'm referring to this:
> > 1- Parse the existing archive for terms, recording locations of
> > 2- Cull out anything useless like stop-words (e.g. 'the', 'and',
> > 3- Parse any new mails against this growing index, recording
> > of terms
> > 4- Check the new terms list every now and again.
> > (Repeat 2 as necessary.)
> > 5- Make a topic map/semantic net out of the terms if you like,
> > for future uses, graphical interfaces, paraphrase searches,
> I think there is value to three different methods and it isn't
> clear which is best:
> 1 fully automated term and cluster generation
> 2 automated cluster generation with human labelling to give
> facets to messages
> 3 human tagging
> I'm not, at this time, aware of a system that will do 1 and label
> the clusters by anything other than highest frequency terms. This
> has limited value.
> 2 is what Kathryn and I are working on.
> 3 seems to be what Alex and others are suggesting.
> Doing 3 would make 2 much more valuable; the human tag could be
> one of several facets. Cluster membership could be another.
> Identification/classification of new messages compared against an
> existing archive is possible with a variety of methods. One might
> be to create a vector that represents the incoming message and
> compare it against vectors that represent pseudo-documents that
> represent the prototypes of the already generated clusters. Then
> you can say, "It is highly likely that this message is similar to
> this cluster" and tag it as such.
> Vector space models like that, though, are very easy to corrupt.
> The unrev-ii archive is full of silly little footers from eGroup,
> YahooGroups etc that can throw off the math (we have most of them
> parsed out now). Length makes a big difference (we don't have
> consistent lengths at all). A good stopword list is crucial but
> is hard to create for a list as wide ranging as this one and
> Kathryn and I will be moving into the next phase of our work in
> the middle of May. Comments on directions or tools worth trying
> are desired. This message from Peter should help us to focus
> somewhat. Others like it would be wonderful.
> I very firmly believe that augmentation != automation. If we want
> to develop tools that allow us to work better (more effectively,
> more efficiently, with more fun) the systems we develop in our
> own behaviors for interacting with the tools are as important as,
> or more important than, the tools themselves. Alex's idea:
> is an excellent way for us to make a slight change in our own
> behavior and gain a lot of flexibility in the tools we (the group
> at large) are able to develop.
> My personal preference would be to do something uncomplicated:
> since we do not have aids to help add the keywords, we need to
> make the barrier to use as low as possible. Murray's ideas:
> are good but I might not do it if I had to type all that.
> Something simple like:
> would get the ball rolling (bootstrapping, yeah?).
> I agree that the keywords should _not_ be in the subject or we
> suffer from thread creep in bad mail readers and overly long
> subjects. In the body is where we, the people, can put them and
> read them. The computers can put them and read them anywhere, so
> we may as well put them in the body. For now. In the future there
> will be tools that let us do it, do it for us, anywhere in the
> message or out of band.
> (I imagine a document composer that allows you to compare your
> text, prior to delivery, with a large net-wide classification
> system, nominating keywords and other identifiers that you could
> accept or reject. A system that used vector space style models
> would preserve some degree of privacy (depending on how it was
> done) because the text itself would not be transmitted to the
> It is interesting to note that Usenet news messages have had a
> "Keywords:" header for a _long_ time. I don't really see them
> actually being used for much, though.
> > (I must get around to reading the GATE manual.)
> What's this?
> Chris Dent <firstname.lastname@example.org>
> "Mediocrities everywhere--now and to come--I absolve you all! Amen!"
> -Salieri, in Peter Shaffer's Amadeus