Re: Great? idea for improving this list (was Re: [ba-ohs-talk] Freezope learning environments)

cdent wrote:    (01)

> [archive_access.practical]
> On Fri, 26 Apr 2002, Peter  Jones wrote:
>>What I was suggesting was a system that:
>>a) Reads an email and sucks out each word in turn.
>>b) Each new word has a database record created, and the
>>locations of occurrence of the term in another related table.
>>Leaving aside the issue of polysemy for a moment, the
>>record structure would be something like
>>PK_ID, word_string <--relation--> FK_ID, location(s).
>>c) To improve the scanning process, have a subroutine that
>>discards the stop-words chosen, and clean the database of
>>d) Repeat for each mail.
>>e) If a word is re-encountered then only the new location for
>>the word is inserted in the database in the appropriate new tuple.
> In what ways are you imagining this being different from a free
> text index of the mail archive that gets reindexed every time a
> new message comes in?    (02)
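
To make the comparison concrete: steps a) to e) above amount to an
inverted index, which is also the core of a free text indexer. Below
is a minimal sketch in Python, not anyone's actual implementation.
The class name WordIndex, the stop-word list, and the choice of
(message id, token position) as a "location" are all assumptions
made for illustration, and two in-memory dicts stand in for the two
related database tables.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real deployment would choose its own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

# Assumed tokenizer: runs of lowercase letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


class WordIndex:
    """In-memory stand-in for the two related tables in step b)."""

    def __init__(self):
        self.word_ids = {}                  # word_string -> PK_ID
        self.locations = defaultdict(list)  # FK_ID -> [(message_id, position)]

    def add_message(self, message_id, text):
        """Steps a) to e): scan one mail and record where each word occurs."""
        for position, word in enumerate(WORD_RE.findall(text.lower())):
            if word in STOP_WORDS:
                continue  # step c): discard stop-words
            # Step b): a new word gets a new record; step e): a word seen
            # before keeps its id and only gains a new location.
            word_id = self.word_ids.setdefault(word, len(self.word_ids) + 1)
            self.locations[word_id].append((message_id, position))

    def lookup(self, word):
        """Return every recorded location of a word, oldest first."""
        word_id = self.word_ids.get(word.lower())
        if word_id is None:
            return []
        return list(self.locations[word_id])
```

Calling add_message() once per incoming mail is step d), and
lookup("archive") then returns the list of (message id, position)
pairs recorded for that word. Polysemy is ignored here, as it is in
the quoted description.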

This field has been researched for a long time, and many
individuals and organizations provide valuable full text search
tools. Full text search is quite different from having an author
manually select and include keywords describing the content, in
that it doesn't require the author to do anything. Every time I type
one of the messages in this thread I think about the fact that I
haven't added any keywords, and that it would be difficult to
characterize this conversation in a way that differentiates it from
every other conversation about improvements to existing systems. Our
lexicon is fairly limited, yet such a narrow scope makes it no less
difficult to choose keywords intelligently enough that the effort
would be an improvement over a full text search.    (03)

[...]    (04)

>>Sophistication could be added in the read-in phase.
>>For example, polysemy might be attacked by some algorithm that
>>makes guesses about the word type based on a grammar.
>>Locations might be narrowed to paragraphs by chunking them beforehand.
>>And so on.
> You make this sound easy. After watching the list for a while it
> is clear that we don't have the collective time for this measure
> of complexity.  Are we talking about implementing something to
> use now and experiment and develop, or are we talking about an
> ideal eventual system that would work in a variety of capacities?
> We can talk the theory (I'd love to) but that stuff has been
> beaten to death here and elsewhere. How do we distinguish between
> the speculative talk and the plans for action?    (05)
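
As a rough gauge of the simplest refinement quoted above, narrowing
locations to paragraphs, here is a minimal sketch in Python. It
assumes paragraphs are separated by blank lines and that a location
is a (message id, paragraph number) pair; the function name
paragraph_locations is hypothetical. Guessing word types from a
grammar is a far larger job and is not attempted.

```python
import re


def paragraph_locations(message_id, text):
    """Yield (word, (message_id, paragraph_no)) pairs for one message."""
    # Chunk the message beforehand: one or more blank lines end a paragraph.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    for paragraph_no, paragraph in enumerate(paragraphs, start=1):
        # Each distinct word is reported once per paragraph, in sorted order.
        for word in sorted(set(re.findall(r"[a-z0-9']+", paragraph.lower()))):
            yield word, (message_id, paragraph_no)
```

Feeding these pairs into whatever index is in use gives hits at
paragraph granularity instead of whole-message granularity.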

And how do we come to some plan of action that doesn't either
1. put requirements on us to characterize every single email or
email thread (and in a way that is actually valuable), or
2. provide little more than an existing full text search engine
would provide? We can always experiment with different search
engines on the archives without requiring authors to change their
behaviour (always a problem). Sun Labs put out a fairly
sophisticated engine recently, which I think isn't open source
but can be used for free.    (06)

Murray    (07)

Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK    (08)

      In the evening
      The rice leaves in the garden
      Rustle in the autumn wind
      That blows through my reed hut.  -- Minamoto no Tsunenobu    (09)