
Re: Great? idea for improving this list (was Re: [ba-ohs-talk] Freezope learning environments)


The other way to do things parallels (I think) some of the
stuff that Chris Dent has done.
1- Parse the existing archive for terms, recording locations of terms
2- Cull out anything useless like stop-words (e.g. 'the', 'and')
3- Parse any new mails against this growing index, recording locations
of terms
4- Check the new terms list every now and again.
(Repeat 2 as necessary.)    (01)

5- Make a topic map/semantic net out of the terms if you like,
for future uses, graphical interfaces, paraphrase searches, whatever...    (02)
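
Steps 1 to 4 above could be sketched roughly as follows. This is only a
minimal illustration, not what Chris Dent's tools actually do: the
function names, the tiny stop-word list and the message ids are all
made up for the example, and "location" is assumed to mean a
(message id, word position) pair.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}

# Crude tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def index_message(index, msg_id, text):
    """Steps 1 and 3: record (msg_id, word position) for every term."""
    for position, term in enumerate(TOKEN_RE.findall(text.lower())):
        index[term].append((msg_id, position))


def cull_stop_words(index, stop_words=STOP_WORDS):
    """Step 2 (repeatable): drop useless terms from the index."""
    for term in stop_words:
        index.pop(term, None)


def new_terms(index, known_terms):
    """Step 4: terms that have appeared since the last review."""
    return sorted(set(index) - set(known_terms))


# Build the index from the "existing archive" (one message here).
index = defaultdict(list)
index_message(index, "msg-1", "The topic map and the semantic net")
cull_stop_words(index)
known = set(index)

# A new mail arrives and is parsed against the growing index.
index_message(index, "msg-2", "A topic map of the archive")
cull_stop_words(index)
```

Calling new_terms(index, known) after the second message gives the
terms someone would need to look over before deciding whether the
stop-word list should grow.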

One advantage I see in doing it this way is that statistical NLP can be
used to attempt to tackle polysemy as mails come in, whilst the process
itself is largely automatic after the first few iterations.    (03)

(I must get around to reading the GATE manual.)    (04)

--
Peter    (05)



----- Original Message -----
From: "Murray Altheim" <m.altheim@open.ac.uk>
To: <ba-ohs-talk@bootstrap.org>
Sent: Thursday, April 25, 2002 1:45 PM
Subject: Re: Great? idea for improving this list (was Re: [ba-ohs-talk]
Freezope learning environments)    (06)


> <?urn:ba-ohs:keywords: email, metadata ?>
>
> Eugene Eric Kim wrote:
>
> > On Fri, 19 Apr 2002, Alex Shapiro wrote:
> >
> >
> >>Why not make a list of metadata keywords that could optionally be
> >>prepended to the subject line of posts to this group.
> >>
> >
> > ...
> >
> >>What do you guys think?
> >
> > I'm willing to give it a shot.  Who else is in?
> >
> > Doesn't have to be me writing the program to cull keywords, although
> > I'll definitely do it if enough people use them.
>
>
> Honestly, I'd prefer it if we not do this. The subject lines of most
> messages are already overburdened with overloaded metadata. Since this
> could work just as well on content in body text, why not add something
> a program could grep on, either at the beginning of the text or the
> end? It's also got to be something that works in both plaintext and
> HTML mail messages, and should have some type of namespace identifier:    (07)

>
>     [beginning of email message]
>     [email header]
>     [first whitespace]
>     <?urn:ba-ohs:keywords: keyword_1, short phrase_1, ...keyword_n ?>
>
> using some sort of content delimiter such as "{...}", "{{...}}", etc.
> I chose the XML PI notation because it's something easily parsed,
> and something simple to type. Essentially, you'd grep for
>
>    <?urn:ba-ohs:keywords:
>
> and then parse up until "?>", using commas as delimiters within. The
> opening and closing delimiters could be anything so long as the URN
> was there, so it could likewise be something like:
>
>    {urn:ba-ohs:keywords: keyword_1, short phrase_1, ...keyword_n }
>
> This could eventually be embedded in the mail header when email
> software gets smart about this, if it ever does.
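
The grep-and-split that Murray describes might look something like the
sketch below. The function name extract_keywords and the regular
expression are assumptions made for illustration. It accepts either the
"<?...?>" or the "{...}" delimiters, does not insist that the opener
and closer match, and does not try to handle HTML mail in which "<"
has been escaped.

```python
import re

# Find the URN marker after either opening delimiter, then capture
# everything up to the first closing delimiter ("?>" or "}").
KEYWORDS_RE = re.compile(
    r"(?:<\?|\{)urn:ba-ohs:keywords:(.*?)(?:\?>|\})",
    re.DOTALL,
)


def extract_keywords(body):
    """Return the comma-separated keywords from a message body, if any."""
    match = KEYWORDS_RE.search(body)
    if match is None:
        return []
    # Commas delimit keywords; surrounding whitespace is not significant.
    return [kw.strip() for kw in match.group(1).split(",") if kw.strip()]
```

Splitting on commas rather than on whitespace is what lets short
phrases such as "topic maps" through as a single keyword.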
>
> The real question it seems to me here is to somehow regularize
> the tokens used within a community so that people begin to use
> the same ones to describe the same things. This starts to sound
> like a job for topic maps!
>
> Murray
>
> ......................................................................
> Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
> Knowledge Media Institute
> The Open University, Milton Keynes, Bucks, MK7 6AA, UK
>
>       In the evening
>       The rice leaves in the garden
>       Rustle in the autumn wind
>       That blows through my reed hut.  -- Minamoto no Tsunenobu
>
>    (08)