
Re: [ba-ohs-talk] Keyword Indexing


Alex Shapiro wrote:    (01)

> Regarding:
> http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00126.html
> http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00132.html
> http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00176.html
> http://www.bootstrap.org/lists/ba-ohs-talk/0204/msg00194.html
> 
> I strongly agree with the above messages, especially Chris Dent's last 
> message which I was going to quote.  However, I found myself saying 
> Exactly, Exactly, Exactly to all the lines :) so I guess that I'll just 
> spare the clutter.
> 
> *1* KWD LOCATION: I think that we are in agreement that the keywords 
> should be the first line of the email.    (02)


Yes, this makes the most sense. If this really got going, it could
be put into the interface, but we can't expect that.    (03)


> *2* KWD ENVELOPE: I was tempted to agree with Murray's suggestion that 
> the envelope for the keywords should be more complex than [], because of 
> the argument that some emails might be html based.  However, I think 
> that everyone on this list mostly uses text, and I see that one of my 
> messages which had some bold text (which I assume needs html), came out 
> looking like plain text in the ba-ohs archives.  So, as long as parsing 
> is not a problem, which it looks like it won't be, I say that using 
> square brackets around the keywords is fine.    (04)


We should strongly avoid "]]" because "]]" followed by ">" ends a CDATA
section, which *could* cause problems when processing in XML. But
the thing for me was having a namespace identifier, even a short one:
something beyond a single character like "[" to search on, since any
single character shows up in plain text. If the URN is too long or
too hard to remember or type, then I suggest this:    (05)

    [KEYS: word1, word2, word3 ]    (06)

so we'd write parsers that looked for "[KEYS:" instead of "[". We could
make it case insensitive, but I'd say let's not. My preference would
still be [urn:ba-ohs:keys: word1, word2, word3 ] but I can understand
people's reservations about having to type that.    (07)
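
As a rough sketch of the parser described above (not an agreed
implementation), the following looks for a case-sensitive "[KEYS:"
envelope on the first line of a message body and splits the contents on
commas. The function name parse_keys and the decision to ignore any
envelope that is not on the first line are assumptions for illustration.

```python
import re

# Envelope proposed in the message: "[KEYS: word1, word2, word3 ]".
# Matching is deliberately case sensitive, as suggested above.
KEYS_RE = re.compile(r"^\[KEYS:(.*)\]\s*$")


def parse_keys(body: str) -> list[str]:
    """Return the keywords on the first line of body, or [] if none."""
    first_line = body.split("\n", 1)[0].strip()
    match = KEYS_RE.match(first_line)
    if match is None:
        return []
    # Commas separate keywords; underscores, dashes, and spaces inside
    # a keyword are left untouched.
    return [k.strip() for k in match.group(1).split(",") if k.strip()]


if __name__ == "__main__":
    print(parse_keys("[KEYS: word1, word2, word3 ]\nMessage text..."))
```

Searching for the literal "[KEYS:" rather than a bare "[" is what keeps
ordinary bracketed text in a message from being mistaken for keywords.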


> *3* KWD FORMAT: We need to agree on some sort of word separation 
> standard for keywords.  The above thread has contained the following 
> formats: FooBar, Foo_Bar, Foo-Bar.  I don't have much of a preference, 
> but I think that either the first or the second is better.  The 
> underscored version Foo_Bar seems to be the most readable.    (08)


I don't understand what's wrong with a comma. These keywords will need
to intermix with other keyword schemes, and there may be keywords that
already contain underscores, dashes, or camel-cased style.    (09)


> *4* KWD SELECTION: In message #126 Eric suggests that we take the time 
> to come up with a list of keywords.  We could do this, but first I think 
> that we might experiment with coming up with a minimal set of basic 
> keywords, and then having every new keyword automatically added to the DB.    (010)


We could do what Netscape mail does with its address book, and have a
scraper pull all the keywords off our emails, check redundancies, and
write them to a web page we have access to as a suggested index list.
Then it's up to each one of us to scan that web page for suggested keys,
or add new ones as they arise as subjects. The problem with trying to
develop an a priori ontology is that our minds keep expanding......    (011)
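
The scraper idea above might look something like this sketch. It
assumes the keywords have already been extracted from each message,
treats "redundancies" as spellings that differ only in case (my
assumption, nothing more), and renders a plain-text list rather than an
actual web page. The names collect_keywords and render_index are
hypothetical.

```python
from collections import Counter


def collect_keywords(keyword_lists):
    """Count keywords across messages, merging case-only duplicates."""
    counts = Counter()
    spelling = {}  # lowercase form -> first spelling seen
    for keywords in keyword_lists:
        for kw in keywords:
            folded = kw.lower()
            spelling.setdefault(folded, kw)
            counts[folded] += 1
    # Alphabetical order (by lowercase form) makes the list easy to scan.
    return [(spelling[f], n) for f, n in sorted(counts.items())]


def render_index(entries):
    """Render the suggested-keyword list as plain text, one per line."""
    return "\n".join(f"{kw} ({n})" for kw, n in entries)


if __name__ == "__main__":
    per_message = [["IBIS", "Google"], ["google", "Graphs"]]
    print(render_index(collect_keywords(per_message)))
```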


> *4.1* BASIC KEYWORDS: I think that as a group we should come up with a 
> minimal set of three or four keywords that would give a general type to 
> the message.  For instance, I am thinking that messages which announce a 
> new type of software should be given the Software_Announce, or SA, (or 
> some variation) keyword.  The thing is that about 1/3 of the root 
> messages (not the followups) posted to this group announce software, and 
> it would be nice to be able to filter those out from the general 
> discussion.  Other basic keywords might include Document_Announce, 
> Conference_Announce, Fun_Announce, and Seeking_Software.    (012)


This is probably a good idea, but we could adopt an existing set. There
are many available, such as Google's and Yahoo's; OCML has one, and
Dublin Core does too.    (013)


> In theory there should be one basic keyword per message.  The purpose of 
> these keywords is to provide an intermediate level of specification 
> between the current thread structure, and the fine grain keywording to 
> be discussed next.  I envision using archives of messages aggregated by 
> these keywords to answer queries like "I just saw some cool software mentioned 
> recently, but I don't remember what it was".    (014)


Generally, the first keyword in a list would be considered the primary one.    (015)


> *4.2* FINE GRAINED KEYWORDS:  Besides basic keywords there can also be 
> fine grained keywords, such as IBIS, Google, Graphs, etc.  My suggestion 
> is that instead of wasting time arguing about these, we allow any user 
> to use any keyword.  New keywords will automatically be added to a database.    (016)


Yes.    (017)


> The idea is that we should eventually settle on some common keywords by 
> convention.  I am sure that there will be some tension here, but I think 
> that spreading the tension out over the first few weeks of use is better 
> than arguing about this stuff before we are even completely sure what we 
> are working with.  It is simpler to see how to categorize new messages, 
> than to have long arguments about how we should have categorized older 
> messages.    (018)


Look at the SUO list to see that this is a pointless battle. You shouldn't
try to legislate this at all. I think we should have a topic map engine
and some manual labour to provide synonym matching periodically, but I
would be against trying to force people to use a specific set. Nobody
would do it. Too much trouble, and unlikely to succeed, witness thousands
of years of history in this regard.    (019)
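
To illustrate only the synonym-matching half of that (this is a plain
lookup table, not a topic map engine), a hand-maintained mapping could
be applied when keywords are indexed. The table entries and the
function name canonical are made up for the example.

```python
# Hypothetical, hand-maintained table: lowercase variant -> canonical keyword.
SYNONYMS = {
    "topic maps": "TopicMaps",
    "topicmap": "TopicMaps",
    "kw": "Keywords",
}


def canonical(keyword, synonyms=SYNONYMS):
    """Map a keyword to its canonical form; unknown keywords pass through."""
    cleaned = keyword.strip()
    return synonyms.get(cleaned.lower(), cleaned)


if __name__ == "__main__":
    for kw in ["Topic Maps", "IBIS", " kw "]:
        print(repr(kw), "->", canonical(kw))
```

People keep typing whatever they like; only the index is normalized.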


> An idea that I had for keywords is that they could be placed in 
> hierarchies, for instance Google.API.  A message tagged with the 
> Google.API could be seen by viewing both "Google" and "Google.API", but 
> not the other way around.  This way, if you chose a subcategory which 
> other users did not agree on, your message would still be captured by 
> the parent category.    (020)


I guess I should have read this message all the way through before
responding....    (021)
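
For what it's worth, the hierarchy rule quoted above (a message tagged
Google.API shows up when viewing "Google" or "Google.API", but a
message tagged only Google does not show up under "Google.API") reduces
to a prefix test on dot-separated names. A minimal sketch, with the
function name matches being my own:

```python
def matches(view: str, tag: str) -> bool:
    """True if a message tagged `tag` should appear when viewing `view`."""
    # Require a "." after the prefix so that "Goo" does not match "Google".
    return tag == view or tag.startswith(view + ".")


if __name__ == "__main__":
    print(matches("Google", "Google.API"))   # parent view sees child tag
    print(matches("Google.API", "Google"))   # child view does not see parent tag
```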

[...]    (022)

> ==========
> 
> Ok?  So the actionable items are to come up with a list of basic 
> keywords, decide on the multi-word format, and figure out how to 
> implement this type of system.  I think that from a technical standpoint 
> this problem is not too bad at all.  And it would be a big help managing 
> the 200 or so messages that we receive each month.  Who besides Rod could 
> remember all that :)    (023)


My list would be: avoid the basic keywords unless we all think that, say,
Google's set is a good start (it's really not worth the fight); make the
multi-word format whitespace-delimited normal text, with commas between
keywords (just like every other keyword system I've ever seen). So
*really* it's a matter of implementation.    (024)

But a more meta question is: what exactly are the requirements, and
what are the benefits? Couldn't all this be done server-side on the
mail list archives, such that if one wanted to browse the archives
an intelligent search could remove the need for most of the effort?
I'm still not convinced that people would use it, insofar as it's
probably almost as much work to figure out an *appropriate* set of
keywords as it is to type an entire email message. Librarians are
*experts* at this. I'm not. There's a CMU system I used at NTTC that
could analyze a text and come up with a set of keywords for it. I'd
prefer we leave this kind of thing to computers (which are in general
pretty good at it, especially on longer texts).    (025)
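
As a toy illustration of leaving it to the computer (this is plain
word-frequency counting, not the CMU system mentioned above, whose
method I am not describing here), something like the following could
suggest keywords server-side. The stopword list is a tiny placeholder
and the name suggest_keywords is hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "be", "are", "we"}


def suggest_keywords(text, limit=5):
    """Return up to `limit` of the most frequent non-stopword terms."""
    words = re.findall(r"[a-z][a-z0-9_-]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:limit]]


if __name__ == "__main__":
    sample = ("Keywords help search. Keywords in the archive help; "
              "the archive is large.")
    print(suggest_keywords(sample, limit=3))
```

Raw frequency works better on longer texts, which matches the caveat
above; short messages give it little to go on.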

Not to be *too* much of a spoil-sport.    (026)

Murray    (027)

......................................................................
Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK    (028)

      In the evening
      The rice leaves in the garden
      Rustle in the autumn wind
      That blows through my reed hut.  -- Minamoto no Tsunenobu    (029)