
Re: [ba-ohs-talk] Keyword Indexing

>since that is actually creating new words and phrases. A comma is
>pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.    (01)

A friend and I are working on a Java-based keyword indexer at the moment and are
confronting a problem that seems like it must have been solved 10,000
times already.  The indexer must be capable of
indexing highly technical words.  Our area involves a lot of chemical notation,
which uses commas _not_ as delimiters but as word characters.  Like:    (02)

N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.      (03)

They can get more complicated.  We have not yet found an adequate solution,
but if anyone knows of any keyword parsing / tokenizing rule sets that have
gone through some iterations in development, let me know.  It seems like
the only real chance of tokenizing technical language properly is
some sort of dictionary lookup?  What a pain.    (04)

The following characters can be part of a single chemical notation: , [ ] ( ) + - 
(minus and dash)    (05)

So far, it looks like we will just lose a large portion of the chemical notations
to the indexer.  Another idea I've been toying with is tokenizing twice
and indexing both results.  So I would have a set of "always delimiters", which
would break words for both sets (space is always a delimiter), and "normal
delimiters", which would be things like the comma, brackets, and parentheses.      (06)

Create both lists of tokens for a given text block: a 'normal list' and
a 'technical list' (which is not split on the 'normal delimiters').
Remove from the 'technical list' any keys that do not contain any of the
'normal delimiter' characters, and then I have a list of 'normal keywords' and a list of
possible 'technical keywords'.      (07)
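
A minimal sketch of that two-pass idea in Java, to make the steps concrete.
The class and method names (DualTokenizer, normalTokens, technicalTokens) are
made up for illustration, whitespace is assumed to be the only "always
delimiter", and the "normal delimiter" set is just the characters listed
above.  It does not handle sentence punctuation stuck to a token (a trailing
comma or period), which a real indexer would need to deal with.

```java
import java.util.ArrayList;
import java.util.List;

class DualTokenizer {
    // Assumption: these are the "normal delimiters", i.e. characters that
    // split ordinary words but can also appear inside chemical notation.
    static final String NORMAL_DELIMITERS = ",[]()+-";

    // Split on whitespace (the "always delimiter") plus any extra delimiters.
    static List<String> split(String text, String extraDelimiters) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || extraDelimiters.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // The 'normal list': every delimiter breaks a word.
    static List<String> normalTokens(String text) {
        return split(text, NORMAL_DELIMITERS);
    }

    // The 'technical list': split on whitespace only, then keep just the
    // tokens that contain at least one normal-delimiter character.
    static List<String> technicalTokens(String text) {
        List<String> result = new ArrayList<>();
        for (String token : split(text, "")) {
            if (containsNormalDelimiter(token)) {
                result.add(token);
            }
        }
        return result;
    }

    static boolean containsNormalDelimiter(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (NORMAL_DELIMITERS.indexOf(token.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "N,N-dimethyltryptamine or 3,4-methylpropylamine or others";
        System.out.println("normal:    " + normalTokens(text));
        System.out.println("technical: " + technicalTokens(text));
    }
}
```

Each list comes from one linear scan of the text, so the cost is roughly two
passes over the input rather than anything worse.  Note that ordinary
hyphenated words also land in the technical list, which is why it is only a
list of _possible_ technical keywords.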

Sounds horrifyingly slow for the project I'm working on (a keyword indexer
for a spider), but it's the best I've come up with so far.     (08)

Any thoughts or ideas?    (09)

bcl    (010)