Re: [ba-ohs-talk] Keyword Indexing

I'm quite familiar with this problem. For commas, it is sufficient to simply
parse into words based on whitespace and em-dashes first, and then check for
punctuation only at the ends of words. The much trickier thing is periods...
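
A minimal Java sketch of that comma rule, for illustration only. The class
name, the choice of edge punctuation, and the treatment of "--" as a typed
em-dash are all assumptions, not something from a real indexer. Periods,
brackets and parentheses are deliberately left alone here, since those are the
trickier cases.

```java
import java.util.ArrayList;
import java.util.List;

class EdgeTrimTokenizer {
    // Assumed set of punctuation that is stripped only at the ends of a word.
    // Periods, brackets and parentheses are intentionally not included.
    private static final String EDGE_PUNCT = ",;:!?\"'";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on runs of whitespace, em-dashes (U+2014) or "--".
        for (String word : text.split("(?:[\\s\\u2014]|--)+")) {
            int start = 0;
            int end = word.length();
            // Step 2: drop punctuation from the front and back of the word only;
            // anything in the interior (e.g. the comma in "N,N-") is kept.
            while (start < end && EDGE_PUNCT.indexOf(word.charAt(start)) >= 0) {
                start++;
            }
            while (end > start && EDGE_PUNCT.indexOf(word.charAt(end - 1)) >= 0) {
                end--;
            }
            if (start < end) {
                tokens.add(word.substring(start, end));
            }
        }
        return tokens;
    }
}
```

With this, "N,N-dimethyltryptamine," loses only its trailing comma, while the
comma inside the name survives as part of the token.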
Kevin Keck
keck@kecklabs.com

on 2002/04/30 12:13 PM, blincoln at blincoln@ssesco.com wrote:

>> since that is actually creating new words and phrases. A comma is
>> pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.
> A friend and I are working on a Java-based keyword indexer at the moment and
> are confronting a problem which it seems must have been solved 10,000
> times already.  The indexer must be capable of indexing highly technical
> words.  Our area involves a lot of chemical notation which includes commas
> _not_ as delimiters but as word-chars, like N,N-dimethyltryptamine or
> 3,4-methylpropylamine or others of the sort.  They can get more complicated.
> We have not yet found any adequate solution, but if anyone knows of any
> keyword parsing / tokenizing rule sets that have gone through some iterations
> in development, let me know.  It seems like the only real chance for
> tokenizing technical language properly must be some sort of dictionary
> lookup?  What a pain.
> The following characters can be part of a single chemical notation:
> , [ ] ( ) + - (minus and dash)
> So far, it looks like we will just lose a large portion of the chemical
> notations to the indexer.  Another idea I've been toying with is tokenizing
> twice and indexing both results.  So I would have a set of "always
> delimiters" which would break words for both sets (space is always a
> delimiter), and "normal delimiters" which would be things like the comma,
> bracket, and parentheses.  Create both lists of tokens for a given text
> block: a 'normal list' and a 'technical list' (which is not tokenized with
> the 'normal delimiters').  Remove from the 'technical list' any keys that do
> not contain any of the technical delimiters, and then I have a list of
> 'normal keywords' and a list of possible 'technical keywords'.
> Sounds horrifyingly slow for the project I'm working on (a keyword indexer
> for a spider), but it's the best I've come up with so far.
> Any thoughts or ideas?
> bcl
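
For what it's worth, the two-list idea quoted above need not mean scanning the
text twice. The following is only a rough Java sketch under some assumptions:
the class and field names are made up, whitespace is the only "always
delimiter", and the "normal delimiters" are taken to be the characters listed
in the message ( , [ ] ( ) + - ). Each whitespace-separated chunk is walked
once; its pieces go to the normal list, and the whole chunk goes to the
technical list only if it contained one of those characters.

```java
import java.util.ArrayList;
import java.util.List;

class DualTokenizer {
    // Assumed "normal delimiters": the characters the message says may occur
    // inside a chemical notation.
    static final String NORMAL_DELIMS = ",[]()+-";

    static final class Result {
        final List<String> normal = new ArrayList<>();    // split on every delimiter
        final List<String> technical = new ArrayList<>(); // split on whitespace only
    }

    static Result tokenize(String text) {
        Result result = new Result();
        // Whitespace is the only "always delimiter" in this sketch.
        for (String chunk : text.split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // produced by leading whitespace
            }
            StringBuilder piece = new StringBuilder();
            boolean sawDelim = false;
            for (int i = 0; i < chunk.length(); i++) {
                char c = chunk.charAt(i);
                if (NORMAL_DELIMS.indexOf(c) >= 0) {
                    sawDelim = true;
                    flush(piece, result.normal); // end of a normal keyword
                } else {
                    piece.append(c);
                }
            }
            flush(piece, result.normal);
            // Keep the unsplit chunk only if it held a normal delimiter, i.e.
            // it is a possible technical keyword.
            if (sawDelim) {
                result.technical.add(chunk);
            }
        }
        return result;
    }

    private static void flush(StringBuilder piece, List<String> out) {
        if (piece.length() > 0) {
            out.add(piece.toString());
            piece.setLength(0);
        }
    }
}
```

For the text "index N,N-dimethyltryptamine now" this yields the normal
keywords index, N, N, dimethyltryptamine, now and the single technical
candidate N,N-dimethyltryptamine. Ordinary words with a trailing comma or a
hyphen would also land in the technical list, so trimming punctuation from the
ends of each chunk first, as the reply suggests, would probably cut down the
false candidates.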