Re: [ba-ohs-talk] Keyword Indexing
The home page for CML is at <http://www.xml-cml.org/>, where details of the
specification, software, examples, and users can be found.
Also, check out the Virtual Hypertext Glossary (VHG):
Screenshots of VHG in action
<http://www.vhg.org.uk/pub/xmldev/index.html>
Chemistry in hyperGlossaries
<http://www.vhg.org.uk/pub/metadata/chemical.html>
CML VirtualXML Concourse
<http://www.cmlconsulting.com/vxml/concourse/>
"N. Carroll" wrote:
> ----- Original Message -----
> From: blincoln <email@example.com>
> To: <firstname.lastname@example.org>
> Sent: Tuesday, April 30, 2002 12:13 PM
> Subject: Re: [ba-ohs-talk] Keyword Indexing
> > >since that is actually creating new words and phrases. A comma is
> > >pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.
> > A friend and I are working on a Java-based keyword indexer at the moment
> > and are confronting a problem which it seems like must have been solved
> > 10,000 times already.
> Chemical indexing has been going on awhile, e.g.
> For that matter, CML (Chemical Markup Language) *should*
> have addressed the issue long ago. Based on a bit of googling,
> the problem seems to be an embarrassment of riches -- too
> many methods, like
> Maybe someone at ASIS (www.asis.org) could tell you if there's
> now a standard syntax for making chem names retrievable?
> > It is a necessary requirement that the indexer be capable of
> > indexing highly technical words. Our area involves a lot of chemical
> > nomenclature, which includes commas _not_ as delimiters but as word-chars. Like:
> > N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
> > They can get more complicated. We have not yet found any adequate
> > solution, but if anyone knows of any keyword parsing / tokenizing rule sets
> > that have gone through some iterations in development, let me know. It seems
> > like the only real chance for tokenizing technical language properly must be
> > some sort of dictionary lookup? What a pain.
> > The following characters can be part of a single chemical notation:
> > , [ ] ( ) + - (minus and dash)
> > So far, it looks like we will just lose a large portion of the chemical
> > names to the indexer. Another idea I've been toying with is the idea of
> > tokenizing twice and indexing both results. So I would have a set of
> > "always delimiters" that would break words for both sets (space is always
> > a delimiter), and "normal delimiters", which would be things like the
> > comma, bracket, parentheses. Create both lists of tokens for a given text
> > block, creating a 'normal list' and a 'technical list' (which is not
> > tokenized with the 'normal delimiters'). Remove from the 'technical list'
> > any keys that do not contain any of the 'normal delimiters', and then I
> > have a list of 'normal keywords' and a list of possible 'technical keywords'.
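For what it's worth, the two-pass idea just described can be sketched in a
few lines. This is only a rough illustration in Python (not the Java the
indexer is written in), and everything named here is an assumption: the
function names, the exact delimiter sets, and in particular the step that
strips commas from the ends of a token, which is not in the description
above but keeps ordinary prose like "words," out of the technical list.

```python
# Assumed delimiter sets, loosely following the description above.
ALWAYS_DELIMS = " \t\r\n"      # break tokens in both passes
NORMAL_DELIMS = ",[]()+-"      # break tokens only in the 'normal' pass


def split_on(text, delims):
    """Split text on any character in delims, dropping empty tokens."""
    tokens, current = [], []
    for ch in text:
        if ch in delims:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def two_pass_tokenize(text):
    """Return (normal keywords, possible technical keywords)."""
    # Pass 1: break on every delimiter to get the ordinary keyword list.
    normal = split_on(text, ALWAYS_DELIMS + NORMAL_DELIMS)

    # Pass 2: break only on the always-delimiters.
    candidates = split_on(text, ALWAYS_DELIMS)

    technical = []
    for tok in candidates:
        # Assumption (not in the original description): drop commas at
        # the ends of a token so prose punctuation is not mistaken for
        # part of a chemical name.
        tok = tok.strip(",")
        # Keep only tokens that still contain a 'normal delimiter'.
        if any(ch in NORMAL_DELIMS for ch in tok):
            technical.append(tok)
    return normal, technical
```

Calling two_pass_tokenize() on a block of text gives back both lists, which
can then be indexed separately. Each pass is a single linear scan of the
text, so the cost should be roughly twice that of tokenizing once. Note that
ordinary hyphenated words (e.g. "java-based") will also land in the
technical list, so it really is a list of possible technical keywords, as
described.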
> > Sounds horrifyingly slow for the project I'm working on (a keyword indexer
> > for a spider), but it's the best I've come up with so far.
> > Any thoughts or ideas?
> > bcl
> Nicholas Carroll
> Travel: email@example.com
> "The hardest single part of building a software system
> is deciding precisely what to build." -- Frederick Brooks