
Re: [ba-ohs-talk] Keyword Indexing


The home page for CML is at < http://www.xml-cml.org/>, where details of the
specification, software, examples, and users can be found.    (01)

Also, check out the Virtual Hypertext Glossary (VHG):    (02)

Screenshots of VHG in action
<  http://www.vhg.org.uk/pub/xmldev/index.html >    (03)

Chemistry in hyperGlossaries
<  http://www.vhg.org.uk/pub/metadata/chemical.html >    (04)

CML VirtualXML Concourse
< http://www.cmlconsulting.com/vxml/concourse/ >    (05)


"N. Carroll" wrote:    (06)

> ----- Original Message -----
> From: blincoln <blincoln@ssesco.com>
> To: <ba-ohs-talk@bootstrap.org>
> Sent: Tuesday, April 30, 2002 12:13 PM
> Subject: Re: [ba-ohs-talk] Keyword Indexing
>
> > >since that is actually creating new words and phrases. A comma is
> > >pretty simple to type and is (in English) a _word_ or _phrase_delimiter_.
> >
> > A friend and I are working on a Java-based keyword indexer at the moment
> > and are confronting a problem which it seems like must have been solved
> > 10,000 times already.
>
> Chemical indexing has been going on for a while, e.g.
> www.garfield.library.upenn.edu/essays/V1p111y1962-73.pdf
> For that matter, CML (Chemical Markup Language) *should*
> have addressed the issue long ago. Based on a bit of googling,
> the problem seems to be an embarrassment of riches -- too
> many methods, like
> www.iee.org.uk/publish/support/inspec/document/ChemNum/stncni.pdf
> Maybe someone at ASIS (www.asis.org) could tell you if there's
> now a standard syntax for making chem names retrievable?
>
> > It is a requirement that the indexer be capable of indexing highly
> > technical words.  Our area involves a lot of chemical notation which
> > includes commas _not_ as delimiters but as word-chars.  Like:
> >
> > N,N-dimethyltryptamine or 3,4-methylpropylamine or others of the sort.
> >
> > They can get more complicated.  We have not yet found any adequate
> > solution, but if anyone knows of any keyword parsing / tokenizing rule
> > sets that have gone through some iterations in development, let me know.
> > It seems like the only real chance for tokenizing technical language
> > properly must be some sort of dictionary lookup?  What a pain.
> >
> > The following characters can be part of a single chemical notation:
> > , [ ] ( ) + - (minus and dash)
> >
> > So far, it looks like we will just lose a large portion of the chemical
> > notations to the indexer.  Another idea I've been toying with is
> > tokenizing twice and indexing both results.  So I would have a set of
> > "always delimiters" which would break words for both sets (space is
> > always a delimiter), and "normal delimiters" which would be things like
> > the comma, bracket, and parentheses.
> >
> > Create both lists of tokens for a given text block: a 'normal list' and
> > a 'technical list' (which is not tokenized with the 'normal delimiters').
> > Remove from the 'technical list' any keys that do not contain any of the
> > technical delimiters, and then I have a list of 'normal keywords' and a
> > list of possible 'technical keywords'.
> >
> > Sounds horrifyingly slow for the project I'm working on (a keyword indexer
> > for a spider), but it's the best I've come up with so far.
> >
> > any thoughts or ideas?
> >
> > bcl
> >
> --
> ________________________________
> Nicholas Carroll
> ncarroll@hastingsresearch.com
> Travel: ncarroll1000@yahoo.com
> http://www.hastingsresearch.com
> ________________________________
> "The hardest single part of building a software system
> is deciding precisely what to build." -- Frederick Brooks    (07)
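
The "tokenize twice" idea in the quoted message can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
indexer discussed in the thread: the class and method names are hypothetical,
the "normal delimiters" are taken to be just , [ ] ( ), the + and - characters
are left as word characters in both passes, and sentence punctuation at the
edges of a token is trimmed so that a trailing comma in running prose does not
make an ordinary word look technical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two-pass tokenizer described in the quoted message.
final class TwoPassTokenizer {

    // Assumed "normal delimiters": comma, brackets, parentheses.
    static final String NORMAL_DELIMS = ",[]()";

    // Trim sentence punctuation from both ends of a token (an assumption,
    // not something the quoted message specifies).
    static String stripEdgePunctuation(String token) {
        return token.replaceAll("^[,.;:]+|[,.;:]+$", "");
    }

    static boolean containsNormalDelim(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (NORMAL_DELIMS.indexOf(token.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Pass 1: split on whitespace AND the normal delimiters.
    static List<String> normalTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[\\s,\\[\\]()]+")) {
            String token = stripEdgePunctuation(raw);
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Pass 2: split on whitespace only (the "always delimiter"), then keep
    // only tokens that still contain a normal delimiter. These are the
    // *possible* technical keywords.
    static List<String> technicalTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = stripEdgePunctuation(raw);
            if (containsNormalDelim(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "N,N-dimethyltryptamine or 3,4-methylpropylamine, or others of the sort.";
        System.out.println("normal:    " + normalTokens(text));
        System.out.println("technical: " + technicalTokens(text));
    }
}
```

Each pass walks the text once, so the cost should be roughly twice that of a
single tokenizing pass rather than anything worse. The technical list will
still pick up false positives such as a parenthesized aside in ordinary prose,
which is why the quoted message calls them only "possible" technical keywords.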