[unrev-II] OHS Precursors for Unlocking the Information in Text?

From: John J. Deneen (jjdeneen@ricochet.net)
Date: Sun May 06 2001 - 00:07:35 PDT

    "Currently, building topic-specific search engines for text or hypertext
    documents involves programmers using toolkits consisting of various
    machine-learning-based techniques to classify and extract information
    from text. However, we believe that in the near future, WhizBang! Labs
    and others will make advances enabling users to create topic-specific
    search engines that support their particular interests "by example."
    This advance will allow nonprogrammers to construct systems to
    manipulate and analyze text, much as spreadsheets allow nonprogrammers
    to manipulate numeric data today."
    http://www.futureofsoftware.net/wc0010/wc0010.asp

    Interesting examples, but I can't find a link for more info about the
    Cora System. Does anyone have any experience using these applications?

       * Currently, no tool allows nonprogrammers to create topic-specific
         search engines on any topic of their choice. However, a number of
         precursor tools do exist. One is the Cora system. Cora
         automatically spiders, classifies, and extracts computer-science
         research papers from the Web. It automatically organizes papers
         into a taxonomy with 75 leaves, and various fields, such as author
         and title, are extracted from each paper. Additionally, the program
         extracts bibliographic information from each paper, allowing
         bibliometric analysis to be performed. Cora relies heavily on
         artificial intelligence and machine-learning techniques. It
         efficiently spiders for research papers using reinforcement
         learning, it automatically categorizes papers into the topic
         hierarchy by probabilistic techniques, and it automatically
         extracts papers' titles, authors, and references using hidden
         Markov models.

       * Another set of precursor tools consists of Web-based information
         integration systems, which collect information from several
         heterogeneous Web sites into a single database. Most require
         "wrappers" that must be custom-coded for each site they access,
         which limits how many sites they can cover; even so, such a system
         can cover a lot of ground. For example, the Whirl system
         provided interfaces to 50 different sites with approximately four
         man-months of development time, and additional experiments suggest
         the cost of "wrapping" data for a Whirl-like system can be reduced
         even further.
         <http://www.research.att.com/projects/whirl/doc/>
         <http://whirl.research.att.com/>

       * Our organization, WhizBang! Labs, has developed a set of tools that
         facilitate the creation of topic-specific search engines. We've
         recently fielded a commercial system using these tools:
         FlipDog.com, an online job board based on a database constructed by
         automatically extracting job postings directly from corporate Web
         pages. We created the FlipDog.com database primarily by applying
         general-purpose machine-learning techniques—techniques that could
         well be used to extract other sorts of databases from the Web.

       * FlipDog.com contains more than 550,000 jobs gathered from nearly
         50,000 different corporate Web sites. Operationally, constructing
         the FlipDog.com database is much like the process used in Cora.
         Sites are automatically spidered to find pages that contain job
         postings. Individual job postings are then extracted from these
         pages, and augmented with automatically extracted fields such as
         employer, job title, job description, and location. Job postings
         are also organized into a taxonomy to facilitate browsing.
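
    To make the "categorizes papers into the topic hierarchy by
    probabilistic techniques" step a little more concrete, here is a rough
    sketch of one such technique, a multinomial naive Bayes text
    classifier. The excerpt does not say which probabilistic method Cora
    or FlipDog.com actually uses, so treat this as an illustration only.
    The class name TopicClassifier, the topic labels, and the training
    snippets are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real system would do more."""
    return text.lower().split()


class TopicClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Hypothetical illustration; not the method used by any system
    named in the quoted article.
    """

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per topic
        self.word_counts = defaultdict(Counter)  # word counts per topic
        self.vocab = set()                       # all words seen in training

    def train(self, text, topic):
        """Add one labeled example document."""
        self.doc_counts[topic] += 1
        for word in tokenize(text):
            self.word_counts[topic][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        """Return the topic with the highest log-posterior score."""
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]
        best_topic, best_score = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            # log prior: fraction of training documents in this topic
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[topic].values())
            for word in words:
                count = self.word_counts[topic][word]
                # smoothed log likelihood of the word given the topic
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# "By example" training: a handful of labeled snippets (invented data).
clf = TopicClassifier()
clf.train("hidden markov models for speech recognition", "machine-learning")
clf.train("reinforcement learning for robot control", "machine-learning")
clf.train("query optimization in relational databases", "databases")
clf.train("transaction processing and database recovery", "databases")

print(clf.classify("learning markov models"))
print(clf.classify("relational database query"))
```

    The appeal for the "by example" idea is that the only input is a set
    of labeled documents: adding a topic means adding examples, not code.
    With a 75-leaf taxonomy one would need far more training text per
    leaf than this toy uses.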


