"Currently, building topic-specific search engines for text or hypertext
documents involves programmers using toolkits consisting of various
machine-learning-based techniques to classify and extract information
from text. However, we believe that in the near future, WhizBang! Labs
and others will make advances enabling users to create topic-specific
search engines that support their particular interests "by example."
This advance will allow nonprogrammers to construct systems to
manipulate and analyze text, much as spreadsheets allow nonprogrammers
to manipulate numeric data today."
http://www.futureofsoftware.net/wc0010/wc0010.asp
Interesting examples, but I can't find a link for more info about the
Cora System. Does anyone have any experience using these applications?
* Currently, no tool allows nonprogrammers to create topic-specific
search engines on any topic of their choice. However, a number of
precursor tools do exist. One is the Cora system. Cora
automatically spiders, classifies, and extracts computer-science
research papers from the Web. It automatically organizes papers
into a taxonomy with 75 leaves, and various fields, such as author
and title, are extracted from each paper. Additionally, the program
extracts bibliographic information from each paper, allowing
bibliometric analysis to be performed. Cora relies heavily on
artificial intelligence and machine-learning techniques. It
efficiently spiders for research papers using reinforcement
learning, it automatically categorizes papers into the topic
hierarchy by probabilistic techniques, and it automatically
extracts papers' titles, authors, and references using hidden
Markov models.
* Another set of precursor systems is Web-based information
integration systems, which collect the information from several
heterogeneous Web sites into a single database. Most of these
systems require "wrappers" that must be custom-coded for each site
they access, which limits the number of sites they can use; even so,
such a system can cover a lot of ground. For example, the Whirl system
provided interfaces to 50 different sites with approximately four
man-months of development time, and additional experiments suggest
the cost of "wrapping" data for a Whirl-like system can be reduced
even further. <http://www.research.att.com/projects/whirl/doc/>
<http://whirl.research.att.com/>
* Our organization, WhizBang! Labs, has developed a set of tools that
facilitate the creation of topic-specific search engines. We've
recently fielded a commercial system using these tools:
FlipDog.com, an online job board based on a database constructed by
automatically extracting job postings directly from corporate Web
pages. We created the FlipDog.com database primarily by applying
general-purpose machine-learning techniques—techniques that could
well be used to extract other sorts of databases from the Web.
* FlipDog.com contains more than 550,000 jobs gathered from nearly
50,000 different corporate Web sites. Operationally, constructing
the FlipDog.com database is much like the process used in Cora.
Sites are automatically spidered to find pages that contain job
postings. Individual job postings are then extracted from these
pages, and augmented with automatically extracted fields such as
employer, job title, job description, and location. Job postings
are also organized into a taxonomy to facilitate browsing.
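
To make the classification step a bit more concrete: the excerpt says Cora sorts papers into its topic hierarchy "by probabilistic techniques" but does not say which ones. One common probabilistic text classifier is multinomial naive Bayes, so here is a minimal sketch under that assumption. The topic names, the training snippets, and the function names train and classify are all made up for illustration and are not taken from Cora or FlipDog.com.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per topic. labeled_docs is a list of (topic, text) pairs."""
    word_counts = defaultdict(Counter)  # topic -> word -> count
    doc_counts = Counter()              # topic -> number of training docs
    vocab = set()
    for topic, text in labeled_docs:
        words = text.lower().split()
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the topic with the highest log-posterior under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in doc_counts:
        # Log prior: fraction of training documents carrying this topic.
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing so unseen topic/word pairs
            # do not drive the probability to zero.
            count = word_counts[topic][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical toy training data; a real system would use many labeled pages.
TRAINING = [
    ("machine_learning", "hidden markov models reinforcement learning classifier"),
    ("machine_learning", "probabilistic learning from labeled examples"),
    ("databases", "query optimization relational database index"),
    ("databases", "transaction processing database storage"),
]
MODEL = train(TRAINING)

if __name__ == "__main__":
    print(classify("reinforcement learning for spidering", MODEL))
    print(classify("relational database index", MODEL))
```

The "by example" idea from the first quote would map onto the TRAINING list here: a nonprogrammer supplies labeled examples, and the counting and scoring stay the same whatever the topic.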