Lucene is an open-source, tunable indexing platform often used for full-text indexing of Web sites. It implements an inverted index, creating a posting list for each term of the vocabulary. This chapter proposes some exercises to explore the Lucene platform and test its features through its Java API.
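To make the notion of a posting list concrete before turning to Lucene itself, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation: the class TinyInvertedIndex and its methods add and lookup are hypothetical names chosen for this illustration, the tokenizer is deliberately naive, and each posting list stores only document identifiers, whereas a real index typically also keeps information such as term frequencies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy class for illustration only; it is not part of the Lucene API.
class TinyInvertedIndex {
    // Vocabulary term -> posting list (ids of the documents containing the term).
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Splits the text on non-alphanumeric characters and records docId in the
    // posting list of each term. Assumes documents are added in increasing
    // docId order, which keeps every posting list sorted.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() may yield an empty leading token
            }
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Avoid duplicate entries when a term occurs several times in one document.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Returns the posting list of a term, or an empty list for an unknown term.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene builds an inverted index");
        index.add(2, "An inverted index maps terms to documents");
        index.add(3, "Lucene ranks documents");

        System.out.println("lucene    -> " + index.lookup("lucene"));    // [1, 3]
        System.out.println("inverted  -> " + index.lookup("inverted"));  // [1, 2]
        System.out.println("documents -> " + index.lookup("documents")); // [2, 3]
        System.out.println("missing   -> " + index.lookup("missing"));   // []
    }
}
```

Answering a single-keyword query then amounts to one lookup in the map; the sketch leaves out everything that makes a real engine useful, such as text analysis, ranking, and on-disk storage.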
PRELIMINARY: A LUCENE SANDBOX
We provide a simple graphical interface that lets you capture a collection of Web documents (from a given Web site), index it, and search for documents matching a keyword query. The tool is implemented with Lucene (surprise!) and helps you assess the impact of the search parameters, including ranking factors.
You can download the program from our Web site. It consists of a Java archive that can be executed right away (provided you have a decent Java installation on your computer). Figure 17.1 shows a screenshot of the main page. It allows you to:
Download a set of documents collected from a given URL (including local addresses),
Index and query those documents,
Consult the information used by Lucene to present ranked results.
Use this tool as a first contact with full-text search and information retrieval. The projects proposed at the end of the chapter suggest how to build a similar application.
INDEXING PLAIN TEXT WITH LUCENE – A FULL EXAMPLE
We now embark on a practical experiment with Lucene. First, download the Java packages from the Web site http://lucene.apache.org/java/docs/.