treeing

Crawl, index and search web content

Java

h1. Overview

Treeing a simple Java project that demonstrates Website crawling, indexing and searching using Apache Lucene. This project provides fundamentals for building website search engines.

h1. Usage

Treeing provides a simple set of classes, including a multi-threaded web crawler and a simple HTML parser, and is very easy to embed in Java application.

h2. Crawling and indexing websites

Below is an example of how to run the website crawler that crawls and indexes the web content.

                                                                           
                                                                         
  Thread t = new Thread( new WebCrawler( "http://www.python.org", 2, "c:/test/luc" ) );
  t.start();
  t.join();

The above code snippet will crawl and index the entire web site.

h2. Searching the index

Once the index is created it can be searched by using Lucene API. See the test case provided with the source for more details.

h1. Resources

“Apache Lucene”:http://lucene.apache.org/java/docs/index.html
“Crawl, index and search”:http://www.xeustechnologies.org/kamran/2011/03/29/crawl-index-and-search/