Crawl, index and search web content
h1. Overview
Treeing a simple Java project that demonstrates Website crawling, indexing and searching using Apache Lucene. This project provides fundamentals for building website search engines.
h1. Usage
Treeing provides a simple set of classes, including a multi-threaded web crawler and a simple HTML parser, and is very easy to embed in Java application.
h2. Crawling and indexing websites
Below is an example of how to run the website crawler that crawls and indexes the web content.
Thread t = new Thread( new WebCrawler( "http://www.python.org", 2, "c:/test/luc" ) );
t.start();
t.join();
The above code snippet will crawl and index the entire web site.
h2. Searching the index
Once the index is created it can be searched by using Lucene API. See the test case provided with the source for more details.
h1. Resources
“Apache Lucene”:http://lucene.apache.org/java/docs/index.html
“Crawl, index and search”:http://www.xeustechnologies.org/kamran/2011/03/29/crawl-index-and-search/