Java web crawling library
Smart web crawler.
A smart web crawler that fetches a website's pages and stores them in a configurable way (e.g. writes them to files on disk, POSTs them to an HTTP endpoint, etc.).
More crawling options:
crawl the links from a sitemap.xml
crawl the website as a graph, starting from a given URL (the index)
crawl with retries if a RuntimeException occurs
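The retry option is essentially a decorator around a crawl. The sketch below is self-contained and illustrative only: the interface and class names mirror the library's vocabulary (WebCrawl, RetriableCrawl), but the real signatures may differ, so consult the library's javadoc before relying on them.

```java
// Illustrative sketch of retrying a crawl on RuntimeException.
// NOT the actual charles API -- names are assumptions modeled on the docs.
import java.util.concurrent.atomic.AtomicInteger;

interface WebCrawl {
    void crawl();
}

/** Decorator that retries the wrapped crawl when it throws a RuntimeException. */
class RetriableCrawl implements WebCrawl {
    private final WebCrawl origin;
    private final int retrials;

    RetriableCrawl(WebCrawl origin, int retrials) {
        this.origin = origin;
        this.retrials = retrials;
    }

    @Override
    public void crawl() {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= this.retrials; attempt++) {
            try {
                this.origin.crawl();
                return; // succeeded, stop retrying
            } catch (RuntimeException ex) {
                last = ex; // remember the failure and try again
            }
        }
        throw last; // all attempts failed
    }
}

public class RetryDemo {
    /** Runs a crawl that fails twice, then succeeds; returns attempts used. */
    public static int attemptsUntilSuccess() {
        AtomicInteger attempts = new AtomicInteger(0);
        WebCrawl flaky = () -> {
            if (attempts.incrementAndGet() < 3) {
                throw new RuntimeException("transient failure");
            }
        };
        new RetriableCrawl(flaky, 5).crawl();
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println("succeeded after " + attemptsUntilSuccess() + " attempts");
    }
}
```

The decorator pattern keeps retry logic separate from the crawl itself, so any crawl (sitemap, graph, etc.) can be wrapped the same way.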
More details in this post.
Get it using Maven:
<dependency>
  <groupId>com.amihaiemil.web</groupId>
  <artifactId>charles</artifactId>
  <version>1.1.1</version>
</dependency>
or take the fat jar.
Charles is powered by Selenium WebDriver.
Any WebDriver implementation can be used to build a WebCrawl.
Examples:
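As one self-contained sketch of the overall shape: a crawl visits pages and hands them to a repository, which decides how to store them. The interfaces below (WebCrawl, Repository, SimpleCrawl) only mirror the library's vocabulary; the real classes and signatures may differ, so check the javadoc for the actual API.

```java
// Illustrative sketch only -- NOT the actual charles API.
// A crawl fetches pages and exports them through a pluggable Repository.
import java.util.ArrayList;
import java.util.List;

interface Repository {
    void export(List<String> pages);
}

interface WebCrawl {
    void crawl();
}

/** "Crawls" a fixed list of URLs and hands the pages to the repository. */
class SimpleCrawl implements WebCrawl {
    private final List<String> urls;
    private final Repository repo;

    SimpleCrawl(List<String> urls, Repository repo) {
        this.urls = urls;
        this.repo = repo;
    }

    @Override
    public void crawl() {
        List<String> fetched = new ArrayList<>();
        for (String url : this.urls) {
            // a real crawl would render the page with a Selenium WebDriver here
            fetched.add("contents of " + url);
        }
        this.repo.export(fetched);
    }
}

public class CrawlDemo {
    public static List<String> run() {
        List<String> stored = new ArrayList<>();
        WebCrawl crawl = new SimpleCrawl(
            List.of("http://example.com/", "http://example.com/about"),
            stored::addAll // a repository that just collects pages in memory
        );
        crawl.crawl();
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(CrawlDemo.run());
    }
}
```

Because the storage step sits behind an interface, the same crawl can write to disk, POST to an HTTP endpoint, or anything else, by swapping the repository.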
Since it uses a web driver to render the pages, dynamic content is also crawled (e.g. content generated by JavaScript).
Read this post.
Make sure the Maven build
$ mvn clean install -Dgoogle.chrome={path/to/chrome} -Pitcases
passes before making a PR.
Integration tests are performed with Google Chrome running in headless mode, so Google Chrome must be version >= 59, which introduced headless mode. You also need chromedriver installed for everything to work.
You can skip the integration tests by omitting -Pitcases
from the build command.