Serritor

Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render data.

Using Serritor in your build

Maven

Add the following dependency to your pom.xml:

<dependency>
    <groupId>com.github.peterbencze</groupId>
    <artifactId>serritor</artifactId>
    <version>2.1.1</version>
</dependency>

Gradle

Add the following dependency to your build.gradle:

compile group: 'com.github.peterbencze', name: 'serritor', version: '2.1.1'
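
Note that the compile configuration is deprecated in newer Gradle versions and was removed in Gradle 7. On recent Gradle versions, the equivalent declaration would use implementation instead:

implementation group: 'com.github.peterbencze', name: 'serritor', version: '2.1.1'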

Manual dependencies

The standalone JAR files are available on the releases page.

Documentation

  • The Wiki contains usage information and examples
  • The Javadoc provides the full API reference

Quickstart

The Crawler abstract class provides a skeletal implementation of a crawler to minimize the effort required to create your own. The extending class should implement the logic of the crawler.

Below you can find a simple example that is enough to get you started:

public class MyCrawler extends Crawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // A helper class that is intended to make it easier to find URLs on web pages
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onResponseSuccess(final ResponseSuccessEvent event) {
        // Crawl every URL found on the page
        urlFinder.findAllInResponse(event.getCompleteCrawlResponse())
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}

By default, the crawler uses the HtmlUnit headless browser:

// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start crawling with HtmlUnit
crawler.start();

Of course, you can also use other browsers. Currently, Chrome and Firefox are supported.

// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start crawling with Chrome
crawler.start(Browser.CHROME);
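
Firefox works the same way; assuming the Browser enum exposes a FIREFOX constant alongside CHROME (and that the matching driver binary is installed), only the argument passed to start changes:

// Start crawling with Firefox
crawler.start(Browser.FIREFOX);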

That’s it! In just a few lines you can create a crawler that crawls every link it finds, while
filtering duplicate and offsite requests. You also get access to the WebDriver, so you can use
all the features that are provided by Selenium.
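
As a minimal sketch of what that access can look like, assuming the CompleteCrawlResponse passed to the event exposes the underlying WebDriver through a getWebDriver() accessor (the exact method name is an assumption; check the Javadoc), you could call any Selenium API from a callback:

@Override
protected void onResponseSuccess(final ResponseSuccessEvent event) {
    // Assumption: the crawl response exposes the Selenium WebDriver;
    // verify the accessor name against the Javadoc
    WebDriver driver = event.getCompleteCrawlResponse().getWebDriver();

    // From here on, the full Selenium API is available
    String title = driver.getTitle();
    List<WebElement> links = driver.findElements(By.tagName("a"));
}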

Special thanks

Thanks to JetBrains for providing a free Open Source license to support the development of this project.

Support

If this framework helped you in any way, or you would like to support the development:

Support via PayPal

Any amount you choose to give will be greatly appreciated.

License

The source code of Serritor is made available under the Apache License, Version 2.0.