Top Java Frameworks & Libraries for web crawling

WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web; you can set up a multi-threaded web crawl...

Apache Nutch is an extensible and scalable web crawler

Continuous scalable web crawler built on top of Flink and crawler-commons

A fast and powerful scraping and web crawling framework.

The information system chosen for the project was a stock investment management website providing live prices, historical data, news articles, etc., and also basic a...

This is a Java library that can be used to crawl the content of some web properties (for example, www.salesforce.com and blogs.salesforce.com). It supports dynamic...

WebHunger is an extensible, full-scale crawler framework that supports distributed crawling, aiming at getting users focused on web page parsing without concerning...

Spider4j is an open source web crawler for Java, extended from webmagic, that provides a simple interface for crawling the Web. Using it, you can set up a multi-thread...

Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to rend...

We will process unstructured data from the web (obtained by crawling some sample websites), possibly by having an Apache Solr installation locally and manually feeding it...

A tweet analyzer capable of performing a wide range of tasks such as identification, crawling, sentiment analysis, co-occurrence analysis, web scraping, predictions...

Hubs is a content crawler application for Android. It provides APIs to crawl web content and display data....

Sample MVP project that uses a jsoup-web-crawl-like API

An Android application that crawls the web

Crawl, index and search web content

Web crawler that supports full-page-render crawling using HtmlUnit

A version of the Bixo web mining project that uses Storm to do continuous crawling.

Visualizing the behaviour of semantic web crawl algorithms.

This is a web application project for my assignment from FPT University. It's about crawling pieces of data about SIM cards from certain websites, transforming...

Google Search Bot is a Telegram bot project that searches from the Web. It crawls search results from Google and passes the first 50 results to the Telegram bot as...

This spider can crawl a website and return a clean form of the content on the website/page in a nice web of HTTP responses; it also includes PDF text extraction an...

Apache Fluo application that creates a web index using Common Crawl data

Distributed web crawling and indexing using Hadoop

A utility to crawl websites and import their pages and links as nodes and relationships into a Neo4j graph database...

A library for crawling websites and harvesting the URLs of embedded links and images

It crawls. It does metadata. It stores most of it.

QuickLB - Easy-to-use TCP load balancer. QuickLB is a free, fast and reliable solution offering high availability, load balancing, and proxying for TCP-based applic...

This repository contains the source code of VSearch, a vertical search tool to crawl the Deep Web by using a unified Web Query Interface (WQI)...

A Java application that crawls a specified seed web page, builds an inverted index, and starts a search engine for a website that provides services including: weig...