:sunrise: next generation web crawling using machine intelligence
sky is a web scraping framework, implemented with the latest Python versions in mind (3.5+). It is built on the asynchronous asyncio
framework, as well as many popular modules and extensions.
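To illustrate why asyncio suits crawling, here is a minimal sketch of fetching several pages concurrently while capping the number of requests in flight. It is not sky's own code: `fetch`, `crawl`, `run_crawl` and `max_concurrency` are hypothetical names, the network call is simulated with `asyncio.sleep`, and `asyncio.run` assumes Python 3.7 or newer.

```python
import asyncio


async def fetch(url, delay=0.01):
    # Stand-in for a real HTTP request (assumption): sleeping hands control
    # back to the event loop, just as waiting on a socket would.
    await asyncio.sleep(delay)
    return {"url": url, "status": 200}


async def crawl(urls, max_concurrency=2):
    # The semaphore limits how many fetches run at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


def run_crawl(urls):
    # Synchronous entry point; asyncio.run requires Python 3.7+.
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    for result in run_crawl(["http://example.com/a", "http://example.com/b"]):
        print(result["url"], result["status"])
```

In a real crawler the body of `fetch` would be replaced by an asynchronous HTTP client call; the concurrency pattern stays the same.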
Most importantly, it aims for next generation web crawling, where machine intelligence is used to speed up development and maintenance and to improve the reliability of crawling.
It mainly does this by considering the user to be interested in content from domains, not just a collection of single pages (templating approach).
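One way to picture the domain-level idea is the rough sketch below: text blocks that recur on most pages of a domain are treated as template (navigation, footers), and whatever remains is treated as page content. This is only an illustration of the concept, not sky's actual algorithm; the function names and the `min_share` threshold are assumptions.

```python
from collections import Counter


def split_blocks(page_text):
    # Treat every non-empty line as one text block.
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def find_template_blocks(pages, min_share=0.8):
    # Count on how many pages each block occurs (at most once per page).
    counts = Counter()
    for page in pages:
        counts.update(set(split_blocks(page)))
    threshold = min_share * len(pages)
    # Blocks shared by most pages of the domain are assumed to be template.
    return {block for block, n in counts.items() if n >= threshold}


def extract_content(page_text, template_blocks):
    # Keep only the blocks that are specific to this page.
    return [b for b in split_blocks(page_text) if b not in template_blocks]
```

Given a handful of pages from the same news site, `find_template_blocks` would pick up the shared menu and footer lines, and `extract_content` would return only the article-specific lines of each page.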
See it live in action with a news website YOU propose:
Note that the following is only meant as a demo of some kind of app that could be built upon the scraping framework.
Make no mistake: the goal is to provide a smart-scraper, not some ugly UI.
Run:
pip3 install -U sky
sky view
at the command line (use -port PORT to change the port).
The demo uses a standard configuration that can easily be improved on when setting up a project.
Similar data (title, body, publish_date, images etc) will be very easy to use in your own applications.
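As a rough sketch of what using such data might look like, the snippet below summarizes one scraped record. The record layout (a plain dict with `title`, `body`, `publish_date` and `images` keys, and a `YYYY-MM-DD` date string) is an assumption made for illustration, not a documented sky output format; `summarize` is a hypothetical helper.

```python
from datetime import datetime


def summarize(record):
    # Assumes publish_date is a "YYYY-MM-DD" string (illustrative assumption).
    published = datetime.strptime(record["publish_date"], "%Y-%m-%d")
    return {
        "title": record["title"],
        "year": published.year,
        "word_count": len(record["body"].split()),
        "image_count": len(record["images"]),
    }


if __name__ == "__main__":
    # Made-up sample record, for demonstration only.
    sample = {
        "title": "Example headline",
        "body": "Some example body text",
        "publish_date": "2016-01-31",
        "images": ["lead.jpg"],
    }
    print(summarize(sample))
```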
These are the features/goals of sky. Items with a checkmark have been accomplished:
Use pip to install sky:
pip3 install -U sky
This will install only the required packages. Storing data on the local system does not require any other packages.
To store data, the following optional backends are currently available: elasticsearch, cloudant and ZODB.
To set up a project/crawling service, visit this readme for a “Getting started”.
It is very much appreciated if you’d like to contribute in one or more of the following areas:
By considering crawl content to originate from a domain, rather than individual pages, the following will be possible: