Tool to extract web pages from warc.gz files and write their content out as documents. Each line of the output file contains one document.
# How to use
$ python warcParser.py PATH_DATASET INITIAL_FILE_COUNTER
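
The one-document-per-line output format can be illustrated with a minimal sketch. This is not taken from warcParser.py: the function names `to_single_line` and `write_documents` and the sample page texts are hypothetical. It assumes the page text has already been extracted, and that newlines inside a page are collapsed so that each document fits on a single line.

```python
import io
import re


def to_single_line(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def write_documents(documents, out):
    """Write each non-empty document to `out` on its own line.

    Returns the number of documents written.
    """
    count = 0
    for doc in documents:
        line = to_single_line(doc)
        if line:  # skip pages that contain only whitespace
            out.write(line + "\n")
            count += 1
    return count


# Hypothetical page contents standing in for text extracted from WARC records.
pages = ["First page\nwith two lines.", "   ", "Second\tpage."]
buffer = io.StringIO()
written = write_documents(pages, buffer)
```

Collapsing whitespace before writing is what lets a reader of the output file treat each line as exactly one document, for example by iterating over the file line by line.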