misc warc_extractor

A tool that extracts web pages from warc.gz files and writes their content out as documents. Each line of the output file contains one document.

Python

# How to use

$ python warcParser.py PATH_DATASET INITIAL_FILE_COUNTER
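The source of `warcParser.py` is not shown here, so the following is only a minimal sketch of the general idea: read records from a `warc.gz` file and write one document per line. It uses only the standard library. The function names (`iter_warc_records`, `to_one_line`, `extract`) are hypothetical and are not taken from this repository. The sketch assumes each `response` record holds an HTTP response, and it keeps the HTML markup rather than stripping it.

```python
import gzip
import re


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a decompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def to_one_line(payload):
    """Drop the HTTP header block and collapse whitespace so the body fits on one line."""
    # Assumption: the payload is an HTTP response (headers, blank line, body).
    _, _, body = payload.partition(b"\r\n\r\n")
    return re.sub(r"\s+", " ", body.decode("utf-8", "replace")).strip()


def extract(warc_gz_path, out_path):
    """Write one line per 'response' record and return how many were written."""
    count = 0
    with gzip.open(warc_gz_path, "rb") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for headers, payload in iter_warc_records(src):
            if headers.get("warc-type") == "response":
                dst.write(to_one_line(payload) + "\n")
                count += 1
    return count
```

Collapsing every run of whitespace to a single space is what guarantees that a page never spans more than one output line. For real crawl data, a dedicated WARC library such as `warcio` is likely a more robust choice than this hand-written parser.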