I have a Python 3 crawler that connects to target sites and saves all HTML and resources. Although I compress everything with gzip before saving, it still consumes too much space, and I usually hit my configured space limit before even half of a website's pages have been crawled.
The point is that all pages of the same website share a lot of common strings (some websites even include resources like CSS inline in every HTML page instead of linking to them). So my idea is to store those common strings only once per website. I assumed this kind of optimization would be documented somewhere, but I couldn't find anything about it.
Although I have the idea, I don't know how to implement this kind of algorithm; a rough sketch of what I'm imagining is below. Any help would be appreciated.
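To make the idea concrete, this is roughly the direction I have in mind, using the `zdict` preset-dictionary parameter of the standard `zlib` module. The function names and the choice of reusing the first saved page of a site as the dictionary are just my own guesses, not something I found documented:

```python
import zlib

def compress_page(page_bytes: bytes, site_dict: bytes) -> bytes:
    # zdict is a preset dictionary: byte sequences expected to occur in the
    # data.  Matches against it become back-references, so shared boilerplate
    # (inlined CSS, headers, navigation) is not stored again for every page.
    comp = zlib.compressobj(level=9, zdict=site_dict)
    return comp.compress(page_bytes) + comp.flush()

def decompress_page(blob: bytes, site_dict: bytes) -> bytes:
    # The exact same dictionary must be supplied when decompressing.
    decomp = zlib.decompressobj(zdict=site_dict)
    return decomp.decompress(blob) + decomp.flush()

# Example: use the first page saved for a site as the shared dictionary.
# zlib only looks at the last 32 KiB of the dictionary, so it is truncated.
first_page = b"<html><head><style>/* big inlined css */</style></head>..."
next_page = b"<html><head><style>/* big inlined css */</style></head>...more"

site_dict = first_page[-32768:]
blob = compress_page(next_page, site_dict)
assert decompress_page(blob, site_dict) == next_page
```

I'm not sure whether keeping one dictionary per site like this is the right approach, or whether there is a standard technique or library for it.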
from How to save storage in python crawler (common strings)