I have two MongoDB databases. One, urls, is used by a spider that gathers URLs; it is quite large and mostly just contains URLs. The second, posts, is used by a program that scans those URLs and generates a report for each one.
The code I'm currently using in the second script checks whether each url in the urls database is already in the posts database. If the posts db does not contain the url, the program still needs to generate a report for it. If it already exists, we skip it.
Here is the database loop:
for document in urls.find():
    # take the part of the url after the first dot
    url = document['url'].split('.')[1]
    # check whether a report for this url already exists in posts
    if posts.find({'url': url}).count() == 0:
        print(url, " url not found in posts, generating a new report")
        try:
            get_report(url, posts)
            ...
At first this seemed like a simple solution. However, now that the posts db has been populated with over 50,000 reports, this loop takes hours to even get going.
Is there a faster / more efficient way to perform this loop? I'm using python3 with pymongo.
Additionally, the script is now crashing with a pymongo.errors.CursorNotFound: cursor id '…' error. I believe this means I need to set the batch size to a lower value. However, this only reinforces my suspicion that something about this loop is extremely inefficient.
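For what it's worth, this is roughly the batch-size change I have in mind, using pymongo's no_cursor_timeout option and Cursor.batch_size(); I haven't confirmed it actually fixes the CursorNotFound error, and it wouldn't address the underlying inefficiency:

cursor = urls.find(no_cursor_timeout=True).batch_size(100)
try:
    for document in cursor:
        # same per-document check against posts as in the loop above
        ...
finally:
    # cursors opened with no_cursor_timeout must be closed explicitly
    cursor.close()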