I have two MongoDB databases. One, urls, is used by a spider that gathers URLs; it is quite large and mostly just contains URLs. The second, posts, is used by a program that scans those URLs and generates a report for each one.
The code I'm currently using in the second script checks whether each url in the urls database is already in the posts database. If the posts db does not contain the url, the program still needs to generate a report for it. If it already exists, we skip it.
Here is the database loop:
for document in urls.find():
    # take the part of the url after the first dot
    url = document['url'].split('.')[1]
    # check whether a report for this url already exists in posts
    if posts.find({'url': url}).count() == 0:
        print(url, " url not found in posts, generating a new report")
        try:
            get_report(url, posts)
            ...
At first this seemed like a simple solution. However, now that the posts db has been populated with over 50,000 reports, this loop takes hours to even get going.
Is there a faster / more efficient way to perform this loop? I'm using python3 with pymongo.
Additionally, the script is now crashing with a pymongo.errors.CursorNotFound: cursor id '…' error. I believe this means I need to set the batch size to a lower value. However, this only reinforces my suspicion that something about this loop is extremely inefficient.
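For what it's worth, this is roughly the batch-size change I have in mind, using pymongo's no_cursor_timeout option and Cursor.batch_size(); I haven't confirmed it actually fixes the CursorNotFound error, and it wouldn't address the underlying inefficiency:

cursor = urls.find(no_cursor_timeout=True).batch_size(100)
try:
    for document in cursor:
        # same per-document check against posts as in the loop above
        ...
finally:
    # cursors opened with no_cursor_timeout must be closed explicitly
    cursor.close()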