Thursday, 29 August 2019

Optimize speed when comparing fields from one MongoDB to another

I have two MongoDB databases. One, urls, is populated by a spider that gathers URLs; it is quite large and mostly just contains URLs. A second database, posts, is used by a program that scans those URLs and generates a report for each one.

The second script currently checks whether each URL from the urls database already exists in the posts database. If posts does not contain the URL, the program still needs to generate a report for it; if it does, the URL is skipped.

Here is the database loop:

for document in urls.find():
    url = document['url'].split('.')[1]

    if posts.find({'url': url}).count() == 0:
        print(url, " url not found in posts, generating a new report")

        try:
            get_report(url, posts)
        ...

At first this seemed like a simple solution. However, now that the posts database has been populated with over 50,000 reports, this loop takes hours to run.

Is there a faster / more efficient way to perform this check? I'm using Python 3 with pymongo.
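One pattern worth trying (a sketch, not a tested drop-in: the helper name find_missing_urls is made up, and it assumes the same 'url' field and split('.') extraction as the loop above) is to fetch the existing URLs from posts once, keep them in a Python set, and do the membership test in memory instead of issuing one find() per document:

```python
def find_missing_urls(url_docs, existing_urls):
    """Yield the URLs from url_docs that are not already in existing_urls.

    url_docs      -- iterable of documents shaped like {'url': 'www.example.com'}
    existing_urls -- set of URLs already stored in the posts collection
    """
    for document in url_docs:
        url = document['url'].split('.')[1]  # same extraction as the original loop
        if url not in existing_urls:
            yield url

# With pymongo, this turns one query per document into a single
# up-front distinct() over the posts collection:
#   existing = set(posts.distinct('url'))
#   for url in find_missing_urls(urls.find(), existing):
#       get_report(url, posts)
```

Set membership is O(1), so the whole scan costs one query against posts plus one pass over urls, rather than 50,000+ individual find() calls.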

Additionally, the script is now crashing with a pymongo.errors.CursorNotFound: cursor id '…' error. I believe this means I need to lower the batch size. However, this only reinforces my belief that something about this loop is extremely inefficient.
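If the per-document loop has to stay, two standard pymongo options may address the timeout directly: no_cursor_timeout=True stops the server from reaping a cursor that outlives the default idle window (such a cursor must then be closed explicitly), and batch_size() limits how many documents each round trip fetches. A sketch under those assumptions, with the scan wrapped in a hypothetical scan_urls helper so the collections can be passed in; count_documents is the modern replacement for the deprecated Cursor.count():

```python
def scan_urls(urls, posts, get_report):
    """Scan every document in urls, generating a report for unseen URLs.

    urls, posts -- pymongo Collection objects (or anything with the same API)
    get_report  -- callback invoked as get_report(url, posts) for each new URL
    """
    # Keep the server-side cursor alive for long scans; fetch in small batches.
    cursor = urls.find(no_cursor_timeout=True).batch_size(100)
    try:
        for document in cursor:
            url = document['url'].split('.')[1]
            # count_documents(..., limit=1) stops after the first match.
            if posts.count_documents({'url': url}, limit=1) == 0:
                get_report(url, posts)
    finally:
        # no_cursor_timeout cursors are not reaped by the server,
        # so they must be closed explicitly.
        cursor.close()
```

This keeps the one-query-per-document shape of the original loop, so it only fixes the crash, not the hours-long runtime.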

