Tuesday, 30 March 2021

Code efficiency/performance improvement in Pushshift Reddit web scraping loop

I am extracting Reddit data via the Pushshift API. More precisely, I am interested in comments and posts (submissions) in subreddit X containing search word Y, made from now back until datetime Z (e.g. all comments mentioning "GME" in the subreddit /r/wallstreetbets). All of these parameters can be specified. So far, I have it working with the following code:
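
For context, a single request against Pushshift's comment search endpoint looks roughly like this (a minimal sketch only; the epoch value is just an example, and the parameters subreddit, q, before, size and sort are the same ones the script below builds into its URL):

import requests

# One Pushshift query: comments mentioning "GME" in /r/wallstreetbets,
# newest first, created before the given UNIX timestamp.
resp = requests.get(
    "https://api.pushshift.io/reddit/search/comment/",
    params={
        "subreddit": "wallstreetbets",
        "q": "GME",
        "before": 1617058800,  # example epoch (late March 2021)
        "size": 100,
        "sort": "desc",
    },
)
print(len(resp.json()["data"]))  # number of comments returned in this batch

The full script loops over such requests, moving the "before" timestamp further back on every iteration: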

import pandas as pd
import requests
from datetime import datetime
import traceback
import time
import json
import sys
import numpy as np

username = ""  # put the username you want to download in the quotes
subreddit = "gme"  # put the subreddit you want to download in the quotes
search_query = "gamestop" # put the word you want to search for (present in comment or post) in the quotes
# leave either one blank to download an entire user's, subreddit's, or search word's history
# or fill in all three to download a specific user's history from a specific subreddit mentioning a specific word

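# Build the query-string fragment from whichever of the three filters are filled in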
filter_string = None
if username == "" and subreddit == "" and search_query == "":
    print("Fill in either username or subreddit")
    sys.exit(0)
elif username == "" and subreddit != "" and search_query == "":
    filter_string = f"subreddit={subreddit}"
elif username != "" and subreddit == "" and search_query == "":
    filter_string = f"author={username}"
elif username == "" and subreddit != "" and search_query != "":
    filter_string = f"subreddit={subreddit}&q={search_query}"
elif username == "" and subreddit == "" and search_query != "":
    filter_string = f"q={search_query}"    
else:
    filter_string = f"author={username}&subreddit={subreddit}&q={search_query}"

url = "https://api.pushshift.io/reddit/search/{}/?size=500&sort=desc&{}&before="

start_time = datetime.utcnow()

def redditAPI(object_type):
    global df_comments    
    df_comments = pd.DataFrame(columns=["date", "comment", "score", "id"])
    global df_posts    
    df_posts = pd.DataFrame(columns=["date", "post", "score", "id"])      

    print(f"\nLooping through {object_type}s and append to dataframe...")

    count = 0
    previous_epoch = int(start_time.timestamp())
    while True:
        # Ensures that loop breaks at March 16 2021 for testing purposes
        if previous_epoch <= 1615849200:
            break
            
        new_url = url.format(object_type, filter_string)+str(previous_epoch)
        json_text = requests.get(new_url)
        time.sleep(1)  # Pushshift has a rate limit; if we send requests too fast it will start returning error messages
        try:
            json_data = json.loads(json_text.text)
        except json.decoder.JSONDecodeError:
            time.sleep(1)
            continue

        if 'data' not in json_data:
            break
        objects = json_data['data']
        if len(objects) == 0:
            break
            
        df2 = pd.DataFrame.from_dict(objects)

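        # Walk the batch (newest first, because of sort=desc); after this loop,
        # previous_epoch holds the oldest created_utc in the batch minus one,
        # which becomes the "before" value for the next request.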
        for obj in objects:
            previous_epoch = obj['created_utc'] - 1
            count += 1
            if object_type == "comment":
                df2.rename(columns={'created_utc': 'date', 'body': 'comment'}, inplace=True)
                df_comments = df_comments.append(df2[['date', 'comment', 'score']])
            elif object_type == "submission":
                df2.rename(columns={'created_utc': 'date', 'selftext': 'post'}, inplace=True)
                df_posts = df_posts.append(df2[['date', 'post', 'score']])
                    
    # Convert UNIX to datetime
    df_comments["date"] = pd.to_datetime(df_comments["date"],unit='s')
    df_posts["date"] = pd.to_datetime(df_posts["date"],unit='s')
    
    # Drop blank rows (the case when a post consists only of an image)
    df_posts['post'].replace('', np.nan, inplace=True)
    df_posts.dropna(subset=['post'], inplace=True)
    
    # Drop duplicates (see last comment on https://www.reddit.com/r/pushshift/comments/b7onr6/max_number_of_results_returned_per_query/)
    df_comments = df_comments.drop_duplicates()
    df_posts = df_posts.drop_duplicates()
    print("\nDone. Saved to dataframe.")

Unfortunately, I have some performance issues. Because I paginate based on created_utc - 1 (so that I do not miss any comments/posts), the initial dataframe will contain duplicates (there won't be 100 new comments/posts, the API limit per request, in every single second). If I run the code over a long time frame (e.g. from the current time back to 1 March 2021), this results in a huge dataframe that takes a considerable amount of time to process.

As the code stands, the duplicates are added to the dataframe and only removed after the loop. Is there a way to make this more efficient, e.g. by checking within the for loop whether the object already exists in the dataframe? Would that make a difference performance-wise? Any input would be very much appreciated.
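
For illustration, the kind of in-loop check I have in mind would look roughly like this (just a sketch: seen_ids and comment_rows are names I am making up here, and I am assuming each Pushshift object carries a unique id field, which comments and submissions do):

# Initialised once, before the while loop
seen_ids = set()       # ids already stored
comment_rows = []      # plain dicts are cheaper to append to than a DataFrame

# Inside the existing while loop, replacing the inner for loop
for obj in objects:
    previous_epoch = obj['created_utc'] - 1
    if obj['id'] in seen_ids:  # duplicate from the overlapping pagination window
        continue
    seen_ids.add(obj['id'])
    if object_type == "comment":
        comment_rows.append({
            "date": obj['created_utc'],
            "comment": obj['body'],
            "score": obj['score'],
            "id": obj['id'],
        })

# After the while loop, build the DataFrame once
# df_comments = pd.DataFrame(comment_rows)
# df_comments["date"] = pd.to_datetime(df_comments["date"], unit='s')

Membership checks against a set are O(1) per object, as opposed to checking against the dataframe on every iteration, but I don't know how that compares in practice to a single drop_duplicates() at the end.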

