Saturday, 17 April 2021

Length of values does not match length of index during a for loop

I have a dataset (df) like this:

Name1 Name2 Score
John    NaN  NaN
Patty    NaN  NaN

where Name2 and Score are initialized to NaN. Some data, like the following

name2_list = [["Chris", "Luke", "Martin"], ["Martin"]]
score_list = [[1, 2, 4], [3]]

is generated by a function at each iteration. These two lists need to be added to the columns Name2 and Score of my df, so that I get:

Name1 Name2         Score
John    [Chris, Luke, Martin]  [1,2,4]
Patty    [Martin]  [3]

Then, since I want to have values and not lists in Name2 and Score, I expand the dataset:

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
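
As a rough illustration of this expansion step, here is a minimal sketch that builds the two-row frame from the example and explodes both list columns together. It assumes the lists in Name2 and Score have the same length in every row; setting Name1 as the index first is one common way to keep the name attached to each exploded value, not necessarily the only one.

```python
import pandas as pd

# Example frame from the post: one list of names and one list of scores per row
df = pd.DataFrame({
    "Name1": ["John", "Patty"],
    "Name2": [["Chris", "Luke", "Martin"], ["Martin"]],
    "Score": [[1, 2, 4], [3]],
})

# Explode every list column in parallel; Name1 travels along as the index
expanded = (
    df.set_index("Name1")
      .apply(pd.Series.explode)
      .reset_index()
)

print(expanded)
```

If the two lists in a row differ in length, the exploded columns no longer line up, so that case would need to be handled before this step.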

My goal is for every value in Name2 to also appear in Name1. As I mentioned, I have a function that works as follows: for each element in Name2 that is not yet in Name1, it checks whether there are further values. The values it generates look like the ones shown in name2_list and score_list. For example, suppose that at the second iteration the function returns [Patty] and 9 for Chris, [Martin] and 1 for Luke, and [Laura] and 3 for Martin. I then need to add these values to my original df in order to have (before exploding)

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
Chris   Patty    9
Luke    Martin   1
Martin  Laura    3

Only one value, Laura, is not in Name1 yet, so I need to run the function again. If everything it returns is already in Name1, the loop stops and I get the final dataset; otherwise I rerun the function and check whether more iterations are required. To keep the example short, suppose that the function returns John and 3 for Laura. John is already in Name1, so I do not need to rerun the function.

To recap:

  1. my initial dataset had only John and Patty in Name1, and Name2 and Score were initialized to NaN;
  2. since it is the first iteration, Name2 is empty, so only in this case we calculate the difference between Name2 and Name1, excluding NaN values: we run the function once for John and once for Patty;
  3. the output for John is [Chris, Luke, Martin] and [1,2,4], while the one for Patty is [Martin] and [3]. The function returns its output as follows: [[Chris, Luke, Martin],[Martin]] and [[1,2,4],[3]];
  4. I add these results in column Name2 and Score, then I expand the df in order to get values and not lists.
  5. I calculate the difference between Name2 and Name1: I run the function only for those values in Name2 that are NOT in Name1.
  6. I repeat steps 3, 4 and 5. The loop stops when all the values in Name2 are also in Name1.

Expected output:

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
Chris   Patty    9
Luke    Martin   1
Martin  Laura    3
Laura   John     3

Summarizing the problem:

  • the goal is to have all the (distinct) values in Name2 in Name1
  • the function is applied only to those values in Name2 that are not in Name1 yet
  • every time (i.e., at each iteration) I need to check the difference between values in Name2 and values in Name1, as the loop ends when the length of this difference is 0, i.e., when all the values in Name2 are in Name1.
  • since the output of the function is a list of lists (for both Name2 and Score), I need to explode my columns/dataframe in order to work with values and not with lists.
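
The algorithm summarized above can be sketched without any dataframe at all, which may make the stopping condition easier to see. In this sketch, `lookup` and `FAKE_RESULTS` are hypothetical stand-ins for the real scraping function, filled with the values used in this example, and `crawl` is an illustrative name. Results are collected as flat (Name1, Name2, Score) rows, so nothing has to be exploded afterwards.

```python
# Hypothetical stand-in for the scraping function: name -> (names, scores).
# The values mirror the example in this post.
FAKE_RESULTS = {
    "John": (["Chris", "Luke", "Martin"], [1, 2, 4]),
    "Patty": (["Martin"], [3]),
    "Chris": (["Patty"], [9]),
    "Luke": (["Martin"], [1]),
    "Martin": (["Laura"], [3]),
    "Laura": (["John"], [3]),
}


def lookup(name):
    """Return (names, scores) for one name; two empty lists if nothing is found."""
    return FAKE_RESULTS.get(name, ([], []))


def crawl(seeds):
    """Look up names round by round until no unseen name is left."""
    rows = []              # one (Name1, Name2, Score) tuple per pair
    seen = set()           # names that have already been looked up
    pending = list(seeds)  # names to look up in the current round
    while pending:
        next_pending = []
        for name in pending:
            if name in seen:
                continue
            seen.add(name)
            names, scores = lookup(name)
            for other, score in zip(names, scores):
                rows.append((name, other, score))
                if other not in seen and other not in next_pending:
                    next_pending.append(other)
        pending = next_pending
    return rows


rows = crawl(["John", "Patty"])
for row in rows:
    print(row)
```

Tracking the names that were already looked up, rather than only the names present in Name1, also stops the loop when a name returns no results at all. The rows can be turned into a dataframe once at the end, for example with `pd.DataFrame(rows, columns=["Name1", "Name2", "Score"])`.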

If something is still not clear in the algorithm, please let me know. What I have done is the following:

name2_list, score_list = [],[]   # Initialize lists. These two lists need to store outputs from my function

name2 = df['Name2']              # Append new name2 to this list as I iterate
name1 = df['Name1']              # Append new name1 to this list as I iterate
distinct_name1 = set(name1)      # distinct name1. I need this to calculate the difference
diff = set(name2) ^ distinct_name1 # This calculates the difference. I need to iterate until this list is empty, i.e., when len(diff)=0


if df.Name2.isnull().all():  # this condition is to start the process. At the beginning I have only values in Name1. No values in Name2

    if len(diff)>0: # in the example the difference is 2 at the beginning, i.e., John and Patty; at the second round 3 (Chris, Luke, Martin); at the third round is only for Laura. There is no fourth round 
        for x in diff: # I run it first for John, then for Patty
            collected_data = fun(df, diff) # I will explain below what this function does and how it looks like
    
        df = df.apply(pd.Series.explode) # in this step I explode the dataset

        name2 = df['Name2']             # I am updating the list of values in Name2 to calculate the difference after each iteration. 
        name1 = df['Name1']             # I am updating the list of values in Name1 to calculate the difference after each iteration. 
        distinct_name1 = set(name1)    # calculate the new difference
        diff = filter(None, (set(name2) ^ distinct_name1) ) # calculate the new difference. Iterate until this is empty 
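
Two details of the difference computation above may be worth checking in isolation. The sketch below uses the names from this example after the first expansion; the variable names are illustrative. It compares the `^` operator (symmetric difference) with `-` (plain difference), and shows what `filter(None, ...)` does with NaN.

```python
name1 = {"John", "Patty"}
name2 = {"Chris", "Luke", "Martin"}

# ^ keeps everything that is in exactly one of the two sets,
# so names that are only in Name1 come back as well
symmetric = name2 ^ name1

# - keeps only the names that are in Name2 but not in Name1
missing = name2 - name1

# filter(None, ...) drops falsy items such as None, but NaN is truthy
values = ["Chris", float("nan"), None]
kept_by_filter = list(filter(None, values))

# Keeping only strings removes both None and NaN
only_names = [v for v in values if isinstance(v, str)]

print(sorted(symmetric))
print(sorted(missing))
print(len(kept_by_filter), only_names)
```

With `^`, John and Patty would be looked up again in the second round even though they are already in Name1, which is probably not what the stopping condition needs.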

The function looks like this:

from bs4 import BeautifulSoup
import requests
from ast import literal_eval

def fun(df, diff):
    for x in diff:  # run the function for each name in Name2 that are not in Name1
        url = "https://www.website.com/siteinfo/"+ x # I put a fake name website
        soup = BeautifulSoup(requests.get(url).content, "html.parser")

        websites = soup.select("#card_mini_audience .site>a")   # Selects the websites in the table
        scores = soup.select("#card_mini_audience .overlap>.truncation")    # Selects the corresponding scores

        for pair in zip(websites, scores): # this is the type of output of the function, i.e., (name2, score)
        name2_list.append(websites) 
        score_list.append(scores)
                    
    df['Name2'] = name2_list # this is currently causing an error. The length of value is different from that one of the index
    df['Score'] = score_list # this probably will cause the same error as the above one.
    
    return df

An error occurs at the line df['Name2'] = name2_list in the function:

---> 33 df['Name2'] = name2_list

saying:

ValueError: Length of values (6) does not match length of index (8).

(the numbers in parentheses may differ from the ones you would get with this example)

My function currently does not take into account how many rows are in the dataframe, and it creates new lists of a different length. I need to find a way to reconcile this. While debugging, I confirmed that the error comes from df['Name2'] = name2_list in the function. I can print the list of new name2 values correctly, but I cannot assign it to the column. A possible solution might be to build the df once, outside of the for loop, but I still need to explode df['Name2'] and to build lists in which to store the results from the web.
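
The mismatch can be reproduced on a tiny frame. The sketch below first triggers the same kind of ValueError by assigning three values to a two-row frame, then shows one possible way around it: keeping the new results as rows of their own and appending them with `pd.concat`, instead of assigning them to an existing column. The frames and values are illustrative, taken from the example in this post.

```python
import pandas as pd

df = pd.DataFrame({"Name1": ["John", "Patty"]})
new_values = ["Chris", "Luke", "Martin"]  # 3 values for a 2-row frame

# Column assignment requires exactly one value per existing row
try:
    df["Name2"] = new_values
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Alternative: treat each new result as a new row and append it
existing = pd.DataFrame({"Name1": ["John"], "Name2": ["Chris"], "Score": [1]})
new_rows = pd.DataFrame(
    [("Chris", "Patty", 9), ("Luke", "Martin", 1)],
    columns=["Name1", "Name2", "Score"],
)
combined = pd.concat([existing, new_rows], ignore_index=True)

print(combined)
```

Appending rows means the number of results per iteration no longer has to match the current length of the frame.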

I have no idea how to fix it, and I would really appreciate your help.


