I have a dataset (df) like this
Name1 Name2 Score
John NaN NaN
Patty NaN NaN
where Name2 and Score are initialized to NaN. Some data, like the following
name2_list=[[Chris, Luke, Martin], [Martin]]
score_list=[[1,2,4],[3]]
are generated at each iteration by a function. These two lists need to be added to columns Name2 and Score in my df, in order to have:
Name1 Name2 Score
John [Chris, Luke, Martin] [1,2,4]
Patty [Martin] [3]
Then, since I want to have values and not lists in Name2 and Score, I expand the dataset:
Name1 Name2 Score
John Chris 1
John Luke 2
John Martin 4
Patty Martin 3
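For illustration, the expansion step can be sketched as follows. This is only a minimal sketch on the toy data above: the lists are typed in by hand instead of coming from the function, and exploding several columns at once assumes pandas 1.3 or later and that the lists in Name2 and Score have the same length on each row.

```python
import pandas as pd

# Toy data mirroring the example; the lists stand in for the function's output.
df = pd.DataFrame({
    "Name1": ["John", "Patty"],
    "Name2": [["Chris", "Luke", "Martin"], ["Martin"]],
    "Score": [[1, 2, 4], [3]],
})

# Explode both list columns together so that each (Name2, Score) pair
# ends up on its own row; Name1 is repeated for every pair.
expanded = df.explode(["Name2", "Score"], ignore_index=True)
print(expanded)
```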
My goal is for every value in Name2 to also appear in Name1. As I mentioned, I have a function that works as follows: for each element in Name2 that is not in Name1, it checks whether there are further values. The generated values are similar to those seen in name2_list and score_list. For example, let's say that, at the second iteration, Chris has values generated from the function equal to [Patty] and 9; Luke has values [Martin] and 1; Martin has values [Laura] and 3. I then need to add these values to my original df in order to have (after exploding)
Name1 Name2 Score
John Chris 1
John Luke 2
John Martin 4
Patty Martin 3
Chris Patty 9
Luke Martin 1
Martin Laura 3
Only one value, Laura, is not in Name1 yet, so I need to run the function again: if the output is already included in Name1, my loop stops and I get the final dataset; otherwise, I need to rerun the function and check whether more loops are required. To keep this example short, let's suppose that the output for Laura is John, 3. John is already in Name1, so I do not need to rerun the function.
To recap:
1) my initial dataset had only John and Patty in Name1, and Name2 and Score were initialized to NaN;
2) since it is the first iteration, Name2 is empty, so only in this case we calculate the difference between Name2 and Name1 excluding NaN values, i.e., we run the function once for John and once for Patty;
3) the output for John is [Chris, Luke, Martin] and [1,2,4], while the one for Patty is [Martin] and [3]. The function returns an output as follows: [[Chris, Luke, Martin],[Martin]] and [[1,2,4],[3]];
4) I add these results to columns Name2 and Score, then I expand the df in order to get values and not lists;
5) I calculate the difference between Name2 and Name1: I run the function only for those values in Name2 that are NOT in Name1;
6) repeat steps 3), 4) and 5). The loop stops when all the values in Name2 are also in Name1.
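To make the recap concrete, the loop can be sketched as below. This is a minimal sketch under stated assumptions, not my real code: FAKE_RESULTS and lookup are hypothetical stand-ins for the scraping function, filled with the values from this example, and the rows are collected in a plain list so that the df is built only once at the end.

```python
import pandas as pd

# Hypothetical stand-in for the scraping function (values taken from the example).
FAKE_RESULTS = {
    "John": (["Chris", "Luke", "Martin"], [1, 2, 4]),
    "Patty": (["Martin"], [3]),
    "Chris": (["Patty"], [9]),
    "Luke": (["Martin"], [1]),
    "Martin": (["Laura"], [3]),
    "Laura": (["John"], [3]),
}

def lookup(name):
    """Return (names, scores) for one name; empty lists if nothing is found."""
    return FAKE_RESULTS.get(name, ([], []))

rows = []                  # one (Name1, Name2, Score) tuple per exploded row
seen = set()               # names that have already been used as Name1
todo = ["John", "Patty"]   # names still to be passed to the function

while todo:
    next_todo = []
    for name in todo:
        if name in seen:   # already processed in this or an earlier round
            continue
        seen.add(name)
        names, scores = lookup(name)
        for name2, score in zip(names, scores):
            rows.append((name, name2, score))
            # Queue values of Name2 that have not been used as Name1 yet.
            if name2 not in seen and name2 not in next_todo:
                next_todo.append(name2)
    todo = next_todo

df = pd.DataFrame(rows, columns=["Name1", "Name2", "Score"])
print(df)
```

With these stand-in values the loop runs three rounds and yields the eight rows of the expected output. Tracking processed names in seen also lets the loop end when a name returns nothing, which is a design choice of this sketch and not something stated in the steps above.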
Expected output:
Name1 Name2 Score
John Chris 1
John Luke 2
John Martin 4
Patty Martin 3
Chris Patty 9
Luke Martin 1
Martin Laura 3
Laura John 3
Summarizing the problem:
- the goal is to have all the (distinct) values in Name2 also in Name1;
- the function is applied only to those values in Name2 that are not in Name1 yet;
- at each iteration I need to check the difference between values in Name2 and values in Name1, as the loop ends when the length of this difference is 0, i.e., when all the values in Name2 are in Name1;
- since the output of the function is a list of lists (for both Name2 and Score), I need to explode my columns/dataframe in order to work with values and not with lists.
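The difference check in the third point can be illustrated on the exploded toy data from the first iteration. This is a small sketch only, assuming the column names Name1 and Name2; the variable names missing and symmetric are introduced here for illustration. It also shows how the one-sided set difference (-) differs from the symmetric difference (^).

```python
import pandas as pd

# Exploded toy data after the first iteration of the example.
df = pd.DataFrame({
    "Name1": ["John", "John", "John", "Patty"],
    "Name2": ["Chris", "Luke", "Martin", "Martin"],
    "Score": [1, 2, 4, 3],
})

name1_values = set(df["Name1"])
name2_values = set(df["Name2"].dropna())  # drop NaN so it is never treated as a name

# Values of Name2 that are not in Name1 yet: one-sided set difference.
missing = name2_values - name1_values

# The symmetric difference also contains names that are only in Name1.
symmetric = name2_values ^ name1_values

print(missing)
print(symmetric)
```

Here missing holds Chris, Luke and Martin, while symmetric additionally holds John and Patty, so a loop that waits for the symmetric difference to become empty may never stop.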
If something is still not clear in the algorithm, please let me know. What I have done is the following:
name2_list, score_list = [], []  # Initialize lists. These two lists need to store outputs from my function
name2 = df['Name2']  # Append new name2 to this list as I iterate
name1 = df['Name1']  # Append new name1 to this list as I iterate
distinct_name1 = set(name1)  # distinct name1. I need this to calculate the difference
diff = set(name2) ^ distinct_name1  # This calculates the difference. I need to iterate until this list is empty, i.e., when len(diff)=0
if df.Name2.isnull().all():  # this condition is to start the process. At the beginning I have only values in Name1. No values in Name2
    if len(diff) > 0:  # in the example the difference is 2 at the beginning, i.e., John and Patty; at the second round 3 (Chris, Luke, Martin); at the third round is only for Laura. There is no fourth round
        for x in diff:  # I run it first for John, then for Patty
            collected_data = fun(df, diff)  # I will explain below what this function does and how it looks like
            df = df.apply(pd.Series.explode)  # in this step I explode the dataset
            name2 = df['Name2']  # I am updating the list of values in Name2 to calculate the difference after each iteration.
            name1 = df['Name1']  # I am updating the list of values in Name1 to calculate the difference after each iteration.
            distinct_name1 = set(name1)  # calculate the new difference
            diff = filter(None, (set(name2) ^ distinct_name1))  # calculate the new difference. Iterate until this is empty
The function looks like this:
from bs4 import BeautifulSoup
import requests
from ast import literal_eval

def fun(df, diff):
    for x in diff:  # run the function for each name in Name2 that are not in Name1
        url = "https://www.website.com/siteinfo/" + x  # I put a fake name website
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        websites = soup.select("#card_mini_audience .site>a")  # Selects the websites in the table
        scores = soup.select("#card_mini_audience .overlap>.truncation")  # Selects the corresponding scores
        for pair in zip(websites, scores):  # this is the type of output of the function, i.e., (name2, score)
            name2_list.append(websites)
            score_list.append(scores)
    df['Name2'] = name2_list  # this is currently causing an error. The length of value is different from that one of the index
    df['Score'] = score_list  # this probably will cause the same error as the above one.
    return df
An error occurs at the step df['Name2'] = name2_list in the function:
---> 33 df['Name2'] = name2_list
saying:
ValueError: Length of values (6) does not match length of index (8).
(the values inside the round brackets may differ from those you would get by using this example)
My function currently ignores how many rows the dataframe has and creates new lists of a different length, so I need a way to reconcile the two. While debugging, I confirmed that the error comes from df['Name2'] = name2_list in the function. I can print the list of new name2 values correctly, but I cannot assign it to the column. A possible solution might be to build the df once outside the for loop, but I still need to explode df['Name2'] and build lists in which to store the results from the web.
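The length mismatch can be reproduced in isolation, without any scraping. The snippet below is only a sketch with made-up values (an 8-row frame and a 6-item list, matching the numbers in the message above); the exact wording of the message may vary between pandas versions.

```python
import pandas as pd

# A frame with 8 rows, as after a few rounds of exploding (values are made up).
df = pd.DataFrame({"Name1": list("abcdefgh")})

# A list with only 6 items, like a name2_list collected independently of df.
name2_list = [1, 2, 3, 4, 5, 6]

message = None
try:
    # Assigning a plain list as a column requires exactly one item per row.
    df["Name2"] = name2_list
except ValueError as exc:
    message = str(exc)

print(message)
```

Assigning a list to a column only works when the list has one item per existing row, which is why appending to the lists in every iteration and then assigning them to an already exploded df fails.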
I have no idea how to fix it, and I would really appreciate your help.
from Length of values does not match length of index during a for loop