I’ve noticed that assigning to a pandas DataFrame column (using the .loc indexer) behaves differently depending on what other columns are present in the DataFrame and on the exact form of the assignment. Consider three example DataFrames and a NumPy array x:
import numpy
import pandas

df1 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
})
#         col1
# 0  [1, 2, 3]
# 1  [4, 5, 6]
# 2  [7, 8, 9]

df2 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
})
#         col1          col2
# 0  [1, 2, 3]  [10, 20, 30]
# 1  [4, 5, 6]  [40, 50, 60]
# 2  [7, 8, 9]  [70, 80, 90]

df3 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3]
})
#         col1  col2
# 0  [1, 2, 3]     1
# 1  [4, 5, 6]     2
# 2  [7, 8, 9]     3

x = numpy.array([[111, 222, 333],
                 [444, 555, 666],
                 [777, 888, 999]])
I’ve found the following:
- df1:
  - df1.col1 = x
    Result:
    df1
    #    col1
    # 0   111
    # 1   444
    # 2   777
  - df1.loc[:, 'col1'] = x
    Result:
    df1
    #    col1
    # 0   111
    # 1   444
    # 2   777
  - df1.loc[0:2, 'col1'] = x
    Result:
    # […]
    # ValueError: could not broadcast input array from shape (3,3) into shape (3)
- df2:
  - df2.col1 = x
    Result:
    df2
    #    col1          col2
    # 0   111  [10, 20, 30]
    # 1   444  [40, 50, 60]
    # 2   777  [70, 80, 90]
  - df2.loc[:, 'col1'] = x
    Result:
    df2
    #    col1          col2
    # 0   111  [10, 20, 30]
    # 1   444  [40, 50, 60]
    # 2   777  [70, 80, 90]
  - df2.loc[0:2, 'col1'] = x
    Result:
    # […]
    # ValueError: could not broadcast input array from shape (3,3) into shape (3)
- df3:
  - df3.col1 = x
    Result:
    df3
    #    col1  col2
    # 0   111     1
    # 1   444     2
    # 2   777     3
  - df3.loc[:, 'col1'] = x
    Result:
    # ValueError: Must have equal len keys and value when setting with an ndarray
  - df3.loc[0:2, 'col1'] = x
    Result:
    # ValueError: Must have equal len keys and value when setting with an ndarray
So df.loc seems to behave differently if one of the other columns in the DataFrame does not have dtype object.
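To make the dtype difference concrete, here is a minimal check that rebuilds the three example DataFrames and reports which of them consist only of object-dtype columns. The name all_object is just a label I chose for this sketch; it inspects dtypes only and says nothing about why .loc behaves as it does.

```python
import pandas
from pandas.api.types import is_object_dtype

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
tens = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

frames = {
    'df1': pandas.DataFrame({'col1': rows}),
    'df2': pandas.DataFrame({'col1': rows, 'col2': tens}),
    'df3': pandas.DataFrame({'col1': rows, 'col2': [1, 2, 3]}),
}

# all_object (hypothetical name): True when every column holds Python
# objects (here: lists), False when at least one column is numeric.
all_object = {
    name: all(is_object_dtype(dtype) for dtype in df.dtypes)
    for name, df in frames.items()
}

for name, df in frames.items():
    print(name, dict(df.dtypes.astype(str)), 'all object:', all_object[name])
```

Only df3 mixes an object column with a numeric one, which matches the case where the .loc assignments above fail with a different error.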
My questions are:
- Why would the presence of other columns make a difference in this kind of assignment?
- Why are the different versions of the assignment not equivalent? In particular, in the cases which don’t result in a ValueError, why is the DataFrame column filled with the values of the first column of the numpy array?
Note: I’m not interested in discussing whether it makes sense to assign a column to a numpy array in this way. I only want to know about the differences in behavior, and whether this might count as a bug.
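Since the outcome may well depend on the installed pandas and numpy versions, here is a small sketch for re-running all nine combinations on fresh copies of the frames. The helper names (make_frames, ASSIGNMENTS, run_all) are made up for this example, and I deliberately record whatever happens rather than assuming the results listed above.

```python
import numpy
import pandas

x = numpy.array([[111, 222, 333],
                 [444, 555, 666],
                 [777, 888, 999]])


def make_frames():
    """Rebuild the three example DataFrames (hypothetical helper)."""
    rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    tens = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
    return {
        'df1': pandas.DataFrame({'col1': rows}),
        'df2': pandas.DataFrame({'col1': rows, 'col2': tens}),
        'df3': pandas.DataFrame({'col1': rows, 'col2': [1, 2, 3]}),
    }


def assign_attr(df):
    df.col1 = x


def assign_loc_all(df):
    df.loc[:, 'col1'] = x


def assign_loc_slice(df):
    df.loc[0:2, 'col1'] = x


# The three assignment forms from the question, keyed by a readable label.
ASSIGNMENTS = {
    "df.col1 = x": assign_attr,
    "df.loc[:, 'col1'] = x": assign_loc_all,
    "df.loc[0:2, 'col1'] = x": assign_loc_slice,
}


def run_all():
    """Try every assignment on a fresh frame; record the result or the error."""
    results = {}
    for label, assign in ASSIGNMENTS.items():
        for name, df in make_frames().items():
            try:
                assign(df)
                outcome = df['col1'].tolist()
            except Exception as exc:  # record any failure instead of stopping
                outcome = f'{type(exc).__name__}: {exc}'
            results[(name, label)] = outcome
    return results


results = run_all()
print('pandas', pandas.__version__, '| numpy', numpy.__version__)
for (name, label), outcome in results.items():
    print(f'{name}: {label} -> {outcome}')
```

Posting the printed version line together with the nine outcomes should make it easier to tell whether the behavior is version-specific.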
from Assigning to pandas DataFrame column behaves differently depending on other columns