Tuesday, 4 September 2018

Assigning to pandas DataFrame column behaves differently depending on other columns

I’ve noticed that assigning to a pandas DataFrame column (using the .loc indexer) behaves differently depending on what other columns are present in the DataFrame and on the exact form of the assignment. Take the following three example DataFrames and a 3×3 numpy array x:

import numpy
import pandas

df1 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
})
#         col1
# 0  [1, 2, 3]
# 1  [4, 5, 6]
# 2  [7, 8, 9]
df2 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
})
#         col1          col2
# 0  [1, 2, 3]  [10, 20, 30]
# 1  [4, 5, 6]  [40, 50, 60]
# 2  [7, 8, 9]  [70, 80, 90]
df3 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3]
})
#         col1  col2
# 0  [1, 2, 3]     1
# 1  [4, 5, 6]     2
# 2  [7, 8, 9]     3
x = numpy.array([[111, 222, 333],
                 [444, 555, 666],
                 [777, 888, 999]])

I’ve found the following:

  1. df1:

    1. df1.col1 = x

      Result:

      df1
      #    col1
      # 0   111
      # 1   444
      # 2   777
      
      
    2. df1.loc[:, 'col1'] = x

      Result:

      df1
      #    col1
      # 0   111
      # 1   444
      # 2   777
      
      
    3. df1.loc[0:2, 'col1'] = x

      Result:

      # […]
      # ValueError: could not broadcast input array from shape (3,3) into shape (3)
      
      
  2. df2:

    1. df2.col1 = x

      Result:

      df2
      #    col1          col2
      # 0   111  [10, 20, 30]
      # 1   444  [40, 50, 60]
      # 2   777  [70, 80, 90]
      
      
    2. df2.loc[:, 'col1'] = x

      Result:

      df2
      #    col1          col2
      # 0   111  [10, 20, 30]
      # 1   444  [40, 50, 60]
      # 2   777  [70, 80, 90]
      
      
    3. df2.loc[0:2, 'col1'] = x

      Result:

      # […]
      # ValueError: could not broadcast input array from shape (3,3) into shape (3)
      
      
  3. df3:

    1. df3.col1 = x

      Result:

      df3
      #    col1  col2
      # 0   111     1
      # 1   444     2
      # 2   777     3
      
      
    2. df3.loc[:, 'col1'] = x

      Result:

      # ValueError: Must have equal len keys and value when setting with an ndarray
      
      
    3. df3.loc[0:2, 'col1'] = x

      Result:

      # ValueError: Must have equal len keys and value when setting with an ndarray
      
      

So df.loc seems to behave differently when one of the other columns in the DataFrame does not have dtype object.
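
One way to check that premise is to print the dtypes of the three frames, together with the shapes involved in the assignment. This is only a minimal sketch; it rebuilds the example frames so that it runs on its own, and it makes no claim about why the assignments differ:

```python
import numpy
import pandas

# Rebuild the example frames from the question.
df1 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
})
df2 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
})
df3 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3]
})
x = numpy.array([[111, 222, 333],
                 [444, 555, 666],
                 [777, 888, 999]])

# Columns holding Python lists are stored with dtype object;
# col2 of df3 holds plain integers and gets an integer dtype.
for name, df in [('df1', df1), ('df2', df2), ('df3', df3)]:
    print(name, {col: str(dtype) for col, dtype in df.dtypes.items()})

# The value being assigned is two-dimensional, the target column is one-dimensional.
print('x.shape =', x.shape, '| col1 shape =', df1['col1'].shape)
```

So df1 and df2 contain only object columns, while df3 mixes an object column with an integer one, and in every case a (3, 3) array is being assigned into a column of shape (3,).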

My questions are:

  • Why would the presence of other columns make a difference in this kind of assignment?
  • Why are the different versions of the assignment not equivalent? In particular, in the cases that don’t raise a ValueError, why is the DataFrame column filled with the values of the first column of the numpy array?

Note: I’m not interested in discussing whether it makes sense to assign a numpy array to a column in this way. I only want to know about the differences in behavior, and whether this might count as a bug.
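
Since the results above may well depend on the installed pandas and numpy versions, here is a rough sketch of a script that runs all nine combinations on fresh frames and records either the resulting frame or the exception raised. The helper names (make_frames, ASSIGNMENTS, run_all) are made up for this illustration and are not part of pandas; the output is deliberately not shown, because it may differ from the results listed above on other versions:

```python
import numpy
import pandas


def make_frames():
    # Build fresh example frames so no case sees an earlier mutation.
    def rows():
        return [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    return {
        'df1': pandas.DataFrame({'col1': rows()}),
        'df2': pandas.DataFrame({
            'col1': rows(),
            'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
        }),
        'df3': pandas.DataFrame({'col1': rows(), 'col2': [1, 2, 3]}),
    }


def assign_attr(df, x):
    df.col1 = x


def assign_loc_all(df, x):
    df.loc[:, 'col1'] = x


def assign_loc_slice(df, x):
    df.loc[0:2, 'col1'] = x


# The three assignment forms from the question, keyed by a readable label.
ASSIGNMENTS = {
    "df.col1 = x": assign_attr,
    "df.loc[:, 'col1'] = x": assign_loc_all,
    "df.loc[0:2, 'col1'] = x": assign_loc_slice,
}


def run_all(x):
    # Try every (frame, assignment) pair and record what happened as a string.
    outcomes = {}
    for name in ('df1', 'df2', 'df3'):
        for label, assign in ASSIGNMENTS.items():
            df = make_frames()[name]
            try:
                assign(df, x)
                outcomes[(name, label)] = 'ok: ' + repr(df.to_dict(orient='list'))
            except Exception as exc:
                outcomes[(name, label)] = type(exc).__name__ + ': ' + str(exc)
    return outcomes


x = numpy.array([[111, 222, 333],
                 [444, 555, 666],
                 [777, 888, 999]])
outcomes = run_all(x)

print('pandas', pandas.__version__, '| numpy', numpy.__version__)
for (name, label), outcome in outcomes.items():
    print(name, '|', label, '->', outcome)
```

Each case gets its own freshly built frame, so a successful assignment in one case cannot affect the next one.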



