Sunday, 1 September 2019

Populate month wise dataframe from two date columns

I have a PySpark dataframe like this,

+----------+--------+----------+----------+
|id_       | p      |d1        |  d2      |
+----------+--------+----------+----------+
|  1       | A      |2018-09-26|2018-10-26|
|  2       | B      |2018-06-21|2018-07-19|
|  2       | B      |2018-08-13|2018-10-07|
|  2       | B      |2018-12-31|2019-02-27|
|  2       | B      |2019-05-28|2019-06-25|
|  3       |C       |2018-06-15|2018-07-13|
|  3       |C       |2018-08-15|2018-10-09|
|  3       |C       |2018-12-03|2019-03-12|
|  3       |C       |2019-05-10|2019-06-07|
| 4        | A      |2019-01-30|2019-03-01|
| 4        | B      |2019-05-30|2019-07-25|
| 5        |C       |2018-09-19|2018-10-17|
+----------+--------+----------+----------+

From this dataframe I have to derive another dataframe which has n columns, where each column is a month in the range month(min(d1)) to month(max(d2)).

For each row in the actual dataframe I want a row in the derived dataframe, and each column value must be the number of days of that row's interval that fall in that month.

For example,

for the first row, where id_ is 1 and p is A, I want to get a row in the derived dataframe where column 201809 has value 5 and column 201810 has value 26.

For the second row, where id_ is 2 and p is B, I want to get a row in the derived dataframe where column 201806 should be 9 and column 201807 should be 19.

For the second-to-last row, I want column 201905 filled with value 1, column 201906 with 30, and column 201907 with 25.

So basically, I want the dataframe populated such that, for each row in my original dataframe, there is a row in the derived dataframe where every column corresponding to a month in the range min(d1) to max(d2) is filled with the number of days of that interval covered in that particular month.
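
To make the mapping concrete, here is a minimal plain-pandas sketch of the per-row computation (it counts both endpoints of the interval, so the boundary convention may differ by one day from the sample output further down):

import pandas as pd

def days_per_month(d1, d2):
    # Every calendar day in [d1, d2], grouped into yyyyMM buckets
    days = pd.date_range(d1, d2, freq='D')
    return days.strftime('%Y%m').value_counts().sort_index().to_dict()

print(days_per_month('2018-09-26', '2018-10-26'))
# {'201809': 5, '201810': 26}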

I am currently doing this the hard way. I make n columns, one for each date from min(d1) to max(d2). I fill these columns with 1, then melt the data and filter based on the value. Finally I aggregate this dataframe to get my desired result, then select the p with the largest day count.

In code,

import pandas as pd
import pyspark.sql.functions as F

# Overall date range covered by the data
d = df.select(F.min('d1').alias('d1'), F.max('d2').alias('d2')).first()

# One column name per calendar day in the overall range
cols = [ c.strftime('%Y-%m-%d') for c in pd.period_range(d.d1, d.d2, freq='D') ]

# 1 if the day falls inside [d1, d2], else 0, for every day column
result = df.select('id_', 'p', *[ F.when((df.d1 <= c) & (df.d2 >= c), 1).otherwise(0).alias(c) for c in cols ])

# melt is not built into PySpark; a helper sketch is given after this code
melted_data = melt(result, id_vars=['id_', 'p'], value_vars=cols)

# Reduce each exploded date to its yyyyMM month
melted_data = melted_data.withColumn('Month', F.substring(F.regexp_replace('variable', '-', ''), 1, 6))

# Total days covered per id_, month and p
melted_data = melted_data.groupBy('id_', 'Month', 'p').agg(F.sum('value').alias('days'))

# Sort so the p with the most days comes first within each id_ and month
melted_data = melted_data.orderBy('id_', 'Month', 'days', ascending=[False, False, False])

# Keep the p with the largest day count for each id_ and month
final_data = melted_data.groupBy('id_', 'Month').agg(F.first('p').alias('p'))
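
The melt above is not a built-in PySpark function; it is a custom explode-based helper along these lines:

from pyspark.sql import functions as F

def melt(df, id_vars, value_vars, var_name='variable', value_name='value'):
    # One (name, value) struct per value column
    vars_and_vals = F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars])
    # Explode so every value column becomes its own row
    tmp = df.withColumn('_vars_and_vals', F.explode(vars_and_vals))
    cols = list(id_vars) + [
        F.col('_vars_and_vals')[x].alias(x) for x in (var_name, value_name)]
    return tmp.select(*cols)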

This code takes a lot of time to run even on a decently configured cluster. How can I improve it?

How can I achieve this task in a more optimized manner? Creating a column for every single date in the range does not seem to be the best solution.
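
One month-level direction I have been considering is something like the sketch below (untested; it assumes Spark 2.4+ for sequence, and the +1 counts both endpoints, which may need adjusting to match the exact convention):

from pyspark.sql import functions as F

# One row per calendar month overlapped by [d1, d2]
months = df.withColumn(
    'month_start',
    F.explode(F.expr("sequence(trunc(d1, 'MM'), trunc(d2, 'MM'), interval 1 month)")))

# Days of the interval that fall inside each month
overlap = months.withColumn(
    'days',
    F.datediff(F.least(F.col('d2'), F.last_day('month_start')),
               F.greatest(F.col('d1'), F.col('month_start'))) + 1)

# One column per month; d1/d2 kept in the key so each original row stays separate
wide = (overlap
        .withColumn('Month', F.date_format('month_start', 'yyyyMM'))
        .groupBy('id_', 'p', 'd1', 'd2')
        .pivot('Month')
        .sum('days')
        .fillna(0)
        .drop('d1', 'd2'))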

A small sample of the needed output is shown below,

+---+---+----------+----------+----------+----------+-------+
|id_|p  |201806    |201807    |201808    | 201809   | 201810|
+---+---+----------+----------+----------+----------+-------+
| 1 | A |         0|        0 |         0|      4   | 26    |
| 2 | B |         9|        19|         0|      0   | 0     |
| 2 | B |         0|        0 |        18|      30  | 7     |
+---+---+----------+----------+----------+----------+-------+



from Populate month wise dataframe from two date columns
