Saturday, 9 September 2023

How do I GroupShuffleSplit a parquet dataframe lazily?

I have a parquet dataset that looks like this:

>>> df.head()

|    |   game_size | match_id                                                         |   party_size |   player_assists |   player_kills | player_name   |   team_id |   team_placement |   team_kills |   team_assists |   kill_ratio |   assist_ratio |
|---:|------------:|:-----------------------------------------------------------------|-------------:|-----------------:|---------------:|:--------------|----------:|-----------------:|-------------:|---------------:|-------------:|---------------:|
|  0 |          37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO |            2 |                0 |              1 | SnuffIes      |         4 |               18 |            2 |              0 |          0.5 |            0.5 |
|  1 |          37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO |            2 |                0 |              1 | Ozon3r        |         4 |               18 |            2 |              0 |          0.5 |            0.5 |
|  2 |          37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO |            2 |                0 |              0 | bovize        |         5 |               33 |            0 |              0 |          0.5 |            0.5 |
|  3 |          37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO |            2 |                0 |              0 | sbahn87       |         5 |               33 |            0 |              0 |          0.5 |            0.5 |
|  4 |          37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO |            2 |                0 |              2 | GeminiZZZ     |        14 |               11 |            2 |              0 |          1   |            0.5 |

I need a Python solution, preferably with dask, pandas, or another dataframe library. Since the dataset is quite large and pandas doesn't support lazy evaluation, pandas seems out of the question. Dask, on the other hand, doesn't support any of the `sklearn.model_selection` splitters, since it lacks positional (integer-based) indexing.
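For reference, this is the eager pandas + scikit-learn version I'm effectively trying to reproduce (the file name is just a placeholder). It loads everything into memory, which is exactly what I can't afford:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_parquet("matches.parquet")  # placeholder path

# Group on match_id so every row of a match lands on the same
# side of the split.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["match_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
```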

Ideally, a simple GroupShuffleSplit that works with dask is all I need. Is there another library that supports this? If so, how do I do it with parquet in a lazy way?
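One direction I've been considering, as a sketch rather than a real answer (I'm not sure it's strictly equivalent to GroupShuffleSplit): since the unique `match_id` values are far smaller than the row data, split *them* in memory and then filter the lazy dask frame with `isin`. The path and train fraction below are placeholders:

```python
import numpy as np
import dask.dataframe as dd

ddf = dd.read_parquet("matches.parquet")  # placeholder path

# The unique group keys should fit comfortably in memory even when
# the rows don't; this is the only eager step.
match_ids = ddf["match_id"].unique().compute()

# Shuffle the groups and take the first 80% as the train groups,
# mimicking what GroupShuffleSplit does with group labels.
rng = np.random.default_rng(42)
shuffled = rng.permutation(match_ids)
n_train = int(0.8 * len(shuffled))
train_ids = set(shuffled[:n_train])

# These filters stay lazy; nothing is materialised until .compute()
# or .to_parquet() is called on train/test.
train = ddf[ddf["match_id"].isin(train_ids)]
test = ddf[~ddf["match_id"].isin(train_ids)]
```

Even this feels like a workaround, though, so I'd prefer a library that supports group-aware splitting out of the box.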



