I have a parquet dataset that looks like this:
>>> df.head()
| | game_size | match_id | party_size | player_assists | player_kills | player_name | team_id | team_placement | team_kills | team_assists | kill_ratio | assist_ratio |
|---:|------------:|:-----------------------------------------------------------------|-------------:|-----------------:|---------------:|:--------------|----------:|-----------------:|-------------:|---------------:|-------------:|---------------:|
| 0 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | SnuffIes | 4 | 18 | 2 | 0 | 0.5 | 0.5 |
| 1 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | Ozon3r | 4 | 18 | 2 | 0 | 0.5 | 0.5 |
| 2 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | bovize | 5 | 33 | 0 | 0 | 0.5 | 0.5 |
| 3 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | sbahn87 | 5 | 33 | 0 | 0 | 0.5 | 0.5 |
| 4 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 2 | GeminiZZZ | 14 | 11 | 2 | 0 | 1 | 0.5 |
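For context, here is roughly how I'm reading the dataset lazily with dask (the file path is just a placeholder for the real location):

```python
import dask.dataframe as dd

# Lazy read: nothing is loaded into memory until .compute() / .head() is called.
# "matches.parquet" is a placeholder path for the actual dataset.
df = dd.read_parquet("matches.parquet")

df.head()  # only pulls a few rows from the first partition
```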
I need a Python solution, preferably with dask, pandas or another dataframe library. Pandas doesn't support lazy evaluation, and the dataset is quite large, so pandas seems out of the question. Dask, on the other hand, doesn't support any of the sklearn.model_selection splitters, since it lacks integer-based indexing. Ideally, a simple GroupShuffleSplit that works with dask is all I need. Is there any other library that supports this? If so, how do I do this with parquet in a lazy way?
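For reference, this is what I would write with plain pandas and scikit-learn if the data fit in memory; it's exactly this behaviour (all rows of a given match_id landing on the same side of the split) that I want reproduced lazily:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# In-memory version of the split I'm after (not feasible here, the data are too big).
df = pd.read_parquet("matches.parquet")  # placeholder path

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# Grouping on match_id keeps every row of a match in the same fold.
train_idx, test_idx = next(gss.split(df, groups=df["match_id"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
```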