I have a dataset with integer columns for 'year'
, 'month'
and 'day'
.
For a given start_date
and end_date
(pandas Timestamps), I want to build an expression to filter for all the rows between these dates.
I have the following working solution, which implements precedence of year over month over day manually:
import pandas as pd
import pyarrow.dataset as ds
import pyarrow.compute as pc
def get_partition_filter(start_time: pd.Timestamp,
end_time: pd.Timestamp) -> pc.Expression:
# Extract year, month, and day from start and end times
start_year, start_month, start_day = start_time.year, start_time.month, start_time.day
end_year, end_month, end_day = end_time.year, end_time.month, end_time.day
# Construct filter
return ((ds.field("year") > start_year) | ((ds.field("year") == start_year) & (ds.field("month") > start_month)) |
((ds.field("year") == start_year) & (ds.field("month") == start_month) & (ds.field("day") >= start_day))) & \
((ds.field("year") < end_year) | ((ds.field("year") == end_year) & (ds.field("month") < end_month)) |
((ds.field("year") == end_year) & (ds.field("month") == end_month) & (ds.field("day") <= end_day)))
I also have the following hack, which circumvents lexicographical ordering by treating the dates as one integer each and constructing a derived field.
def get_partition_filter_new(start_time: pd.Timestamp,
end_time: pd.Timestamp) -> pc.Expression:
# Construct ints from dates
start_date_int, end_date_int = (
int(time.strftime('%Y%m%d')) for time in (start_time, end_time)
)
# Construct the derived field
derived_field = (
ds.field("year")*10**4 +
ds.field("month")*10**2 +
ds.field("day")
)
# Construct the filter expression
return (derived_field >= start_date_int) & (derived_field <= end_date_int)
Is there a native way to do something like this in pyarrow
? To rephrase the question, I am looking for something like this (pseudo-code):
# PSEUDO-CODE
(start_year_int, start_month_int, start_day_int) <= (ds.field("year"), ds.field("month"), ds.field("day")) <= (end_year_int, end_month_int, end_day_int)
from Construct filtering expression based on lexicographical ordering
No comments:
Post a Comment