Monday, 4 September 2023

Construct filtering expression based on lexicographical ordering

I have a dataset with integer columns for 'year', 'month' and 'day'.

For a given start_date and end_date (pandas Timestamps), I want to build an expression to filter for all the rows between these dates.

I have the following working solution, which implements precedence of year over month over day manually:

import pandas as pd
import pyarrow.dataset as ds
import pyarrow.compute as pc

def get_partition_filter(start_time: pd.Timestamp,
                         end_time: pd.Timestamp) -> pc.Expression:
    # Extract year, month, and day from start and end times
    start_year, start_month, start_day = start_time.year, start_time.month, start_time.day
    end_year, end_month, end_day = end_time.year, end_time.month, end_time.day

    # Construct filter
    return ((ds.field("year") > start_year) | ((ds.field("year") == start_year) & (ds.field("month") > start_month)) |
           ((ds.field("year") == start_year) & (ds.field("month") == start_month) & (ds.field("day") >= start_day))) & \
           ((ds.field("year") < end_year) | ((ds.field("year") == end_year) & (ds.field("month") < end_month)) |
           ((ds.field("year") == end_year) & (ds.field("month") == end_month) & (ds.field("day") <= end_day)))

I also have the following hack, which circumvents lexicographical ordering by treating the dates as one integer each and constructing a derived field.

def get_partition_filter_new(start_time: pd.Timestamp,
                             end_time: pd.Timestamp) -> pc.Expression:
    # Construct ints from dates
    start_date_int, end_date_int = (
        int(time.strftime('%Y%m%d')) for time in (start_time, end_time)
    )

    # Construct the derived field
    derived_field = (
        ds.field("year")*10**4 +
        ds.field("month")*10**2 +
        ds.field("day")
    )

    # Construct the filter expression
    return (derived_field >= start_date_int) & (derived_field <= end_date_int)

Is there a native way to do something like this in pyarrow? To rephrase the question, I am looking for something like this (pseudo-code):

# PSEUDO-CODE
(start_year_int, start_month_int, start_day_int) <= (ds.field("year"), ds.field("month"), ds.field("day")) <= (end_year_int, end_month_int, end_day_int) 


from Construct filtering expression based on lexicographical ordering

No comments:

Post a Comment