
[Python] What to return in a preprocessing function for optimal performance

Discussion in 'Python' started by Stack, October 6, 2024 at 19:52.

  1. Stack

    Stack Participating Member

    My starting point is a large polars LazyFrame.

    I have written a function, def preprocessing(df: pl.LazyFrame), that has to take in a pl.LazyFrame (unfortunately, it cannot be a pl.Expr).
    The function performs a long chain of operations on different columns of the LazyFrame. In the end, its output is only one final column, which is to be appended to the incoming LazyFrame.

    What would be the best return type so that the query planner can integrate the result into the starting frame? Should I just call .collect().to_series() and append the result via .with_columns()? That would require a collect and thus sacrifice potential for optimization. Is there a way around collecting and returning a Series?

    Unfortunately, I cannot share the code. However, here is pseudo/mock-up code:

    import polars as pl


    def preprocessing(
        df: pl.LazyFrame | pl.DataFrame,
    ) -> pl.Series:  # ? tbd return type
        return (
            df.lazy()
            .with_columns(
                [
                    # Zero-pad the integer form of each column to a fixed width.
                    pl.col("some_col")
                    .cast(pl.Int64)
                    .cast(pl.Utf8)
                    .str.pad_start(5, "0")
                    .alias("some_col_preprocessed"),
                    pl.col("other_col")
                    .cast(pl.Int64)
                    .cast(pl.Utf8)
                    .str.pad_start(7, "0")
                    .alias("other_col_preprocessed"),
                ],
            )
            # ... more steps
            .select(
                # Concatenate the padded columns with a constant suffix.
                pl.concat_str(
                    [
                        pl.col("some_col_preprocessed"),
                        pl.col("other_col_preprocessed"),
                        pl.lit("42133723"),
                    ],
                )
            )
            .collect()  # materializes the query, so the plan ends here
            .to_series()
        )


    # starting_df is the large frame described above
    starting_df = starting_df.with_columns(final_col=preprocessing(starting_df))

    Downsides of returning series


    I just realized a major downside of the current approach (.collect().to_series()): it puts me in danger if one of the steps changes the row order, or if I forget the maintain_order argument somewhere.

    Other Alternatives

    • Finding an ID column and using a join
    • Using the with_columns expression context and just passing in, processing, and returning the complete LazyFrame, without ever calling .collect()
