
[Python] collect duckdb query into two dataframes?

Discussion in 'Python' started by Stack, October 1, 2024 at 10:02.

  1. Stack

    Stack, Participating Member

    Say I have a CSV file with:

    date,value
    2020-01-01,1
    2020-01-02,4
    2020-01-03,5
    2020-01-04,9
    2020-01-05,2


    I would like to read it with DuckDB, do some preprocessing, and ultimately end up with a training set and a validation set as Polars dataframes.

    I could do:

    import duckdb

    # qualify filters after the window is computed, so the rolling average still sees earlier rows
    train = duckdb.sql("""
    select *, avg(value) over (order by date rows between 2 preceding and current row)
    from read_csv('my_data.csv') qualify date < make_date(2020,1,4)
    """).pl()
    val = duckdb.sql("""
    select *, avg(value) over (order by date rows between 2 preceding and current row)
    from read_csv('my_data.csv') qualify date >= make_date(2020,1,4)
    """).pl()


    and this works, but doesn't it risk reading the CSV and computing the window average twice?

    Is there a way to materialize two dataframes at once without computing everything twice? Or should I just run the query once and split the result in Polars, like this?

    import duckdb
    import polars as pl
    from datetime import date

    data = duckdb.sql("select *, avg(value) over (order by date rows between 2 preceding and current row) from read_csv('my_data.csv')").pl()
    train = data.filter(pl.col('date') < date(2020, 1, 4))
    val = data.filter(pl.col('date') >= date(2020, 1, 4))
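
To make the intended result concrete, here is a minimal plain-Python sketch of the "compute once, then split" idea. It uses only the standard library and is a stand-in for the DuckDB query, not a replacement for it. The sample rows are copied from the CSV above; the names `load_rows`, `add_rolling_mean`, and `split_at` are hypothetical helpers introduced for illustration.

```python
import csv
import io
from datetime import date

# Sample rows copied from the question; a real run would read my_data.csv instead.
CSV_TEXT = """date,value
2020-01-01,1
2020-01-02,4
2020-01-03,5
2020-01-04,9
2020-01-05,2
"""


def load_rows(text):
    """Parse the CSV text into (date, value) tuples sorted by date."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [(date.fromisoformat(r["date"]), int(r["value"])) for r in reader]
    return sorted(rows)


def add_rolling_mean(rows, window=3):
    """Append the mean of the current value and up to window - 1 preceding values."""
    out = []
    for i, (day, value) in enumerate(rows):
        frame = [v for _, v in rows[max(0, i - window + 1): i + 1]]
        out.append((day, value, sum(frame) / len(frame)))
    return out


def split_at(rows, cutoff):
    """Split already-computed rows into (before cutoff, on or after cutoff)."""
    train = [r for r in rows if r[0] < cutoff]
    val = [r for r in rows if r[0] >= cutoff]
    return train, val


# Compute the rolling mean once over the full series, then split the result.
data = add_rolling_mean(load_rows(CSV_TEXT))
train, val = split_at(data, date(2020, 1, 4))
```

Note that the first validation row (2020-01-04) gets an average of 6.0, computed from the values 4, 5, and 9, two of which belong to training rows. This is why the split has to happen after the window is computed: a `where` clause would filter rows before the window runs and change the validation averages, whereas `qualify` or a post-hoc filter does not.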
