
[Python] How to Serialize and read metadata in Dask Dataframe

Discussion in 'Python' started by Stack, September 28, 2024 at 01:22.

  1. Stack

    Stack Active Member

    I'm trying to pre-compute Dask divisions and categories for certain columns in a Dask DataFrame, then save the data as a new DataFrame and reuse it later.

    The issue I encountered is that when I read the DataFrame back, the divisions and categories are missing.

    This is the snippet:

    import dask.dataframe as dd

    df = dd.read_parquet('filename')
    cat_columns = ['state', ...]
    df = df.categorize(columns=cat_columns)
    df = df.set_index('rn', drop=True, sorted=True, npartitions=None, divisions=None, sort=False)
    df = df.persist(scheduler='single-threaded')
    df.to_parquet('filename2', write_metadata_file=True, write_index=True, compute=True)


    df2 = dd.read_parquet('filename2')

    df2.divisions
    # the above results in all None

    df2.state.cat.categories
    # the above raises: AttributeNotImplementedError: `df.column.cat.categories` with unknown categories is not supported.


    My question is: what am I doing wrong here? Is the problem in saving the DataFrame, in loading it, or both?

