
[Python] Performance-optimal way to serialise Python objects with large Pandas DataFrames

Discussion in 'Python' started by Stack, September 28, 2024 at 13:42.


    I am dealing with Python objects containing Pandas DataFrame and Series objects. These can be large, with several million rows.

    E.g.


    from dataclasses import dataclass

    import pandas as pd


    @dataclass
    class MyWorld:
        # A lot of DataFrames with millions of rows
        samples: pd.DataFrame
        addresses: pd.DataFrame
        # etc.


    I need to cache these objects, and I am hoping to find an efficient and painless way to serialise them instead of standard pickle.dump(). Are there any specialised Python serialisers for such objects that would pickle the Series data with an efficient codec and compression automatically? Alternatively, I could hand-construct several Parquet files, but that requires a lot more manual code, and I'd rather avoid it if possible.
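    For reference, a minimal sketch of the per-field Parquet approach, assuming all of the heavy attributes are DataFrames, that pyarrow is installed, and using hypothetical helper names dump_world()/load_world(), could iterate over the dataclass fields automatically, which keeps the manual code fairly small:


    from dataclasses import fields
    from pathlib import Path

    import pandas as pd


    def dump_world(world: "MyWorld", directory: Path) -> None:
        # Write each DataFrame attribute to its own compressed Parquet file.
        directory.mkdir(parents=True, exist_ok=True)
        for field in fields(world):
            value = getattr(world, field.name)
            if isinstance(value, pd.DataFrame):
                value.to_parquet(directory / f"{field.name}.parquet", compression="zstd")


    def load_world(directory: Path) -> "MyWorld":
        # Rebuild the dataclass by reading every field back from its Parquet file.
        kwargs = {
            f.name: pd.read_parquet(directory / f"{f.name}.parquet")
            for f in fields(MyWorld)
        }
        return MyWorld(**kwargs)


    Usage would then be a single call such as dump_world(world, Path("cache/world")), with the per-column compression and columnar layout handled by Parquet itself.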

    Performance here may mean:

    • Speed
    • File size (which can be related, as you then need to read less from disk/network)

    I am aware of joblib.dump(), which does some magic for this kind of object, but based on the documentation I am not sure whether it is still relevant.
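    As a point of comparison, joblib does still expose built-in compression through the compress argument of joblib.dump(); a hedged sketch (the file name here is just a placeholder) might look like:


    import joblib

    # Persist the whole object with gzip compression at level 3;
    # joblib stores large NumPy/Pandas buffers in a binary layout rather
    # than pickling them byte by byte.
    joblib.dump(world, "myworld.joblib.gz", compress=("gzip", 3))

    # Load it back later.
    world = joblib.load("myworld.joblib.gz")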

