
[Python] Performance-optimal way to serialise Python objects with large Pandas DataFrames

Discussion in 'Python' started by Stack, September 28, 2024 at 13:42.


    I am dealing with Python objects containing Pandas DataFrame and Series objects. These can be large, with several million rows.

    E.g.


    from dataclasses import dataclass

    import pandas as pd


    @dataclass
    class MyWorld:
        # A lot of DataFrames with millions of rows
        samples: pd.DataFrame
        addresses: pd.DataFrame
        # etc.


    I need to cache these objects, and I am hoping to find an efficient and painless way to serialise them instead of standard pickle.dump(). Are there any specialised Python serialisers for such objects that would pickle the Series data with an efficient codec and compression automatically? Alternatively, I could hand-construct several Parquet files, but that requires a lot more manual code, and I'd rather avoid it if possible.
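    For reference, a minimal sketch of the per-field Parquet approach, assuming all of the heavy attributes are DataFrames, that pyarrow is installed, and using hypothetical helper names dump_world()/load_world(), could iterate over the dataclass fields automatically, which keeps the manual code fairly small:


    from dataclasses import fields
    from pathlib import Path

    import pandas as pd


    def dump_world(world: "MyWorld", directory: Path) -> None:
        # Write each DataFrame attribute to its own compressed Parquet file.
        directory.mkdir(parents=True, exist_ok=True)
        for field in fields(world):
            value = getattr(world, field.name)
            if isinstance(value, pd.DataFrame):
                value.to_parquet(directory / f"{field.name}.parquet", compression="zstd")


    def load_world(directory: Path) -> "MyWorld":
        # Rebuild the dataclass by reading every field back from its Parquet file.
        kwargs = {
            f.name: pd.read_parquet(directory / f"{f.name}.parquet")
            for f in fields(MyWorld)
        }
        return MyWorld(**kwargs)


    Usage would then be a single call such as dump_world(world, Path("cache/world")), with the per-column compression and columnar layout handled by Parquet itself.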

    Performance here may mean:

    • Speed
    • File size (which can be related, as you then need to read less from disk/network)

    I am aware of joblib.dump(), which does some magic for this kind of object, but based on the documentation I am not sure whether it is still relevant.
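    As a point of comparison, joblib does still expose built-in compression through the compress argument of joblib.dump(); a hedged sketch (the file name here is just a placeholder) might look like:


    import joblib

    # Persist the whole object with gzip compression at level 3;
    # joblib stores large NumPy/Pandas buffers in a binary layout rather
    # than pickling them byte by byte.
    joblib.dump(world, "myworld.joblib.gz", compress=("gzip", 3))

    # Load it back later.
    world = joblib.load("myworld.joblib.gz")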

