[Python] Filling preallocated pandas.DataFrame memory-efficiently

Stack · November 5, 2024 at 13:22

I need to append a lot of rows (1 440 000 000) to a pandas.DataFrame.

I know the number of rows in advance, so I can preallocate it, then fill it with data in a C-like manner.

So far the best idea I have is pretty ugly:

N = 1000000
sham = [-1] * (N * len(THRESHOLDS) * len(OBJECTS))  # 1440000000
DATA = pd.DataFrame(
    {
        'threshold': pd.Categorical(sham, categories=THRESHOLDS, ordered=True),
        'expected': pd.Series(sham, dtype=np.float16),
        'iteration': pd.Series(sham, dtype=np.int32),
        'analyser': pd.Categorical(sham, categories=ANALYSERS),
        'object': pd.Categorical(sham, categories=OBJECTS),
    },
    columns=['threshold', 'expected', 'iteration', 'analyser', 'object'])
ptr = 0
for t in THRESHOLDS:
    for o in OBJECTS:
        for a in ANALYSERS:
            for i in range(N):
                DATA.iloc[ptr] = t, expectedMonteCarlo(o, a, t), i, a, o
                ptr += 1
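
For a sense of scale, here is a rough back-of-the-envelope estimate of the memory involved. It assumes a 64-bit CPython build (8 bytes per list slot) and that each categorical column has few enough categories to be stored as int8 codes. Both are assumptions, not facts stated above.

```python
import numpy as np

n_rows = 1_440_000_000  # row count stated in the question

# Assumed per-column storage: small categoricals as int8 codes,
# the other dtypes as given in the question's code.
column_dtypes = {
    "threshold": np.int8,
    "expected": np.float16,
    "iteration": np.int32,
    "analyser": np.int8,
    "object": np.int8,
}

bytes_per_row = sum(np.dtype(dt).itemsize for dt in column_dtypes.values())
frame_bytes = n_rows * bytes_per_row

# A CPython list stores one pointer per element (8 bytes on a 64-bit build).
sham_list_bytes = n_rows * 8

print(f"{bytes_per_row} bytes/row, about {frame_bytes / 1e9:.2f} GB for the columns")
print(f"about {sham_list_bytes / 1e9:.2f} GB for the placeholder list alone")
```

Under these assumptions the placeholder list costs almost as much memory as the finished columns, which is why avoiding it matters here.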

The question is: how can I make my code cleaner? In particular, how can I:

- preallocate DATA without inflating it with the sham list,
- append rows to the preallocated DATA without using an index?

The main problem is memory efficiency. Otherwise I would simply append records to a list object and then convert it to a pandas.DataFrame.
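
One possible direction, shown only as a minimal sketch and not as a tested solution at full scale: preallocate one NumPy array per column with `np.empty`, fill it block by block, and build the categoricals from integer codes with `pd.Categorical.from_codes`. The tiny `THRESHOLDS`, `OBJECTS`, `ANALYSERS` and the body of `expectedMonteCarlo` below are hypothetical stand-ins, and the total row count is assumed to cover all four loops.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the question's inputs.
THRESHOLDS = [0.1, 0.5, 0.9]
OBJECTS = ["cube", "sphere"]
ANALYSERS = ["fast", "slow"]
N = 4  # the question uses 1_000_000


def expectedMonteCarlo(obj, analyser, threshold):
    # Placeholder for the question's function; returns a dummy value.
    return threshold * len(obj) / len(analyser)


total = N * len(THRESHOLDS) * len(OBJECTS) * len(ANALYSERS)

# One uninitialised array per column; no placeholder list is needed.
# int8 codes assume fewer than ~127 categories per column.
threshold_codes = np.empty(total, dtype=np.int8)
object_codes = np.empty(total, dtype=np.int8)
analyser_codes = np.empty(total, dtype=np.int8)
expected = np.empty(total, dtype=np.float16)
iteration = np.empty(total, dtype=np.int32)

ptr = 0
for ti, t in enumerate(THRESHOLDS):
    for oi, o in enumerate(OBJECTS):
        for ai, a in enumerate(ANALYSERS):
            block = slice(ptr, ptr + N)  # N consecutive rows
            threshold_codes[block] = ti
            object_codes[block] = oi
            analyser_codes[block] = ai
            iteration[block] = np.arange(N, dtype=np.int32)
            expected[block] = [expectedMonteCarlo(o, a, t) for _ in range(N)]
            ptr += N

# copy=False asks pandas not to copy the arrays; whether every copy is
# actually avoided depends on the pandas version.
DATA = pd.DataFrame(
    {
        "threshold": pd.Categorical.from_codes(
            threshold_codes, categories=THRESHOLDS, ordered=True
        ),
        "expected": expected,
        "iteration": iteration,
        "analyser": pd.Categorical.from_codes(analyser_codes, categories=ANALYSERS),
        "object": pd.Categorical.from_codes(object_codes, categories=OBJECTS),
    },
    copy=False,
)
```

Filling by slice keeps the running `ptr` but writes `N` rows per step into plain arrays, so the row-by-row `DATA.iloc[ptr]` assignment is no longer needed.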
