
[Python] Filling preallocated pandas.DataFrame memory-efficiently

Discussion in 'Python' started by Stack, November 5, 2024 at 13:22.

  1. Stack

    Stack Participating Member

    I need to append a lot of rows (1 440 000 000) to a pandas.DataFrame.

    I know the number of rows in advance, so I can preallocate it, then fill it with data in a C-like manner.

    So far the best idea I have is pretty ugly:

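    import numpy as np
    import pandas as pd

    # THRESHOLDS, OBJECTS, ANALYSERS and expectedMonteCarlo are defined elsewhere.
    N = 1000000
    # Placeholder list used only to size the columns.
    # Note: the loops below also iterate over ANALYSERS, so this length
    # covers every row only if len(ANALYSERS) == 1.
    sham = [-1] * (N * len(THRESHOLDS) * len(OBJECTS))  # 1440000000
    DATA = pd.DataFrame(
        {
            'threshold': pd.Categorical(sham, categories=THRESHOLDS, ordered=True),
            'expected': pd.Series(sham, dtype=np.float16),
            'iteration': pd.Series(sham, dtype=np.int32),
            'analyser': pd.Categorical(sham, categories=ANALYSERS),
            'object': pd.Categorical(sham, categories=OBJECTS),
        },
        columns=['threshold', 'expected', 'iteration', 'analyser', 'object'])
    # Fill the preallocated frame one row at a time.
    ptr = 0
    for t in THRESHOLDS:
        for o in OBJECTS:
            for a in ANALYSERS:
                for i in range(N):
                    DATA.iloc[ptr] = t, expectedMonteCarlo(o, a, t), i, a, o
                    ptr += 1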


    How can I make this code cleaner? In particular, how can I:

    • preallocate DATA without inflating it with a sham list,
    • append rows to the preallocated DATA without using an index?
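
One possible direction for both points, offered as a rough sketch rather than a definitive answer: preallocate one NumPy array per column with np.empty, fill it in blocks of N rows, and build the DataFrame once at the end. The categorical columns are stored as integer codes and wrapped with pd.Categorical.from_codes. The small THRESHOLDS, OBJECTS and ANALYSERS lists, the expected_batch helper (a vectorised stand-in for expectedMonteCarlo) and the int8 code dtype are all assumptions made for illustration; int8 only works with at most 127 categories per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's globals (tiny sizes for illustration).
THRESHOLDS = [0.1, 0.5, 0.9]
OBJECTS = ["cube", "sphere"]
ANALYSERS = ["fast", "exact"]
N = 4  # iterations per (threshold, object, analyser) combination


def expected_batch(obj, analyser, threshold, n, rng):
    """Hypothetical vectorised stand-in for expectedMonteCarlo: n values at once."""
    return rng.random(n) * threshold


def build(thresholds, objects, analysers, n, seed=0):
    total = len(thresholds) * len(objects) * len(analysers) * n
    rng = np.random.default_rng(seed)

    # Uninitialised buffers: no sham list, one small integer code per categorical cell.
    thr_codes = np.empty(total, dtype=np.int8)
    obj_codes = np.empty(total, dtype=np.int8)
    ana_codes = np.empty(total, dtype=np.int8)
    expected = np.empty(total, dtype=np.float16)
    iteration = np.empty(total, dtype=np.int32)

    ptr = 0
    for ti, t in enumerate(thresholds):
        for oi, o in enumerate(objects):
            for ai, a in enumerate(analysers):
                block = slice(ptr, ptr + n)  # n rows written per step
                thr_codes[block] = ti
                obj_codes[block] = oi
                ana_codes[block] = ai
                expected[block] = expected_batch(o, a, t, n, rng)
                iteration[block] = np.arange(n, dtype=np.int32)
                ptr += n

    # Wrap the code arrays as categoricals and assemble the frame once.
    return pd.DataFrame({
        "threshold": pd.Categorical.from_codes(thr_codes, categories=thresholds, ordered=True),
        "expected": expected,
        "iteration": iteration,
        "analyser": pd.Categorical.from_codes(ana_codes, categories=analysers),
        "object": pd.Categorical.from_codes(obj_codes, categories=objects),
    })


DATA = build(THRESHOLDS, OBJECTS, ANALYSERS, N)
```

Writing whole slices replaces the per-row DATA.iloc[ptr] assignment, so the innermost Python loop disappears. The DataFrame constructor may copy the arrays, so peak memory can exceed the final size; the constructor's copy=False argument may help, but that is worth measuring rather than assuming.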

    The main concern is memory efficiency. Otherwise I would append the records to a list and then convert it to a pandas.DataFrame.
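
To put the memory concern in numbers, here is a back-of-the-envelope estimate. It assumes each categorical column is stored as int8 codes (true only with at most 127 categories per column, which the question does not state) and that the sham list lives on a 64-bit CPython build, where every list slot is an 8-byte pointer.

```python
import numpy as np

ROWS = 1_440_000_000  # row count stated in the question

# Assumed per-column storage; the int8 code dtype is an assumption.
column_dtypes = {
    "threshold": np.int8,    # categorical codes (assumption)
    "expected": np.float16,
    "iteration": np.int32,
    "analyser": np.int8,     # categorical codes (assumption)
    "object": np.int8,       # categorical codes (assumption)
}

bytes_per_row = sum(np.dtype(dt).itemsize for dt in column_dtypes.values())
frame_bytes = ROWS * bytes_per_row

# The sham list stores one pointer per element, before any column is built.
sham_list_bytes = ROWS * 8

print(bytes_per_row)    # 9
print(frame_bytes)      # 12960000000
print(sham_list_bytes)  # 11520000000
```

Under these assumptions the column data alone is about 13 GB, and the sham list adds roughly 11.5 GB of pointers on top of it, which is why avoiding the list matters.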

