
[Python] How to efficiently create an index-like Polars DataFrame from multiple sparse series?

Discussion in 'Python' started by Stack, September 28, 2024 at 01:22.


    I would like to create a DataFrame that has an integer "index" from a number of (sparse) Series, where the index (or primary key) is NOT necessarily a run of consecutive integers. Each Series is like a vector of (index, value) tuples, or an {index: value} mapping.

    (1) A small example


    In Pandas this is very easy, as we can create the whole DataFrame in one go, like

    >>> pd.DataFrame({
    ...     "A": {0: 'a', 20: 'b', 40: 'c'},
    ...     "B": {10: 'd', 20: 'e', 30: 'f'},
    ...     "C": {20: 'g', 30: 'h'},
    ... }).sort_index()

          A    B    C
    0     a  NaN  NaN
    10  NaN    d  NaN
    20    b    e    g
    30  NaN    f    h
    40    c  NaN  NaN


    but I can't find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars does not use an index, unlike Pandas: each row is indexed by its integer position in the table. So I might need to represent an "indexed" Series as a 2-column DataFrame:

    A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
    B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
    C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })


    I tried to combine these multiple DataFrames, joining on the index column:

    >>> A.join(B, on='index', how='full', coalesce=True).join(C, on='index', how='full', coalesce=True).sort(by='index')

    shape: (5, 4)
    ┌───────┬──────┬──────┬──────┐
    │ index ┆ A    ┆ B    ┆ C    │
    │ ---   ┆ ---  ┆ ---  ┆ ---  │
    │ i64   ┆ str  ┆ str  ┆ str  │
    ╞═══════╪══════╪══════╪══════╡
    │ 0     ┆ a    ┆ null ┆ null │
    │ 10    ┆ null ┆ d    ┆ null │
    │ 20    ┆ b    ┆ e    ┆ g    │
    │ 30    ┆ null ┆ f    ┆ h    │
    │ 40    ┆ c    ┆ null ┆ null │
    └───────┴──────┴──────┴──────┘


    This gives the result I want, but I wonder:

    • (i) whether there is a more concise way to do this over many columns, and
    • (ii) how to make this operation as efficient as possible.
    Alternatives?


    I also tried outer joins, as this is one way to combine DataFrames with different numbers of columns and rows, as described above.

    Other alternatives I tried include diagonal concatenation, but this does not deduplicate or join on index:

    >>> pl.concat([A, B, C], how='diagonal')

       index     A     B     C
    0      0     a  None  None
    1     20     b  None  None
    2     40     c  None  None
    3     10  None     d  None
    4     20  None     e  None
    5     30  None     f  None
    6     20  None  None     g
    7     30  None  None     h

    (2) Efficiently Building a Large Table


    The approach I found above gives the results I want, but I feel there must be a better way in terms of performance. Consider a case with larger tables, say 300,000 rows and 20 columns:

    import functools

    import numpy as np
    import pandas as pd
    import polars as pl

    N, C = 300000, 20
    pls = []
    pds = []

    for i in range(C):
        A = pl.DataFrame({
            "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
            f"A{i}": np.arange(N, dtype=np.float32),
        })
        pls.append(A)

        B = A.to_pandas().set_index("index")
        pds.append(B)


    The approach of chaining pairwise joins is slower than I expected:

    %%time
    F = functools.reduce(lambda a, b: a.join(b, on='index', how='full', coalesce=True), pls)
    F.sort(by='index')

    CPU times: user 1.49 s, sys: 97.8 ms, total: 1.59 s
    Wall time: 611 ms


    or than the one-pass creation with pd.DataFrame:

    %%time
    pd.DataFrame({
        f"A{i}": pds[i][f"A{i}"] for i in range(C)
    }).sort_index()

    CPU times: user 230 ms, sys: 50.7 ms, total: 281 ms
    Wall time: 281 ms

