
[Python] How to efficiently create an index-like Polars DataFrame from multiple sparse series?

Discussion in 'Python' started by Stack, September 28, 2024 at 01:22.


    I would like to create a DataFrame that has an integer "index" from a number of (sparse) Series, where the index (or primary key) is NOT necessarily a run of consecutive integers. Each Series is like a vector of (index, value) tuples, or an {index: value} mapping.

    (1) A small example


    In Pandas this is very easy, as we can create the whole DataFrame in one go, like

    >>> pd.DataFrame({
    ...     "A": {0: 'a', 20: 'b', 40: 'c'},
    ...     "B": {10: 'd', 20: 'e', 30: 'f'},
    ...     "C": {20: 'g', 30: 'h'},
    ... }).sort_index()

          A    B    C
    0     a  NaN  NaN
    10  NaN    d  NaN
    20    b    e    g
    30  NaN    f    h
    40    c  NaN  NaN


    but I can't find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars does not use an index, unlike Pandas: each row is indexed by its integer position in the table. So I might need to represent an "indexed" Series as a 2-column DataFrame:

    A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
    B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
    C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })


    I tried to combine these multiple DataFrames, joining on the index column:

    >>> A.join(B, on='index', how='full', coalesce=True).join(C, on='index', how='full', coalesce=True).sort(by='index')

    shape: (5, 4)
    ┌───────┬──────┬──────┬──────┐
    │ index ┆ A    ┆ B    ┆ C    │
    │ ---   ┆ ---  ┆ ---  ┆ ---  │
    │ i64   ┆ str  ┆ str  ┆ str  │
    ╞═══════╪══════╪══════╪══════╡
    │ 0     ┆ a    ┆ null ┆ null │
    │ 10    ┆ null ┆ d    ┆ null │
    │ 20    ┆ b    ┆ e    ┆ g    │
    │ 30    ┆ null ┆ f    ┆ h    │
    │ 40    ┆ c    ┆ null ┆ null │
    └───────┴──────┴──────┴──────┘


    This gives the result I want, but I wonder:

    • (i) whether there is a more concise way to do this over many columns, and
    • (ii) how to make this operation as efficient as possible.
    Alternatives?


    I also tried outer joins, as this is one way to combine DataFrames with different numbers of columns and rows, as described above.

    Other alternatives I tried include diagonal concatenation, but this does not deduplicate or join on index:

    >>> pl.concat([A, B, C], how='diagonal')

       index     A     B     C
    0      0     a  None  None
    1     20     b  None  None
    2     40     c  None  None
    3     10  None     d  None
    4     20  None     e  None
    5     30  None     f  None
    6     20  None  None     g
    7     30  None  None     h

    (2) Efficiently Building a Large Table


    The approach I found above gives the results I want, but I feel there must be a better way in terms of performance. Consider a case with larger tables, say 300,000 rows and 20 columns:

    import functools

    import numpy as np
    import pandas as pd
    import polars as pl

    N, C = 300000, 20
    pls = []
    pds = []

    for i in range(C):
        A = pl.DataFrame({
            "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
            f"A{i}": np.arange(N, dtype=np.float32),
        })
        pls.append(A)

        B = A.to_pandas().set_index("index")
        pds.append(B)


    The approach of chaining pairwise joins is slower than I expected:

    %%time
    F = functools.reduce(lambda a, b: a.join(b, on='index', how='full', coalesce=True), pls)
    F.sort(by='index')

    CPU times: user 1.49 s, sys: 97.8 ms, total: 1.59 s
    Wall time: 611 ms


    or than the one-pass creation with pd.DataFrame:

    %%time
    pd.DataFrame({
        f"A{i}": pds[i][f"A{i}"] for i in range(C)
    }).sort_index()

    CPU times: user 230 ms, sys: 50.7 ms, total: 281 ms
    Wall time: 281 ms

