1. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[Python] Combine cross between 2 dataframe efficiently

Discussão em 'Python' iniciado por Stack, Setembro 12, 2024.

  1. Stack

    Stack Membro Participativo

    I am working with 2 datasets. One describes some time windows by their start and stop times. The second one contains a big list of events with their corresponding timestamps.

    I want to combine this into a single dataframe that contains the start and stop time of each window, together with how many events happened during this time window.

    I have managed to "solve" my problem with:

    import polars as pl

    actions = {
    "id": ["a", "a", "a", "a", "b", "b", "a", "a"],
    "action": ["start", "stop", "start", "stop", "start", "stop", "start", "stop"],
    "time": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }

    events = {
    "name": ["x", "x", "x", "y", "y", "z", "w", "w", "w"],
    "time": [0.0, 0.1, 0.5, 1.1, 2.5, 3.0, 4.5, 4.9, 5.5],
    }

    actions_df = (
    pl.DataFrame(actions)
    .group_by("id")
    .agg(
    start=pl.col("time").filter(pl.col("action") == "start"),
    stop=pl.col("time").filter(pl.col("action") == "stop"),
    )
    .explode(["start", "stop"])
    )

    df = (
    actions_df.join(pl.DataFrame(events), how="cross")
    .filter((pl.col("time") >= pl.col("start")) & (pl.col("time") <= pl.col("stop")))
    .group_by(["id", "start", "stop", "name"])
    .agg(count=pl.count("name"))
    .pivot("name", index=["id", "start", "stop"], values="count")
    .fill_null(0)
    )

    result_df = (
    actions_df.join(df, on=["id", "start", "stop"], how="left")
    .fill_null(0)
    .sort("start")
    )

    print(result_df)
    """
    ┌─────┬───────┬──────┬─────┬─────┬─────┬─────┐
    │ id ┆ start ┆ stop ┆ w ┆ y ┆ x ┆ z │
    │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
    │ str ┆ f64 ┆ f64 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
    ╞═════╪═══════╪══════╪═════╪═════╪═════╪═════╡
    │ a ┆ 0.0 ┆ 1.0 ┆ 0 ┆ 0 ┆ 3 ┆ 0 │
    │ a ┆ 2.0 ┆ 3.0 ┆ 0 ┆ 1 ┆ 0 ┆ 1 │
    │ b ┆ 4.0 ┆ 5.0 ┆ 2 ┆ 0 ┆ 0 ┆ 0 │
    │ a ┆ 6.0 ┆ 7.0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
    └─────┴───────┴──────┴─────┴─────┴─────┴─────┘
    """


    My issue is that this approach "explodes" in RAM and my process gets killed. I guess that the join(... how="cross") makes my dataframe huge, just to then ignore most of it again.

    Can I get some help/hints on a better way to solve this?

    To give some orders of magnitude, my "actions" datasets have on the order of 100-500 time windows (~1 MB), and my "events" datasets have on the order of ~10 million (~200 MB). And I am getting my process killed with 16 GB of RAM.

    EDIT In real data, my intervals can be overlapping. Thanks to @RomanPekar for bringing this up.

    Continue reading...

Compartilhe esta Página