I am working with two datasets. One describes time windows by their start and stop times; the other contains a large list of events with their timestamps. I want to combine them into a single dataframe that holds the start and stop time of each window together with how many events of each name happened during that window. I have managed to "solve" my problem with:

```python
import polars as pl

actions = {
    "id": ["a", "a", "a", "a", "b", "b", "a", "a"],
    "action": ["start", "stop", "start", "stop", "start", "stop", "start", "stop"],
    "time": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
}
events = {
    "name": ["x", "x", "x", "y", "y", "z", "w", "w", "w"],
    "time": [0.0, 0.1, 0.5, 1.1, 2.5, 3.0, 4.5, 4.9, 5.5],
}

# Pair each "start" with its matching "stop" per id to build the windows.
actions_df = (
    pl.DataFrame(actions)
    .group_by("id")
    .agg(
        start=pl.col("time").filter(pl.col("action") == "start"),
        stop=pl.col("time").filter(pl.col("action") == "stop"),
    )
    .explode(["start", "stop"])
)

# Cross join every window with every event, keep only the events that fall
# inside the window, then count per event name and pivot to wide form.
df = (
    actions_df.join(pl.DataFrame(events), how="cross")
    .filter((pl.col("time") >= pl.col("start")) & (pl.col("time") <= pl.col("stop")))
    .group_by(["id", "start", "stop", "name"])
    .agg(count=pl.count("name"))
    .pivot("name", index=["id", "start", "stop"], values="count")
    .fill_null(0)
)

# Re-attach windows that contained no events at all.
result_df = (
    actions_df.join(df, on=["id", "start", "stop"], how="left")
    .fill_null(0)
    .sort("start")
)
print(result_df)
"""
┌─────┬───────┬──────┬─────┬─────┬─────┬─────┐
│ id  ┆ start ┆ stop ┆ w   ┆ y   ┆ x   ┆ z   │
│ --- ┆ ---   ┆ ---  ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64   ┆ f64  ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═══════╪══════╪═════╪═════╪═════╪═════╡
│ a   ┆ 0.0   ┆ 1.0  ┆ 0   ┆ 0   ┆ 3   ┆ 0   │
│ a   ┆ 2.0   ┆ 3.0  ┆ 0   ┆ 1   ┆ 0   ┆ 1   │
│ b   ┆ 4.0   ┆ 5.0  ┆ 2   ┆ 0   ┆ 0   ┆ 0   │
│ a   ┆ 6.0   ┆ 7.0  ┆ 0   ┆ 0   ┆ 0   ┆ 0   │
└─────┴───────┴──────┴─────┴─────┴─────┴─────┘
"""
```

My issue is that this approach "explodes" in RAM and my process gets killed. I guess the `join(..., how="cross")` makes the intermediate dataframe huge, only to throw most of it away again in the filter. Can I get some help/hints on a better way to solve this?

To give some orders of magnitude: my "actions" datasets have on the order of 100-500 time windows (~1 MB), and my "events" datasets have on the order of 10 million events (~200 MB). My process gets killed even with 16 GB of RAM.

EDIT: In my real data, the intervals can overlap. Thanks to @RomanPekar for bringing this up.
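For completeness, here is a minimal sketch of one direction that might avoid the cross product, assuming Polars >= 1.7 where `DataFrame.join_where` (an inequality join) is available. It builds on `actions_df` and `events` from the snippet above; `events_df` and `counts` are just placeholder names, and I cannot say for sure whether it stays within memory at ~10 million events:

```python
# Sketch only, assuming Polars >= 1.7 (DataFrame.join_where is available).
# The interval predicates are applied by the join itself, so the full
# window x event cross product should not need to be materialized as an
# intermediate DataFrame before filtering.
import polars as pl

events_df = pl.DataFrame(events)  # placeholder name for the events frame

counts = (
    actions_df.join_where(
        events_df,
        pl.col("time") >= pl.col("start"),  # event at or after the window start
        pl.col("time") <= pl.col("stop"),   # ... and at or before the window stop
    )
    .group_by(["id", "start", "stop", "name"])
    .agg(count=pl.len())  # events of each name inside each window
    .pivot("name", index=["id", "start", "stop"], values="count")
    .fill_null(0)
)

result_df = (
    actions_df.join(counts, on=["id", "start", "stop"], how="left")
    .fill_null(0)  # windows that contained no events at all
    .sort("start")
)
print(result_df)
```

Overlapping windows should not be a problem for a join-based approach like this: an event that falls inside several windows simply matches, and is counted in, each of them.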