1. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[Python] How can I avoid using pl.DataFrame.iter_rows() and instead vectorize this

Discussão em 'Python' iniciado por Stack, Setembro 12, 2024.

  1. Stack

    Stack Membro Participativo

    I have two polars dataframes containing a unique ID and the name of a utility. I am trying to build a mapping of entries between these two dataframes. I am using polars_fuzzy_match to do a fuzzy string search against entries. My first dataframe (wg_df) is approximately a subset of the second (eia_df). In my code below I am passing each utility_name from wg_df into fuzzy_match_score run against the eia_utility_name. Can I avoid the rowise iteration and vectorize this?

    import polars as pl
    from polars_fuzzy_match import fuzzy_match_score

    # Sample data
    # wg_df is approximately a subset of eia_df.
    wg_df = pl.DataFrame({"wg_id": [1, 2], "utility_name": ["Utility A", "Utility B"]})

    eia_df = pl.DataFrame(
    {"eia_id": [101, 102, 103], "utility_name": ["Utility A co.", "Utility B", "utility c"]}
    )

    out = pl.DataFrame(
    schema=[
    ("wg_id", pl.Int64),
    ("eia_id", pl.Int64),
    ("wg_utility_name", pl.String),
    ("utility_name", pl.String),
    ("score", pl.UInt32),
    ],
    )

    # Iterate through each wg utility and find the best match in eia
    # can this be vectorized?
    for wg_id, utility in wg_df.iter_rows():
    res = (
    eia_df.with_columns(score=fuzzy_match_score(pl.col("utility_name"), utility))
    .filter(pl.col("score").is_not_null())
    .sort(by="score", descending=True)
    )
    # insert the wg_id and wg_utility_name into the results. They have to be put into the
    res.insert_column(
    0,
    pl.Series("wg_id", [wg_id] * len(res)),
    )
    res.insert_column(2, pl.Series("wg_utility_name", [utility] * len(res)))
    out = out.vstack(res.select([col_name for col_name in out.schema]))

    Continue reading...

Compartilhe esta Página