
[SQL] Fuzzy Join on Venue Names Based on City

Discussion in 'Outras Linguagens' (Other Languages) started by Stack, October 7, 2024 at 07:52.

  1. Stack

    Stack Participating Member

    I am working with PySpark and need to join two datasets based on the city and a fuzzy matching condition on the venue names. The first dataset contains information about stadiums including a unique venue_id, while the second dataset, which I receive periodically, only includes venue names and cities without the venue_id.

    I want to join these datasets to match the venue_name from the incoming data to the existing dataset using fuzzy logic (since the names are not always written identically), and then pull the corresponding venue_id.

    Existing Dataset (df_stadium_information):

    venue_name                 city         venue_id
    Sree Kanteerava Stadium    Bengaluru    1
    Sree Kanteerava Stadium    Kochi        2
    Eden Gardens               Kolkata      3
    Narendra Modi Stadium      Ahmedabad    4

    Incoming Data (df_new_stadium_data):

    venue_name                       city
    Sri Kanteerava Indoor Stadium    Bengaluru
    Eden Gardens                     Kolkata

    Desired Output:

    venue_name                       city         venue_id
    Sri Kanteerava Indoor Stadium    Bengaluru    1
    Eden Gardens                     Kolkata      null

    I want the output to show the venue_id from df_stadium_information if there is a fuzzy match on venue_name and an exact match on city. If there's no fuzzy match, the venue_id should be null.
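The rule described here (exact match on city, fuzzy match on venue_name, null otherwise) can be sketched in plain Python to make the logic concrete before moving it into Spark. This is only an illustration under assumptions: the similarity measure (difflib.SequenceMatcher), the 0.75 cutoff, and the helper names name_similarity and match_venue_ids are hypothetical choices, not part of the original question.

```python
from difflib import SequenceMatcher

# Sample rows copied from the two tables in the question.
stadium_information = [
    ("Sree Kanteerava Stadium", "Bengaluru", 1),
    ("Sree Kanteerava Stadium", "Kochi", 2),
    ("Eden Gardens", "Kolkata", 3),
    ("Narendra Modi Stadium", "Ahmedabad", 4),
]
new_stadium_data = [
    ("Sri Kanteerava Indoor Stadium", "Bengaluru"),
    ("Eden Gardens", "Kolkata"),
]

# Hypothetical cutoff; it would need tuning against real venue names.
SIMILARITY_THRESHOLD = 0.75


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two venue names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_venue_ids(new_rows, known_rows, threshold=SIMILARITY_THRESHOLD):
    """Attach the best-matching venue_id to each new row, or None."""
    # Group known venues by city so names are only compared within a city.
    by_city = {}
    for name, city, venue_id in known_rows:
        by_city.setdefault(city, []).append((name, venue_id))

    results = []
    for name, city in new_rows:
        best_id, best_score = None, threshold
        for known_name, venue_id in by_city.get(city, []):
            score = name_similarity(name, known_name)
            # Keep the highest-scoring candidate at or above the threshold.
            if score >= best_score:
                best_id, best_score = venue_id, score
        results.append((name, city, best_id))
    return results


for row in match_venue_ids(new_stadium_data, stadium_information):
    print(row)
```

With this sketch the Bengaluru row resolves to venue_id 1 and never competes with the Kochi stadium of the same name, because candidates are restricted to the same city. Note that "Eden Gardens" in Kolkata matches the existing row exactly, so under the rule as stated it resolves to venue_id 3 rather than the null shown in the sample output. In PySpark the same shape could plausibly be expressed as a left join on city combined with a string-distance condition, for example one built on pyspark.sql.functions.levenshtein, keeping only the closest candidate per incoming row.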

