
[Python] How to accelerate getting points within distance using two DataFrames?

Discussion in 'Python' started by Stack, October 8, 2024.


    I have two DataFrames (df and locations_df), and both have longitude and latitude values. I'm trying to find the df's points within 2 km of each row of locations_df.

    I tried to vectorize the function, but it is still slow when locations_df is large (more than 1,000 rows). Any ideas on how to speed it up?

    import pandas as pd
    import numpy as np

    def select_points_for_multiple_locations_vectorized(df, locations_df, radius_km):
        R = 6371  # Earth's radius in kilometers

        # Convert degrees to radians; the df arrays become column vectors
        # so that they broadcast against the location arrays
        df_lat_rad = np.radians(df['latitude'].values)[:, np.newaxis]
        df_lon_rad = np.radians(df['longitude'].values)[:, np.newaxis]
        loc_lat_rad = np.radians(locations_df['lat'].values)
        loc_lon_rad = np.radians(locations_df['lon'].values)

        # Haversine formula (vectorized): builds a len(df) x len(locations_df) matrix
        dlat = df_lat_rad - loc_lat_rad
        dlon = df_lon_rad - loc_lon_rad
        a = np.sin(dlat/2)**2 + np.cos(df_lat_rad) * np.cos(loc_lat_rad) * np.sin(dlon/2)**2
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
        distances = R * c

        # Create a mask for points within the radius
        mask = distances <= radius_km

        # Get (df row, location row) index pairs of True values in the mask
        indices = np.where(mask)

        result = pd.concat(
            [df.iloc[indices[0]].reset_index(drop=True),
             locations_df.iloc[indices[1]].reset_index(drop=True)],
            axis=1)

        return result

    def random_lat_lon(n=1, lat_min=-10., lat_max=10., lon_min=-5., lon_max=5.):
        """
        Return an (n, 2) array of random (lat, lon) pairs.
        """
        lat = np.random.uniform(lat_min, lat_max, n)
        lon = np.random.uniform(lon_min, lon_max, n)

        # column_stack builds the same (n, 2) array without a Python-level zip
        return np.column_stack((lat, lon))

    df = pd.DataFrame(random_lat_lon(n=10000000), columns=['latitude', 'longitude'])
    locations_df = pd.DataFrame(random_lat_lon(n=20), columns=['lat', 'lon'])

    result = select_points_for_multiple_locations_vectorized(df, locations_df, radius_km=2)
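One possible way to cut the work, given only as a sketch and not as a definitive answer: the function above builds a full len(df) x len(locations_df) distance matrix, so its cost and memory grow with the product of the two sizes. A point within radius_km of a location cannot differ from it in latitude by more than radius_km / R radians, so sorting df by latitude once lets each location look at a narrow slice only. The name select_points_lat_band is hypothetical, and the code assumes the column names from the question and a spherical Earth of radius 6371 km. A spatial index (a k-d tree or ball tree) is another common option and is not shown here.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # same spherical-Earth radius as in the question


def select_points_lat_band(df, locations_df, radius_km):
    """Hypothetical variant: latitude-band prefilter, then exact haversine."""
    lat = df['latitude'].to_numpy()
    lon = df['longitude'].to_numpy()

    # Sort points by latitude once so every band is a contiguous slice.
    order = np.argsort(lat, kind='stable')
    lat_rad = np.radians(lat[order])
    lon_rad = np.radians(lon[order])

    loc_lat_rad = np.radians(locations_df['lat'].to_numpy())
    loc_lon_rad = np.radians(locations_df['lon'].to_numpy())

    # The latitude difference never exceeds the central angle radius_km / R.
    band = radius_km / EARTH_RADIUS_KM
    lo = np.searchsorted(lat_rad, loc_lat_rad - band, side='left')
    hi = np.searchsorted(lat_rad, loc_lat_rad + band, side='right')

    df_idx, loc_idx = [], []
    for j in range(len(locations_df)):
        cand_lat = lat_rad[lo[j]:hi[j]]
        cand_lon = lon_rad[lo[j]:hi[j]]
        # Exact haversine distance, but only for the candidates in the band.
        dlat = cand_lat - loc_lat_rad[j]
        dlon = cand_lon - loc_lon_rad[j]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(cand_lat) * np.cos(loc_lat_rad[j]) * np.sin(dlon / 2) ** 2)
        dist = 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
        hits = np.nonzero(dist <= radius_km)[0]
        # Map slice positions back to positions in the original df.
        df_idx.append(order[lo[j] + hits])
        loc_idx.append(np.full(hits.size, j, dtype=np.intp))

    empty = np.empty(0, dtype=np.intp)
    df_idx = np.concatenate(df_idx) if df_idx else empty
    loc_idx = np.concatenate(loc_idx) if loc_idx else empty

    return pd.concat(
        [df.iloc[df_idx].reset_index(drop=True),
         locations_df.iloc[loc_idx].reset_index(drop=True)],
        axis=1)


# Small demo with made-up sizes (smaller than in the question).
rng = np.random.default_rng(0)
df = pd.DataFrame({'latitude': rng.uniform(-10, 10, 200_000),
                   'longitude': rng.uniform(-5, 5, 200_000)})
locations_df = pd.DataFrame({'lat': rng.uniform(-10, 10, 50),
                             'lon': rng.uniform(-5, 5, 50)})
result = select_points_lat_band(df, locations_df, radius_km=2)
```

The result holds the same (point, location) pairs as the broadcast version, but its rows are ordered by location and then by latitude, not by df row. Longitude is not prefiltered here, so each location still scans every point in its latitude band. How much this helps depends on how wide the data are spread in latitude relative to the radius.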

