[SQL] Redshift query duplicates

Discussion in 'Outras Linguagens' started by Stack, November 7, 2024 at 14:52.

  1. Stack

    Stack, Participating Member

    I'm using Python with redshift_connector and analysing the data with pandas. When I query a Redshift database and select n columns, I get i rows. However, when I added one more column to the query, it timed out after an hour. As a workaround, I selected the n+1 columns and used LIMIT and OFFSET iteratively to fetch every row. After a while this returned i rows, but something did not add up: when I compared the two results, the paged one contained a few duplicate rows. How can I write the query so that it neither times out nor returns duplicates?

    Original mock query that won't time out:

    SELECT a, b, c
    FROM table
    WHERE attribute IN ('attribute1','attribute2')


    Query that times out:

    SELECT a, b, c, d
    FROM table
    WHERE attribute IN ('attribute1','attribute2')


    If I run the second query in a while True loop, adding LIMIT and OFFSET, read each page with pd.read_sql(query, connection), append it to a list of DataFrames, and concatenate the list at the end, I get exactly as many rows as from the first query, but with duplicates.
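
    A possible explanation for the duplicates: SQL does not guarantee any row order without ORDER BY, so separate LIMIT/OFFSET queries can return overlapping pages. The sketch below shows the paging loop with a stable ordering added. It is only an illustration under assumptions: the table name events, the unique id column, and the helper names are hypothetical, and an in-memory SQLite database stands in for Redshift so the example runs on its own. It uses a plain DB-API cursor rather than pd.read_sql.

```python
import sqlite3


def fetch_in_pages(connection, base_query, order_by, page_size=1000):
    """Fetch all rows of base_query page by page, using a stable ORDER BY."""
    rows = []
    offset = 0
    cursor = connection.cursor()
    while True:
        # Ordering by a unique key keeps the row order identical for every
        # page; without it, pages may overlap or skip rows.
        query = (
            f"{base_query} ORDER BY {order_by} "
            f"LIMIT {int(page_size)} OFFSET {int(offset)}"
        )
        cursor.execute(query)
        page = cursor.fetchall()
        if not page:
            break  # an empty page means every row has been read
        rows.extend(page)
        offset += page_size
    return rows


def make_demo_connection():
    """Build an in-memory SQLite table as a stand-in for the Redshift table."""
    connection = sqlite3.connect(":memory:")
    # "events" and "id" are hypothetical names, not taken from the post.
    connection.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, a, b, c, d, attribute)"
    )
    connection.executemany(
        "INSERT INTO events (a, b, c, d, attribute) VALUES (?, ?, ?, ?, ?)",
        [(i, i * 2, i * 3, i * 4, f"attribute{i % 3}") for i in range(30)],
    )
    return connection


connection = make_demo_connection()
rows = fetch_in_pages(
    connection,
    "SELECT id, a, b, c, d FROM events "
    "WHERE attribute IN ('attribute1', 'attribute2')",
    order_by="id",
    page_size=7,
)
# Compare the total row count with the number of distinct rows.
print(len(rows), len(set(rows)))
```

    With redshift_connector the same loop should apply to its connection object, assuming the table has a column (or combination of columns) that is unique per row to order by. If no such key exists, the order of tied rows is still undefined and duplicates can reappear.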
