[SQL] Redshift query duplicates

Stack · November 7, 2024 at 14:52

I'm using Python with redshift_connector and analysing the data with pandas. When I query a Redshift database and select n columns, I get i rows back. However, when I added one more column to the query, it timed out after an hour. As a workaround, I selected the n+1 columns and used LIMIT and OFFSET iteratively to fetch every row. After a while it returned i rows, but something did not add up: when I compared the two results, the paginated one contained a couple of duplicate rows. How can I write the query so that it neither times out nor returns duplicates?

Original mock query that won't time out:

SELECT a, b, c
FROM table
WHERE attribute IN ('attribute1','attribute2')

Mock query that times out:

SELECT a, b, c, d
FROM table
WHERE attribute IN ('attribute1','attribute2')

If I put the second query in a while True loop, append LIMIT and OFFSET to it, fetch each page with pd.read_sql(query, connection), append each page to a list of DataFrames, and concatenate the list at the end, I get exactly the same number of rows as the first query returns, but some of them are duplicates.
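One likely cause of the duplicates: without an ORDER BY, SQL does not guarantee that rows come back in the same order on every execution, so consecutive LIMIT/OFFSET pages can overlap (and skip other rows, which would explain why the total count still matches). A minimal sketch of the paging loop with a deterministic order is below. The function name fetch_all_pages, its parameters, and the default page size are hypothetical, and it assumes the ORDER BY columns uniquely identify a row; if they do not, ties can still be returned in a different order between pages. It uses only the generic DB-API cursor calls (cursor, execute, fetchall) rather than pd.read_sql.

```python
def fetch_all_pages(connection, base_query, order_by, page_size=50_000):
    """Fetch every row of base_query in pages, using a fixed ORDER BY.

    Hypothetical helper: `connection` is any DB-API connection,
    `base_query` is the SELECT without ORDER BY/LIMIT/OFFSET, and
    `order_by` is a column list assumed to identify each row uniquely.
    """
    rows = []
    offset = 0
    cursor = connection.cursor()
    while True:
        # The same ORDER BY on every page keeps the pages from overlapping.
        paged_query = (
            f"{base_query} ORDER BY {order_by} "
            f"LIMIT {int(page_size)} OFFSET {int(offset)}"
        )
        cursor.execute(paged_query)
        page = cursor.fetchall()
        if not page:
            break
        rows.extend(page)
        offset += len(page)
        # A short page means the last rows have been read.
        if len(page) < page_size:
            break
    return rows
```

With the mock query above, a call might look like fetch_all_pages(connection, "SELECT a, b, c, d FROM table WHERE attribute IN ('attribute1','attribute2')", "a, b, c, d"), assuming those four columns together are unique per row. The collected rows can then be passed to pd.DataFrame for analysis. Note that the underlying table must not change while the loop runs, otherwise offsets can still shift between pages.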
