
[SQL] SQL to Retrieve, for all Rows, the Sum of Values of Previous Rows By Time Series

Discussion in 'Other Languages' started by Stack, October 5, 2024 at 18:12.

  1. Stack

    Stack Active Member

    My SQL is rusty. I have a DataFrame/table like this (partially shown, and this is just a sample, not the real data):

    +--------------------+-----+---+
    |           timestamp|value| id|
    +--------------------+-----+---+
    |2024-10-05 20:38:...|   67|  0|
    |2024-10-05 19:38:...|   14|  1|
    |2024-10-05 18:38:...|   80|  2|
    |2024-10-05 17:38:...|    6|  3|
    +--------------------+-----+---+


    What I want to do, in plain English, is this: take the id in row 0. I want a result set in which, for that id, I get the sum of column value over all rows whose timestamp falls within the 3 hours before the timestamp in row 0, including the value for row 0 itself. Then I want to do the same for every row, 0 through n (where n could be large, on the order of hundreds of millions of rows).

    So, my output would look like this (truncated a bit):

    +--------------------+-----+---+
    |           timestamp|  sum| id|
    +--------------------+-----+---+
    |2024-10-05 20:38:...|  167|  0| /* this result is the `value` of the id in row 0, plus the previous three hours of data */
    |2024-10-05 19:38:...|  100|  1| /* this would be more than 100 with the full dataset, etc. */
    |2024-10-05 18:38:...|   86|  2|
    +--------------------+-----+---+ /* etc. */


    I'm sure this question or a variant has been asked, but I've done a ton of research and I can't seem to find this. To be more specific, I'm working with Spark DataFrames, but vanilla SQL is also fine.
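
For illustration, the computation described above can be sketched in vanilla SQL as a range-based window frame. This is only a sketch under assumptions: it runs the SQL through Python's built-in `sqlite3` module (which needs a reasonably recent SQLite for `RANGE ... PRECEDING`), the table name `readings` and column name `ts` are made up, and the seconds in the timestamps are filled in hypothetically because the sample above truncates them.

```python
import sqlite3

# Hypothetical sample rows mirroring the table in the question. The original
# timestamps are truncated, so the ":00" seconds here are an assumption.
rows = [
    ("2024-10-05 20:38:00", 67, 0),
    ("2024-10-05 19:38:00", 14, 1),
    ("2024-10-05 18:38:00", 80, 2),
    ("2024-10-05 17:38:00", 6, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, value INTEGER, id INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# A RANGE frame with a numeric offset needs a numeric ORDER BY key, so the
# timestamp is converted to epoch seconds. The frame then covers the current
# row plus every row whose timestamp lies within the preceding 3 hours
# (10800 seconds), with both ends inclusive.
query = """
SELECT ts,
       SUM(value) OVER (
           ORDER BY CAST(strftime('%s', ts) AS INTEGER)
           RANGE BETWEEN 10800 PRECEDING AND CURRENT ROW
       ) AS rolling_sum,
       id
FROM readings
ORDER BY ts DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On the four sample rows this gives 167, 100 and 86 for ids 0, 1 and 2, matching the expected output above. In Spark, a comparable frame is commonly written by ordering on the timestamp cast to `long` and using `rangeBetween(-10800, 0)` on the window spec; that variant is not tested here, and a window without a partition key may not scale well to hundreds of millions of rows.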
