1. This site uses cookies. By continuing to use this site, you are agreeing to our use of cookies. Learn More.

  2. Anuncie Aqui ! Entre em contato fdantas@4each.com.br

[SQL] Performing data validation using hash comparison on snowflake tables

Discussão em 'Outras Linguagens' iniciado por Stack, Novembro 13, 2024.

  1. Stack

    Stack Membro Participativo

    Here is the scenario:

    I have data in two schemas within Snowflake, aiming to validate that both tables contain identical data values. I’ve set up a dynamic SQL approach in Snowflake to accomplish this.

    My code first retrieves the columns from both tables using CTEs, then applies the MD5 hash function to generate hash values for comparison. Finally, I construct a dynamic SQL query intended to display the comparison results.

    However, instead of returning the actual output, the dynamic SQL only displays the SELECT statement itself, so I’m uncertain if my approach is entirely correct.

    Here is my code snippet:

    -- Step 1: Extract the column list for both tables (schema_1 & schema_2)
    WITH column_list_schema_1 AS (
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'schema_1'
    AND TABLE_NAME = 'table_1'
    ),
    column_list_schema_2 AS (
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'schema_2'
    AND TABLE_NAME = 'table_2'
    ),
    -- Step 2: Generate dynamic SQL for comparing values in each row and generating hash values for comparison
    dynamic_sql AS (
    SELECT
    LISTAGG(
    'CASE WHEN h."' || COLUMN_NAME || '" != s."' || COLUMN_NAME || '" THEN ''' || COLUMN_NAME || ''' END AS "' || COLUMN_NAME || '"',
    ', '
    ) WITHIN GROUP (ORDER BY COLUMN_NAME) AS column_comparisons,
    -- Generate MD5 hashes for comparison (on a row level)
    LISTAGG(
    'MD5(CAST(h."' || COLUMN_NAME || '" AS STRING))', ', '
    ) WITHIN GROUP (ORDER BY COLUMN_NAME) AS schema_1_hash,
    LISTAGG(
    'MD5(CAST(s."' || COLUMN_NAME || '" AS STRING))', ', '
    ) WITHIN GROUP (ORDER BY COLUMN_NAME) AS schema_2_hash
    FROM column_list_heroku
    WHERE COLUMN_NAME IN (SELECT COLUMN_NAME FROM column_list_schema_2)
    )
    -- Step 3: Generate the final comparison query string
    SELECT
    'SELECT h.row_id AS h_row_id,
    s.row_id AS s_row_id,
    h.parent_id AS parent_id,
    MD5(CONCAT(' || ds.h_hash || ')) AS h_hash,
    MD5(CONCAT(' || ds.s_hash || ')) AS s_hash,
    CASE WHEN MD5(CONCAT(' || ds.h_hash || ')) = MD5(CONCAT(' || ds.s_hash || '))
    THEN ''Match'' ELSE ''Mismatch'' END AS comparison_result,
    ' || ds.column_comparisons || '
    FROM
    (SELECT *, ROW_NUMBER() OVER () AS row_id, parent_id FROM schema_1.your_table_name) h
    FULL OUTER JOIN
    (SELECT *, ROW_NUMBER() OVER () AS row_id, parent_id FROM schema_2.your_table_name) s
    ON h.row_id = s.row_id
    WHERE MD5(CONCAT(' || ds.h_hash || ')) != MD5(CONCAT(' || ds.s_hash || '));'
    FROM dynamic_sql ds;

    Continue reading...

Compartilhe esta Página