
[Python] How to combine multiple similar datasets into one in Kedro dynamically?

Discussion in 'Python' started by Stack, October 8, 2024.


    I use Kedro to run pipelines for multiple models that each generate a CSV file with the same schema. However, I need to combine the generated CSV files into one and do some post-processing, which will produce my final output. This is what my catalog looks like:

    catalog.yml
    model1.output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/model1/output.csv

    model2.output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/model2/output.csv

    model3.output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/model3/output.csv

    final_output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/final_output.csv
      layer: output



    I can define my node such that each output is called individually and combined, like the following:

    node.py
    def combine(model_1_output, model_2_output, model_3_output):
        ...

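A variadic version of that node keeps the function itself independent of how many models feed it. This is a sketch, not from the original post; the duplicate-dropping step is a hypothetical stand-in for the post-processing:

```python
import pandas as pd


def combine(*model_outputs: pd.DataFrame) -> pd.DataFrame:
    # The per-model outputs share a schema, so they can simply be stacked.
    combined = pd.concat(model_outputs, ignore_index=True)
    # Hypothetical post-processing step: drop exact duplicate rows.
    return combined.drop_duplicates().reset_index(drop=True)
```

Because the function takes `*args`, the same node works whether the pipeline passes it two dataset inputs or ten.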

    However, I would like to do that dynamically: I define a list of valid models in my parameters, and only the datasets of those models' outputs are loaded. This would help when future models are added, or when I need to run only a subset of my models.

    parameters.yml
    valid_models:
    - model1
    - model2


    Is there a way to do this? I've tried both pipeline and node hooks, but I couldn't solve this particular problem with either. It also seems that I can't access the global parameters where my ${data_path} is defined directly in a node, which would let me read the required datasets that way.
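One common pattern is to build the node's `inputs` list from the model list at pipeline-construction time, so the catalog entries stay declarative. The helper below is runnable on its own; the Kedro wiring is sketched in comments because it assumes `valid_models` is available when `create_pipeline` runs (e.g. read from `conf/base/parameters.yml`), which Kedro does not do for you automatically:

```python
def make_combine_inputs(valid_models):
    # Map each model name to its catalog dataset name, e.g. "model1.output".
    return [f"{m}.output" for m in valid_models]


# Sketch of the Kedro wiring (assumes a variadic combine node):
#
# from kedro.pipeline import node, pipeline
#
# def create_pipeline(**kwargs):
#     valid_models = ["model1", "model2"]  # or load from parameters.yml
#     return pipeline([
#         node(
#             func=combine,
#             inputs=make_combine_inputs(valid_models),
#             outputs="final_output",
#         ),
#     ])
```

With this shape, adding a model means adding a catalog entry and one list item; the node function never changes.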

    Edit: it turns out that I can access the ${data_path} global parameter by defining it in base/parameters.yml and passing my path as a string:

    node.py
    import pandas as pd

    def combine(params):
        base_path = params["data_path"]
        for model in params["valid_models"]:
            data = pd.read_csv(f"{base_path}/output/{model}/output.csv")


    However, I am still not sure whether this is the best approach, or whether Kedro's own APIs can be used to implement this behaviour.
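A Kedro-native alternative worth considering (not from the original post, and hedged as an assumption about the directory layout): a partitioned dataset can load every CSV under the output directory as a dict of partitions, which a node can then filter against `params:valid_models` and concatenate. A catalog sketch, noting that the exact `type` string (`partitions.PartitionedDataset` vs. the older `PartitionedDataSet`) depends on the Kedro version in use:

```yaml
model_outputs:
  type: partitions.PartitionedDataset
  path: ${data_path}/output
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"
```

The receiving node gets a mapping of partition ids to load functions, so filtering to the valid models happens in plain Python rather than in the catalog.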

