
[Python] How to combine multiple similar datasets into one in Kedro dynamically?

Discussion in 'Python' started by Stack, October 8, 2024.


    I use Kedro to run pipelines for multiple models that each generate a CSV file with the same schema. However, I need to combine the generated CSV files into one and do some post-processing, which will produce my final output. This is what my catalog looks like:

    catalog.yml
    model1.output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/model1/output.csv

    model2.output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/model2/output.csv

    model3.output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/model3/output.csv

    final_output:
      type: pandas.CSVDataSet
      filepath: ${data_path}/output/final_output.csv
      layer: output



    I can define my node such that each output is called individually and combined, like the following:

    node.py
    def combine(model_1_output, model_2_output, model_3_output):
        ...

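A variadic version of that node keeps the function itself independent of how many models feed it. This is a sketch, not from the original post; the duplicate-dropping step is a hypothetical stand-in for the post-processing:

```python
import pandas as pd


def combine(*model_outputs: pd.DataFrame) -> pd.DataFrame:
    # The per-model outputs share a schema, so they can simply be stacked.
    combined = pd.concat(model_outputs, ignore_index=True)
    # Hypothetical post-processing step: drop exact duplicate rows.
    return combined.drop_duplicates().reset_index(drop=True)
```

Because the function takes `*args`, the same node works whether the pipeline passes it two dataset inputs or ten.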

    However, I would like to do that dynamically: I define a list of valid models in my parameters, and only the datasets of those models' outputs are loaded. This would help when future models are added, or when I need to run only a subset of my models.

    parameters.yml
    valid_models:
    - model1
    - model2


    Is there a way to do this? I've tried both pipeline and node hooks, but I couldn't solve this particular problem with either. It also seems that I can't access the global parameters where my ${data_path} is defined directly in a node, which would let me read the required datasets that way.
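One common pattern is to build the node's `inputs` list from the model list at pipeline-construction time, so the catalog entries stay declarative. The helper below is runnable on its own; the Kedro wiring is sketched in comments because it assumes `valid_models` is available when `create_pipeline` runs (e.g. read from `conf/base/parameters.yml`), which Kedro does not do for you automatically:

```python
def make_combine_inputs(valid_models):
    # Map each model name to its catalog dataset name, e.g. "model1.output".
    return [f"{m}.output" for m in valid_models]


# Sketch of the Kedro wiring (assumes a variadic combine node):
#
# from kedro.pipeline import node, pipeline
#
# def create_pipeline(**kwargs):
#     valid_models = ["model1", "model2"]  # or load from parameters.yml
#     return pipeline([
#         node(
#             func=combine,
#             inputs=make_combine_inputs(valid_models),
#             outputs="final_output",
#         ),
#     ])
```

With this shape, adding a model means adding a catalog entry and one list item; the node function never changes.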

    Edit: it turns out that I can access the ${data_path} global parameter by defining it in base/parameters.yml and passing my path as a string:

    node.py
    import pandas as pd

    def combine(params):
        base_path = params["data_path"]
        for model in params["valid_models"]:
            data = pd.read_csv(f"{base_path}/output/{model}/output.csv")


    However, I am still not sure whether this is the best approach, or whether Kedro's own APIs can be used to implement this behaviour.
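A Kedro-native alternative worth considering (not from the original post, and hedged as an assumption about the directory layout): a partitioned dataset can load every CSV under the output directory as a dict of partitions, which a node can then filter against `params:valid_models` and concatenate. A catalog sketch, noting that the exact `type` string (`partitions.PartitionedDataset` vs. the older `PartitionedDataSet`) depends on the Kedro version in use:

```yaml
model_outputs:
  type: partitions.PartitionedDataset
  path: ${data_path}/output
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"
```

The receiving node gets a mapping of partition ids to load functions, so filtering to the valid models happens in plain Python rather than in the catalog.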

