
[Python] Python polars write / read csv handling of linebreaks (eol)

Discussion in 'Python' started by Stack, October 4, 2024 at 00:22.


    I want to read in mockup data that contains a linebreak (eol) char. Here I utilize the faker package to simulate some data.

    I initialize a polars.DataFrame and write it to a .csv file. When I later try to read the CSV back in (see below), I get an error indicating that the column names, the dtypes (schema_overrides) and the data no longer line up.

    My best guess is that the error is caused by the linebreaks (eol) in the Address field: if I comment out the generation of the Address field, everything runs smoothly. So how does one best handle strings with linebreaks? I thought this would be caught by the quote_char (default ") and the quote_style in df.write_csv.

    Error


    pydf = PyDataFrame.read_csv(
    polars.exceptions.ComputeError: could not parse "Edwards, Duncan and Moore" as dtype date at column 'Date_of_birth' (column number 5)

    The current offset in the file is 309563 bytes.

    Code producing the error


    # Reading the file back in raises the error (dtypes is defined in the MRE below)
    df_throws = pl.read_csv("mockme_up.csv", schema_overrides=dtypes, separator=";")

    MRE Data


    import polars as pl
    from faker import Faker

    fake = Faker()
    # Create the mockup data using Faker
    N = int(1e4)
    data = {
        "Name": [fake.name() for _ in range(N)],
        "Address": [fake.address() for _ in range(N)],
        "Email": [fake.email() for _ in range(N)],
        "Phonenumber": [fake.phone_number() for _ in range(N)],
        "Date_of_birth": [
            fake.date_of_birth(minimum_age=18, maximum_age=90) for _ in range(N)
        ],
        "Company": [fake.company() for _ in range(N)],
        "Job": [fake.job() for _ in range(N)],
        "IBAN": [fake.iban() for _ in range(N)],
        "Creditcard": [fake.credit_card_number() for _ in range(N)],
        "Creation_date": [fake.date() for _ in range(N)],
    }

    dtypes = {
        "Name": pl.Utf8,
        "Address": pl.Utf8,
        "Email": pl.Utf8,
        "Phonenumber": pl.Utf8,
        "Date_of_birth": pl.Date,
        "Company": pl.Utf8,
        "Job": pl.Utf8,
        "IBAN": pl.Utf8,
        "Creditcard": pl.Int64,
        "Creation_date": pl.Date,
    }

    df = pl.DataFrame(data)
    df.write_csv("mockme_up.csv", separator=";", quote_style="non_numeric")
    print("=" * 50)
    print(f"Successfully created mockup data of shape {df.shape=}")
    print("=" * 50)
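
    To check whether the quoting in the written file is intact, independent of polars, one option is to count records with Python's standard csv module and compare that to the number of physical lines. This is only a minimal sketch: the helper name count_csv_records is made up for illustration, and it assumes the file uses ";" as separator and " as quote character, as in the MRE above.

```python
import csv


def count_csv_records(path, delimiter=";", quotechar='"'):
    """Return (records, physical_lines) for a CSV file.

    Hypothetical helper, not part of polars. A quoted field that contains
    a line break spans several physical lines but is one field of one record.
    """
    # newline="" lets the csv module see the line breaks untranslated
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter=delimiter, quotechar=quotechar)
        records = sum(1 for _ in reader)
    with open(path, newline="", encoding="utf-8") as fh:
        physical_lines = sum(1 for _ in fh)
    return records, physical_lines
```

    If count_csv_records("mockme_up.csv") reports N + 1 records (header plus N rows) and more physical lines than records, the linebreaks were quoted correctly on write, which would point at the reading side rather than at write_csv.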

    Update / Solution


    This is now tracked as a GitHub issue: https://github.com/pola-rs/polars/issues/19078

    As a workaround, passing n_threads=1 to read_csv() fixes the issue.
