[Python] Python polars write / read csv handling of linebreaks (eol)

Stack · October 4, 2024 at 00:22

I want to read in mockup data that contains a linebreak (eol) char. Here I utilize the faker package to simulate some data.

I initialize a polars.DataFrame and write it to a .csv file. When I later try to read the CSV back in (see below), I get an error indicating that the column names, the dtypes (schema_overrides), and the data no longer line up.

My best guess is that the error is caused by the linebreak / eol in the Address field. If I comment out the generation of the Address field, everything runs smoothly. So how does one best handle strings with linebreaks? I thought this would be handled by the quote_char (default: ") and the quote_style in df.write_csv.
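To illustrate the expectation above, here is a minimal sketch of how quoting is meant to protect embedded linebreaks. It uses Python's standard csv module rather than polars, and the sample row is invented for illustration. A quoted field may span several physical lines, and a reader that respects the quotes still sees one record.

```python
import csv
import io

# Hypothetical sample row; the address contains an embedded linebreak,
# similar to what fake.address() produces.
rows = [
    ["Name", "Address", "Date_of_birth"],
    ["Jane Doe", "123 Main St\nSpringfield, IL 62704", "1990-05-17"],
]

# Write with ";" as separator and quote every non-numeric field.
buffer = io.StringIO(newline="")
writer = csv.writer(
    buffer, delimiter=";", quotechar='"', quoting=csv.QUOTE_NONNUMERIC
)
writer.writerows(rows)
csv_text = buffer.getvalue()

# The data record occupies two physical lines because of the embedded "\n".
physical_lines = csv_text.splitlines()

# A quote-aware reader reassembles it into a single logical record.
parsed = list(
    csv.reader(io.StringIO(csv_text, newline=""), delimiter=";", quotechar='"')
)

print(len(physical_lines), "physical lines,", len(parsed), "logical records")
```

Any parser that splits the file into chunks at raw newline characters, without tracking whether it is inside a quoted field, would see the second half of the address as the start of a new row. That would be consistent with the error below, where a value from another column ends up being parsed as Date_of_birth.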

Error

pydf = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse "Edwards, Duncan and Moore" as dtype date at column 'Date_of_birth' (column number 5)

The current offset in the file is 309563 bytes.

Code Producing the error

# Reading the file back in raises the error (dtypes is defined in the MRE below)
df_throws = pl.read_csv("mockme_up.csv", schema_overrides=dtypes, separator=";")

MRE Data

import polars as pl
from faker import Faker

fake = Faker()
# Create the mock data using Faker
N = int(1e4)
data = {
    "Name": [fake.name() for _ in range(N)],
    "Address": [fake.address() for _ in range(N)],
    "Email": [fake.email() for _ in range(N)],
    "Phonenumber": [fake.phone_number() for _ in range(N)],
    "Date_of_birth": [
        fake.date_of_birth(minimum_age=18, maximum_age=90) for _ in range(N)
    ],
    "Company": [fake.company() for _ in range(N)],
    "Job": [fake.job() for _ in range(N)],
    "IBAN": [fake.iban() for _ in range(N)],
    "Creditcard": [fake.credit_card_number() for _ in range(N)],
    "Creation_date": [fake.date() for _ in range(N)],
}

dtypes = {
    "Name": pl.Utf8,
    "Address": pl.Utf8,
    "Email": pl.Utf8,
    "Phonenumber": pl.Utf8,
    "Date_of_birth": pl.Date,
    "Company": pl.Utf8,
    "Job": pl.Utf8,
    "IBAN": pl.Utf8,
    "Creditcard": pl.Int64,
    "Creation_date": pl.Date,
}

df = pl.DataFrame(data)
df.write_csv("mockme_up.csv", separator=";", quote_style="non_numeric")
print("=" * 50)
print(f"Successfully created mockup data of shape {df.shape=}")
print("=" * 50)

Update / Solution

This is now tracked as a GitHub issue: https://github.com/pola-rs/polars/issues/19078

As a workaround, passing n_threads=1 to read_csv() avoids the error.
