
[Python] 'UnicodeDecodeError: can't decode' error on file without encoding in Python...

Discussion in 'Python' started by Stack, September 12, 2024.

  1. Stack

    Stack (Participating Member)

    I am trying to open a file with Python's 'with open' statement. I plan to feed its contents into a text splitter, but that part is not relevant yet. Please see the following code block:

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        # Set a really small chunk size just to show.
        separators=[
            "\n\n",
            "\n",
            " ",
            ".",
            ",",
            "\u200b",  # Zero-width space
            "\uff0c",  # Fullwidth comma
            "\u3001",  # Ideographic comma
            "\uff0e",  # Fullwidth full stop
            "\u3002",  # Ideographic full stop
            "",
        ],
        chunk_size=100,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )

    # For now hardcoded:
    with open("My_path/My_document.docx", "r", encoding="utf-8") as f:
        document = f.read()

    split = text_splitter.create_documents([document])


    I have tried several encodings, as well as omitting the encoding argument entirely, but the result is always an error similar to this one:

    "UnicodeDecodeError: 'cp932' codec can't decode byte 0xef in position 18: illegal multibyte sequence."

    Here 'cp932' is the default encoding; the message names whichever encoding I pass instead. For clarification, the file I am trying to read is a Word document.

    I believe the file has no encoding because of the result of the following piece of code:

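    # Assumption: `detector` is chardet's UniversalDetector; the original
    # snippet did not show how it was created.
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()


    # Given a file, return the encoding type of said file.
    def get_encoding_type(current_file):
        detector.reset()
        with open(current_file, 'rb') as open_file:
            # Feed the raw bytes line by line until the detector is confident.
            for line in open_file:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        print(detector.result['encoding'])
        return detector.result['encoding']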


    The above code returns None, which might mean either that this code block does not work or that the file really has no text encoding.
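    One possible explanation, offered as an assumption rather than something confirmed in this post: a .docx file is a ZIP container holding XML parts, not plain text, so a text-encoding detector has nothing meaningful to report. A minimal sketch of how that could be checked with the standard library follows; the helper name `is_zip_container` is hypothetical.

```python
import zipfile


def is_zip_container(path):
    """Return True if the file at `path` is a ZIP archive.

    Office Open XML files such as .docx are ZIP archives, so a True result
    suggests the file should not be opened in text mode.
    """
    return zipfile.is_zipfile(path)
```

    If this returns True for "My_path/My_document.docx", the file would need to be unpacked or read with a Word-aware library rather than decoded as text.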

    What am I doing wrong?
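    A hedged sketch of one way around the error, assuming the file is a standard Office Open XML document: instead of decoding the whole file as text, open it as a ZIP archive and collect the text runs from its main part, word/document.xml. The function name `read_docx_text` is hypothetical, and the sketch simplifies by ignoring headers, footnotes, and nested text boxes; a dedicated library such as python-docx would be a more complete option.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread across one or more w:t elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

    The returned string could then take the place of `document` in the call `text_splitter.create_documents([document])`.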
