
[Python] Import the NLTK library and texts from the Project Gutenberg electronic text archive,...

Discussion in 'Python' started by Stack, October 25, 2024 at 11:12.

  1. Stack

    Stack Participating Member

    I need to process the text file chesterton-brown.txt. The tasks are:

    Determine the number of words in the text.

    Identify the 10 most frequently used words in the text and build a bar chart from them.

    Remove stop words and punctuation from the text, then find the 10 most frequently used words again and build a second bar chart from them.
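The counting and charting tasks above can be sketched with the standard library's collections.Counter plus matplotlib. This is only a minimal sketch: the helper names top_words and plot_top_words and the sample token list are made up for illustration, and matplotlib is assumed to be installed for the chart.

```python
from collections import Counter


def top_words(words, n=10):
    """Return the n most common items in words as (word, count) pairs."""
    return Counter(words).most_common(n)


def plot_top_words(pairs, title):
    """Draw a bar chart from (word, count) pairs (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    labels = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    plt.bar(labels, counts)
    plt.title(title)
    plt.xlabel("word")
    plt.ylabel("count")
    plt.show()


# Hypothetical sample tokens standing in for the real corpus
sample = ["the", "priest", "said", "the", "man", "the", "priest"]
print(len(sample))           # number of tokens in the sample
print(top_words(sample, 2))  # the two most frequent tokens with their counts
```

With the real text, the token list would replace sample, and plot_top_words(top_words(tokens), "Top 10 words") would draw the chart.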

    I would like to see the text I am processing. I have seen the following call used for this: brown = gutenberg.words('chesterton-brown.txt'). But it appears to return only 6 words. Are there really only 6 words in this file?
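A likely explanation for the 6 words: gutenberg.words() returns a lazy corpus view, and printing such a view shows only its first few tokens followed by "...". The full token count comes from len(). The sketch below assumes NLTK is installed and the corpus can be downloaded; load_words and summarize are hypothetical helper names.

```python
def load_words(fileid="chesterton-brown.txt"):
    """Return every token of one Gutenberg file as a plain Python list."""
    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg", quiet=True)  # does nothing if already downloaded
    return list(gutenberg.words(fileid))


def summarize(words, preview=20):
    """Return the total token count and the first `preview` tokens."""
    return len(words), list(words[:preview])


# Usage (needs NLTK and the downloaded corpus):
#   total, head = summarize(load_words())
#   print(total)  # total number of tokens, far more than 6
#   print(head)   # the first 20 tokens
```

To get the file contents as a single string instead of a token list, gutenberg.raw('chesterton-brown.txt') can be assigned to a variable in the same way.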

    Also, as far as I understand, to identify the 10 most used words I need to tokenize the text, then delete the stop words and count again. But I do not understand how to assign the contents of a text file to a variable so that I can perform these operations. In general, the topic seems very complicated to me, and searching for information has not helped me understand it. It would be great if someone could explain how this works in general and which functions are best to use.
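The stop-word step might look like the sketch below. It assumes the English stop word list from nltk.corpus.stopwords, and it treats any token without a letter or digit as punctuation, which is a simplification. The helper names content_words and english_stop_words are hypothetical.

```python
from collections import Counter


def english_stop_words():
    """Load NLTK's English stop word list (assumes NLTK is installed)."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # does nothing if already downloaded
    return set(stopwords.words("english"))


def content_words(words, stop_words):
    """Lowercase tokens, dropping stop words and tokens with no letters or digits."""
    stop_words = set(stop_words)
    cleaned = []
    for token in words:
        lowered = token.lower()
        if lowered in stop_words:
            continue  # stop word such as "the"
        if not any(ch.isalnum() for ch in lowered):
            continue  # pure punctuation such as "," or "--"
        cleaned.append(lowered)
    return cleaned


# Hypothetical sample tokens and a one-word stop list for illustration
sample = ["The", "priest", ",", "said", "--", "the", "Priest"]
cleaned = content_words(sample, {"the"})
print(cleaned)
print(Counter(cleaned).most_common(10))
```

With the real corpus, content_words(tokens, english_stop_words()) would produce the filtered list to count and chart again.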

    This is how I downloaded and imported the necessary text file. I just do not understand how to work with it.

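    import nltk
    from nltk.corpus import gutenberg

    # Download the Gutenberg corpus (skipped if it is already up to date)
    nltk.download('gutenberg')

    # List the file ids available in the corpus
    brown1 = gutenberg.fileids()
    print(brown1)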
