[Python] How to Accurately Extract Title, Headings, and Subheadings from PDF Research Papers?

Stack · October 3, 2024 at 08:52

I'm trying to extract the title, headings, and subheadings from research papers in PDF format. I've tried various approaches but haven't been able to get accurate results. Here are the steps I took:

1. Tried using PyMuPDF (fitz): I used PyMuPDF (fitz) to extract the text from the PDFs. I was able to get the text, but the formatting is lost, so headings and subheadings can't easily be told apart from body text. There is also extra noise from other parts of the document, such as citations and footnotes.

2. Prompting language models: I also experimented with prompting large language models (LLMs) to analyze the extracted text. I used Ollama for offline processing, but the results were not accurate enough. OpenAI's GPT and Gemini gave accurate output, but I want a solution that works offline.
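One offline direction that may help with the lost formatting described in step 1: PyMuPDF can return per-span font metadata through `page.get_text("dict")`, so headings might be separated from body text by font size. Below is a minimal sketch, assuming headings are set in a larger font than the body text (not true for every paper template). The names `load_spans` and `classify_spans` and the `1.15` ratio are hypothetical choices for illustration, not part of any library.

```python
from collections import Counter


def load_spans(pdf_path):
    """Collect non-empty text spans with their font sizes (needs PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so classify_spans works without it

    spans = []
    doc = fitz.open(pdf_path)
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            # Image blocks carry no "lines" key, so default to an empty list.
            for line in block.get("lines", []):
                for span in line["spans"]:
                    text = span["text"].strip()
                    if text:
                        spans.append({"text": text, "size": round(span["size"], 1)})
    doc.close()
    return spans


def classify_spans(spans, ratio=1.15):
    """Label each span as 'title', 'heading', or 'body' by font size.

    Assumptions (heuristic, not guaranteed): the body size is the one that
    carries the most characters, the title uses the largest size, and any
    span at least `ratio` times the body size is a heading.
    """
    if not spans:
        return []

    chars_per_size = Counter()
    for span in spans:
        chars_per_size[span["size"]] += len(span["text"])
    body_size = chars_per_size.most_common(1)[0][0]
    max_size = max(chars_per_size)

    labeled = []
    for span in spans:
        size = span["size"]
        if size == max_size and size > body_size:
            label = "title"
        elif size >= body_size * ratio:
            label = "heading"
        else:
            label = "body"
        labeled.append((label, span["text"]))
    return labeled
```

Something like `classify_spans(load_spans("paper.pdf"))` would then give labeled spans to filter on. Papers that mark headings with bold text at body size would need an extra check on the span's `font` or `flags` fields instead of size alone.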

What I've tried:

PyMuPDF (fitz)

Ollama (llama3.1, gemma)

OpenAI GPT and Gemini, which were accurate but require an internet connection.

PyPDF2 and similar libraries, but they also return unstructured text.

What I need:

Accurate extraction of the title, headings, and subheadings from PDF research papers.

An offline solution.

Minimal noise from extra content like citations, page numbers, etc.

Is there a reliable offline method or some additional steps I can take to:

Identify and accurately extract the title, headings, and subheadings.

Minimize noise and irrelevant content in the output.
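For the noise part of the question, a rule-based pass over the extracted lines can drop some obvious non-heading content before any heading detection runs. The following is a rough sketch, assuming page numbers appear as digit-only lines (optionally prefixed with "Page"), reference entries start with a bracketed number, and inline citations look like `[3]` or `[3, 4]`. The patterns and the `strip_noise` name are illustrative only and will miss other citation styles.

```python
import re

# Lines such as "12" or "Page 3" (assumed page-number format).
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d{1,4}\s*$", re.IGNORECASE)
# Reference-list entries such as "[1] Smith, J. ..." (assumed numeric style).
REFERENCE_ENTRY = re.compile(r"^\s*\[\d+\]\s+\S")
# Inline numeric citations such as "[3]" or "[3, 4]", plus leading spaces.
INLINE_CITATION = re.compile(r"\s*\[\d+(?:\s*[,\-]\s*\d+)*\]")


def strip_noise(lines):
    """Return lines with page numbers, reference entries, and inline
    numeric citations removed."""
    cleaned = []
    for line in lines:
        if PAGE_NUMBER.match(line) or REFERENCE_ENTRY.match(line):
            continue
        line = INLINE_CITATION.sub("", line)
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:
            cleaned.append(line)
    return cleaned
```

Numbered headings such as "1 Introduction" survive this filter because the page-number pattern only matches lines that contain nothing but the number.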

Any help would be appreciated. Thanks in advance!
