How do I remove captions when extracting text from a PDF?

Posted 2024-04-29 19:04:45


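I am trying to run LDA on a set of PDF files to identify the main topics in them. I am able to extract the data from the PDFs using pdfminer.

Question 1: The captions and descriptions of the charts and images in the PDFs are of no use to me. How can I remove these unwanted parts from the extracted text?

Question 2: Before running the LDA model, I want to remove all line breaks and punctuation from the text.

The code I use to extract the data is as follows: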

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
import gensim
from gensim import corpora
from pprint import pprint

document = open('C:/Users/kaurj/Desktop/File1.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
# The aggregator collects the layout objects of each processed page
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.get_pages(document):
    interpreter.process_page(page)
    # Layout of the page that was just processed
    layout = device.get_result()

    for element in layout:
        # Keep only horizontal text boxes; figures and other objects are skipped
        if isinstance(element, LTTextBoxHorizontal):
            values = element.get_text()
            print(values)

The File1 used in the code is embedded here:

https://onedrive.live.com/embed?cid=DA6170EA591F0D07&resid=DA6170EA591F0D07%21106&authkey=ALua6WdCD7Ct6zo&em=2


1 Answer
User
#1 · Posted 2024-04-29 19:04:45

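If the captions themselves follow some pattern (as they do in scientific texts), you can remove them with a regular expression. See link for a quick overview, and this one to try out a regular expression that matches the pattern. I assume the captions start with "Figure", followed by a number and a string of indeterminate length. That last part is what makes it a bit tricky: the caption most likely ends at a line break or some other indicator, and the details depend on the parser you use and on the document.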
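As a rough sketch of the regex idea above, assuming a caption starts a line with "Figure", "Fig." or "Table", then a number and a colon or period, and runs until the next blank line. The name `remove_captions`, the extra "Fig."/"Table" prefixes and the blank-line terminator are my own assumptions, not something pdfminer guarantees, so check them against your actual output first:

```python
import re

# Assumption: captions look like "Figure 3: ..." or "Table 2. ..." at the
# start of a line and end at the next blank line (or at the end of the text).
CAPTION_RE = re.compile(
    r"^(?:Figure|Fig\.|Table)\s+\d+[.:].*?(?:\n\s*\n|\Z)",
    flags=re.MULTILINE | re.DOTALL,
)


def remove_captions(text):
    """Return text with every paragraph that looks like a caption removed."""
    return CAPTION_RE.sub("", text)


sample = (
    "Topic models find themes in documents.\n"
    "\n"
    "Figure 1: Word counts\n"
    "per topic.\n"
    "\n"
    "LDA is one such model.\n"
)
print(remove_captions(sample))
```

Requiring the colon or period after the number keeps ordinary sentences such as "Figure 1 shows ..." in the text. Since pdfminer usually returns each caption as its own text box, it may be enough to apply the same pattern to each `element.get_text()` and skip the boxes that match.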

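There are several options for cleaning the text. Gensim has some tools, and so does NLTK. The simplest version is to use replace, a built-in Python string method: textdocument.replace("\n", ""), repeated for every character you want to swap for another one (or, as in this case, for "", i.e. nothing). Personally, I would recommend the clean-text package, which is very flexible and does most of the work for you.

For example: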

from cleantext import clean

text = """I am a sample text.
I have -many- weird characters, such as , . # and some numbers,
4335 and 12 more.
Here is a newline character \n and a $ sign.
Some words are CAPITALIZED and this is an email address: hello@example.com"""


clean(text,
        fix_unicode=True,               # fix various unicode errors
        lower=True,                     # lowercase text
        no_line_breaks=True,           # strip line breaks 
        no_emails=True,                # replace all email addresses with a special token
        no_numbers=True,               # replace all numbers with a special token
        no_digits=True,                # replace all digits with a special token
        no_currency_symbols=True,      # replace all currency symbols with a special token
        no_punct=True,                 # fully remove punctuation
        replace_with_email="",
        replace_with_number="",
        replace_with_digit="",
        replace_with_currency_symbol="",
        lang="en") 

Out[3]: 'i am a sample text i have many weird characters such as and some numbers and more here is a newline character and a sign some words are capitalized and this is an email address'
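If you would rather not install anything, the plain replace approach mentioned earlier can be done in a single pass with `str.translate`. This is only a minimal sketch: the function name is made up, it handles ASCII punctuation only (`string.punctuation`), and it turns line breaks into spaces rather than "" so that words on adjacent lines do not get glued together:

```python
import string

# Translation table: line breaks become spaces, ASCII punctuation is deleted.
# Assumption: English text; non-ASCII punctuation is left untouched.
_TABLE = str.maketrans(
    {"\n": " ", "\r": " ", **dict.fromkeys(string.punctuation)}
)


def strip_newlines_and_punctuation(text):
    """Replace line breaks with spaces, drop punctuation, collapse whitespace."""
    cleaned = text.translate(_TABLE)
    return " ".join(cleaned.split())


print(strip_newlines_and_punctuation("Hello,\nworld! Ready for LDA."))
# → Hello world Ready for LDA
```

Note that hyphens are deleted too, so "well-known" becomes "wellknown"; map "-" to a space in the table if you prefer two separate tokens.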
