How do I remove captions when extracting text from a PDF?

from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal
from pdfminer.layout import LTFigure
from pdfminer.pdfinterp import PDFPageInterpreter
import gensim
from gensim import corpora
from pprint import pprint

document = open('C:/Users/kaurj/Desktop/File1.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Lay out each page and print the text of every horizontal text box
for page in PDFPage.get_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()
    for element in layout:
        if isinstance(element, LTTextBoxHorizontal):
            values = element.get_text()
            print(values)

1 Answer

User

#1 · Posted on 2024-05-16 23:07:18

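If the captions themselves follow a pattern (as they do in scientific texts), you can remove them with a regular expression. See link for a quick overview, and this one to try out a regular expression that matches the pattern. I assume the captions start with "Figure", followed by a number and a string of indeterminate length. That last part is what makes it a bit tricky: the caption most likely ends at a line break or some other marker, and the details depend on the parser you use and on your document.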
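A minimal sketch of that idea, using only Python's built-in re module. It assumes each caption comes out of the parser as its own text block and starts with "Figure" or "Fig." plus a number; neither is guaranteed for your document. CAPTION_RE, drop_captions and the sample blocks are made up for illustration and stand in for the strings returned by element.get_text().

```python
import re

# Assumption: a caption begins its text block with "Figure" or "Fig."
# followed by a number. Adjust the pattern to your document.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.)\s*\d+", re.IGNORECASE)


def drop_captions(blocks):
    """Return only the text blocks that do not look like figure captions."""
    return [block for block in blocks if not CAPTION_RE.match(block)]


# Hypothetical sample blocks standing in for element.get_text() results.
blocks = [
    "Introduction\nThis paper studies text extraction.\n",
    "Figure 1: Overview of the pipeline.\n",
    "Fig. 2 Results on the test set.\n",
    "As shown in Figure 1, the pipeline has three stages.\n",
]

kept = drop_captions(blocks)
for block in kept:
    print(block)
```

Because the pattern is anchored at the start of the block, body text that merely mentions "Figure 1" in the middle of a sentence is kept. If your parser merges a caption into the surrounding paragraph, this block-level filter will not catch it and you would need a pattern that also finds where the caption ends.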

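There are several options for cleaning the text. Gensim has some tools for this, and so does NLTK. The simplest version is to use replace, a built-in Python string method: textdocument.replace("\n", ""), repeated for every character you want to swap for another one (or, as in this case, for "", i.e. nothing). Personally I would recommend the clean-text package, which is very flexible and does most of the work for you.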
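The replace approach could look roughly like this. The sample string and the replacements table are invented for illustration, and this sketch swaps newlines and tabs for a space instead of deleting them so that neighbouring words do not run together.

```python
# Hypothetical sample text standing in for the extracted document.
textdocument = "First line.\nSecond\tline with a $ sign.\r\n"

# Map each unwanted character to its replacement ("" deletes it).
replacements = {"\r": "", "\n": " ", "\t": " ", "$": ""}

cleaned = textdocument
for old, new in replacements.items():
    cleaned = cleaned.replace(old, new)

# Collapse the runs of spaces left behind by the replacements.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

This is fine for a handful of characters; once the list grows, a dedicated cleaning package is usually less work.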

An example using clean-text:

from cleantext import clean

text = """I am a sample text.
I have -many- weird characters, such as , . # and some numbers,
4335 and 12 more.
Here is a newline character \n and a $ sign.
Some words are CAPITALIZED and this is an email address: hello@example.com"""


clean(text,
        fix_unicode=True,               # fix various unicode errors
        lower=True,                     # lowercase text
        no_line_breaks=True,           # strip line breaks 
        no_emails=True,                # replace all email addresses with a special token
        no_numbers=True,               # replace all numbers with a special token
        no_digits=True,                # replace all digits with a special token
        no_currency_symbols=True,      # replace all currency symbols with a special token
        no_punct=True,                 # fully remove punctuation
        replace_with_email="",
        replace_with_number="",
        replace_with_digit="",
        replace_with_currency_symbol="",
        lang="en") 

Out[3]: 'i am a sample text i have many weird characters such as and some numbers and more here is a newline character and a sign some words are capitalized and this is an email address'
