How can I make pypdf read page content line by line?

import pyPdf

def getPDFContent(path):
    content = ""
    num_pages = 10
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(9, num_pages):
        x = pdf.getPage(i).extractText()+'\n'
        content += x
    # Collapses all whitespace, including newlines, into single spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

con = getPDFContent("document.pdf")
print con

1 answer

User

#1 · Posted on 2024-05-15 09:52:23

You could try using subprocess to call pdftotext from the poppler utilities (possibly with the -layout option). It works much better for me than pypdf does.
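To get the text line by line, the output of pdftotext can simply be split on line breaks. The following is a minimal sketch, assuming Python 3 and that the pdftotext binary is on the PATH; the names `split_layout_lines` and `pdf_lines` are hypothetical helpers made up for this illustration.

```python
import subprocess


def split_layout_lines(text):
    """Return the non-empty lines of extracted text, trailing whitespace removed."""
    # str.splitlines() also treats the form feed that pdftotext emits
    # between pages as a line boundary.
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def pdf_lines(pdf, page=None):
    """Run pdftotext -layout on a PDF and return its text as a list of lines.

    Assumes the pdftotext binary from the poppler utilities is installed.
    """
    args = ['pdftotext', '-layout', '-q']
    if page is not None:
        # -f and -l select the first and last page to convert
        args += ['-f', str(page), '-l', str(page)]
    args += [pdf, '-']  # '-' sends the text to stdout
    # universal_newlines=True makes check_output return str instead of bytes
    txt = subprocess.check_output(args, universal_newlines=True)
    return split_layout_lines(txt)
```

With this, something like `for line in pdf_lines("document.pdf", page=10): print(line)` would print the tenth page one line at a time. Note that pdftotext counts pages from 1, whereas `getPage(9)` in the question counts from 0.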

For example, I use the following code to extract CAS numbers from PDF files:

import subprocess
import re

def findCAS(pdf, page=None):
    '''Find all CAS numbers on the numbered page of a file.

    Arguments:
    pdf -- Name of the PDF file to search
    page -- number of the page to search. if None, search all pages.
    '''
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    # universal_newlines=True returns text rather than bytes on Python 3
    txt = subprocess.check_output(args, universal_newlines=True)
    candidates = re.findall(r'\d{2,6}-\d{2}-\d{1}', txt)
    checked = [x.lstrip('0') for x in candidates if checkCAS(x)]
    return list(set(checked))

def checkCAS(cas):
    '''Check if a string is a valid CAS number.

    Arguments:
    cas -- string to check
    '''
    nums = cas[::-1].replace('-', '') # all digits in reverse order
    checksum = int(nums[0]) # first digit is the checksum
    som = 0
    # Checksum method from: http://nl.wikipedia.org/wiki/CAS-nummer
    for n, d in enumerate(nums[1:]):
        som += (n+1)*int(d)
    return som % 10 == checksum
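
As a quick sanity check of the checksum rule, take the CAS number of water, 7732-18-5. The last digit (5) is the check digit; weighting the remaining digits 1, 2, 3, ... from the right gives 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5. The sketch below restates the same rule in compact form; the name `cas_checksum_ok` is a hypothetical stand-in for this illustration, not part of the code above.

```python
def cas_checksum_ok(cas):
    """Return True if the last digit of a CAS number matches its checksum."""
    digits = cas.replace('-', '')
    check = int(digits[-1])  # the final digit is the check digit
    # Weight the remaining digits 1, 2, 3, ... counting from the right
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits[:-1]), start=1))
    return total % 10 == check


print(cas_checksum_ok('7732-18-5'))  # water: True
print(cas_checksum_ok('7732-18-4'))  # wrong check digit: False
```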
