Cleaning up a text file after parsing a PDF

0 votes
2 answers
3684 views
Asked 2025-04-18 15:50

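I have parsed a PDF file and cleaned up the contents as best I can, but I am having trouble aligning the information in the resulting text file.

My output looks like this: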

Zone
1
Report Name
ARREST
Incident Time
01:41
Location of Occurrence
1300 block Liverpool St
Neighborhood
Highland Park
Incident
14081898
Age
27
Gender
M
Section
3921(a)
3925
903
Description
Theft by Unlawful Taking or Disposition - Movable item
Receiving Stolen Property.
Criminal Conspiracy.

I would like it to look like this:

Zone:    1
Report Name:    ARREST
Incident Time:    01:41
Location of Occurrence:    1300 block Liverpool St
Neighborhood:    Highland Park
Incident:    14081898
Age:    27
Gender:    M
Section, Description:
3921(a): Theft by Unlawful Taking or Disposition - Movable item
3925: Receiving Stolen Property.
903: Criminal Conspiracy.

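I tried looping over the list, but the problem is that some fields are not always present, so I end up extracting the wrong information.

Here is the code I use to parse the PDF: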

import os
import sys
import urllib2
import time
from datetime import datetime, timedelta
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def parsePDF(infile, outfile):

    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outtype = 'text'
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    rsrcmgr = PDFResourceManager(caching=caching)

    if outfile:
        outfp = file(outfile, 'w+')
    else:
        outfp = sys.stdout

    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, imagewriter=imagewriter)
    fp = file(infile, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):

        interpreter.process_page(page)
    fp.close()
    device.close()
    outfp.close()
    return  


# Set time zone to EST
#os.environ['TZ'] = 'America/New_York'
#time.tzset()

# make sure folder system is set up
if not os.path.exists("../pdf/"):
    os.makedirs("../pdf/")
if not os.path.exists("../txt/"):
    os.makedirs("../txt/")

# Get yesterday's name and lowercase it
yesterday = (datetime.today() - timedelta(1))
yesterday_string = yesterday.strftime("%A").lower()

# Also make a numerical representation of date for filename purposes
yesterday_short = yesterday.strftime("%Y%m%d")

# Get pdf from blotter site, save it in a file
pdf = urllib2.urlopen("http://www.city.pittsburgh.pa.us/police/blotter/blotter_" + yesterday_string + ".pdf").read()
# Write in binary mode so the PDF bytes are not altered on platforms that translate newlines
f = file("../pdf/" + yesterday_short + ".pdf", "wb")
f.write(pdf)
f.close()

# Convert pdf to text file
parsePDF("../pdf/" + yesterday_short + ".pdf", "../txt/" + yesterday_short + ".txt")

# Save text file contents in variable
parsed_pdf = file("../txt/" + yesterday_short + ".txt", "r").read()

Here is what I have so far:

import os

OddsnEnds = [ "PITTSBURGH BUREAU OF POLICE", "Incident Blotter", "Sorted by:", "DISCLAIMER:", "Incident Date", "assumes", "Page", "Report Name"]    


if not os.path.exists("../out/"):
    os.makedirs("../out/")  
with open("../txt/20140731.txt", 'r') as infile:
    blotterList = infile.readlines()

with open("../out/test2.txt", 'w') as outfile:
    cleanList = []
    for line in blotterList:
        if not any(o in line for o in OddsnEnds):
            cleanList.append(line)
    while '\n' in cleanList:
        cleanList.remove('\n')
    for i in [i for i, j in enumerate(cleanList) if j == 'ARREST\n']:
        print('Incident:%s' % cleanList[i])
    for i in [i for i, j in enumerate(cleanList) if j == 'Incident Time\n']:
        print('Time:%s' % cleanList[i+1])

But when I loop over the list, the output I get is:

Time:16:20

Time:17:40

Time:17:53

Time:18:05

Time:Location of Occurrence

because no time was provided for that incident. Also, as a side note, every string ends with \n.

Any suggestions and help are greatly appreciated.
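One way to avoid the misalignment described above is to stop assuming that the line after a label is its value, and instead group every line under the most recent known label. The following is a minimal Python 3 sketch, not a tested solution for the real blotter files. It assumes the label names are exactly those in the sample output, that no value is ever identical to a label, and that each description fits on one line. `parse_record` and `format_record` are hypothetical helper names.

```python
# Field labels taken from the sample output in the question (assumed complete).
LABELS = [
    "Zone", "Report Name", "Incident Time", "Location of Occurrence",
    "Neighborhood", "Incident", "Age", "Gender", "Section", "Description",
]
LABEL_SET = set(LABELS)


def parse_record(lines):
    """Group lines under the most recent label; a label with no value keeps an empty list."""
    record = {}
    current = None
    for raw in lines:
        line = raw.strip()  # drops the trailing "\n" on every string
        if not line:
            continue
        if line in LABEL_SET:
            current = line
            record.setdefault(current, [])
        elif current is not None:
            record[current].append(line)
    return record


def format_record(record):
    """Render one record in the 'Label:    value' layout requested in the question."""
    out = []
    for label in LABELS:
        if label in ("Section", "Description") or label not in record:
            continue
        out.append("%s:    %s" % (label, " ".join(record[label])))
    sections = record.get("Section", [])
    descriptions = record.get("Description", [])
    if sections or descriptions:
        out.append("Section, Description:")
        # Pair the n-th section with the n-th description (assumes one line each).
        for sec, desc in zip(sections, descriptions):
            out.append("%s: %s" % (sec, desc))
    return "\n".join(out)
```

With this approach an incident that has no time yields an empty "Incident Time" value instead of picking up the next label. Splitting the file into individual records (for example at each "Zone" line) is left out here and would depend on the actual file.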

2 Answers

1

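In general, extracting text from PDF files cannot always be done with 100% accuracy, especially when you want to preserve the formatting, spacing, and layout of the text. I learned this from a technical support person at the company that makes a popular PDF text extraction library (xpdf), back when I was working on a related project. At the time I looked into several libraries for extracting text from PDFs, including xpdf and a few others. There are definite technical reasons why they cannot always give perfect results (although in many cases they do); those reasons have to do with the nature of the PDF format and with how PDFs are generated. When you extract text from some PDFs, the layout and spacing may not be preserved, even if you use an option like keep_format=True in the library.

The only permanent solution is to avoid having to extract text from PDF files at all. Instead, try to work with the data format and data source that were used to generate the PDF, and do your text extraction or processing on that. Of course, if you do not have access to those sources, that is easier said than done.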

2

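My favorite way to extract text from a PDF file is the pdftotext tool (part of poppler). I usually add the -layout option, which does a good job of preserving the document's original layout.

You can use this tool from Python via the subprocess module.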
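As a rough illustration of the subprocess approach, here is a minimal Python 3 sketch. It assumes pdftotext is installed and on the PATH; `pdftotext_command` and `pdf_to_text` are hypothetical helper names, not part of any library. On Python 2, which the question's code uses, `subprocess.check_call` would play the role of `subprocess.run(..., check=True)`.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for: pdftotext -layout <pdf> <txt>."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to text; raises CalledProcessError if pdftotext exits non-zero."""
    # Passing a list (not a shell string) avoids quoting problems with odd file names.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

For example, `pdf_to_text("../pdf/20140731.pdf", "../txt/20140731.txt")` would write the layout-preserving text next to the files used in the question.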
