How do I extract specific parts of a PDF?

Posted 2024-06-07 06:50:03


I have a PDF file and want to pull out different parts of its text.

For example, suppose a page contains the following:

0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed 
to teach fundamental skills in reading, writing and arithmetic. The typical age range of 
participants can be used to distinguish between detailed field 0011 ‘Basic programmes and 
qualifications’ and this detailed field. 
Programmes and qualifications with the following main content are classified here:
Basic remedial programmes for youth or adults
Literacy
Numeracy
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual’s capacity (mental, 
social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic
Programmes and qualifications with the following main content are classified here:

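I want every line that starts with a four-digit number, plus the paragraph that follows each of those lines, up to this sentence: Programmes and qualifications with the following main content are classified here:

So the result should be: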

First_list = ['0021 Literacy and numeracy', '0031 Personal skills']
second_list = ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field.',
               'Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic']

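I tried to do this but could not finish it.

My approach was to extract the PDF's text and then search for the keywords that come right before the text I want, or that sit on the same line as it: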

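import re
import PyPDF2

f = open('f.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)  # legacy PyPDF2 (< 3.0) API
num_pages = pdf_reader.numPages
count = 0
text = ''
# Concatenate the text of every page
while count < num_pages:
    pageObj = pdf_reader.getPage(count)
    count += 1
    text += pageObj.extractText()
# Every occurrence of the sentence that ends each paragraph
text_before = re.findall('Programmes and qualifications with the following main content are classified here', text)
# Every run of four digits (only the digits, not the whole line)
four_digit = re.findall(r'\d\d\d\d', text)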

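So the first regex finds the sentence whose preceding paragraph I need, and the second finds the four-digit numbers whose whole lines I want.

Do you know how I can complete this code?

Note: the four-digit number is always at the start of the line.

I should also mention that re.search('Programmes and qualifications with the following main content are classified here', text) gives me the start and end positions of that sentence. So I know where to stop collecting text, but how do I find the starting point?

As for re.search(r'\d\d\d\d', text), I would still need to extend the match's span to the end of the line. That is as far as I have got with the problem above.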
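The span idea described above can be sketched as follows. This is only an illustration, not part of the original attempt: the `text` sample, the `STOP` constant and the `heading_spans` helper are made-up names, and the sample string stands in for whatever the PDF extraction returns. With `re.MULTILINE`, `^` and `$` match at every line boundary, so one match covers the whole heading line and its `end()` is the point where the paragraph starts.

```python
import re

# Stand-in for the text extracted from the PDF (an assumption for illustration)
text = (
    "0021 Literacy and numeracy\n"
    "Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed \n"
    "to teach fundamental skills in reading, writing and arithmetic.\n"
    "Programmes and qualifications with the following main content are classified here:\n"
    "Literacy\n"
    "0031 Personal skills\n"
    "Personal skills are defined by reference to the effects on the individual's capacity.\n"
    "Programmes and qualifications with the following main content are classified here:\n"
)

STOP = "Programmes and qualifications with the following main content are classified here"


def heading_spans(text):
    """Return (line, start, end) for every line that begins with exactly four digits."""
    # \b after the digits rejects longer digit runs such as "00215"
    pattern = re.compile(r"^\d{4}\b.*$", re.MULTILINE)
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


for line, start, end in heading_spans(text):
    # Position of the first stop sentence after this heading (-1 if there is none)
    stop = text.find(STOP, end)
    print(line, (start, end), stop)
```

The paragraph for each heading is then `text[end:stop]`, provided `stop` is not -1.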


1 answer
User
#1 · Posted 2024-06-07 06:50:03

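You could use the nltk library to split the text into sentences and, for each line that starts with an integer, take the first sentence that follows it. Here is a snippet: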

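import re
import nltk  # sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded

lines = text.split('\n')
res = [[], []]  # res[0]: heading lines, res[1]: first sentence after each heading
for idx, sent in enumerate(lines):
    ints = re.findall(r'(\d+)', sent)
    # A heading is a line whose first four characters form a complete number
    if sent[:4] in ints:
        res[0].append(sent)
        res[1].append(nltk.sent_tokenize(' '.join(lines[idx+1:]))[0])
res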

Output:

[['0021 Literacy and numeracy', '0031 Personal skills'],
 ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed  to teach fundamental skills in reading, writing and arithmetic.',
  'Personal skills are defined by reference to the effects on the individual’s capacity (mental,  social etc.).']]
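Note that this output stops at the first sentence after each heading, while the question asks for the whole paragraph up to the "Programmes and qualifications ..." sentence. A minimal sketch of one way to get that with a single regular expression and no nltk is shown here. It is an alternative technique, not the method used in the answer, and `extract_sections`, `STOP` and `sample` are hypothetical names introduced for illustration. It assumes every heading is eventually followed by the stop sentence; if one is missing, the lazy match will run on into the next section.

```python
import re

STOP = "Programmes and qualifications with the following main content are classified here"


def extract_sections(text, stop=STOP):
    """Return (headings, paragraphs) for lines that start with four digits."""
    pattern = re.compile(
        r"^(\d{4}\b[^\n]*)\n"             # heading: four digits at the start of a line
        r"(.*?)"                           # paragraph: as little text as possible ...
        r"(?=" + re.escape(stop) + r")",   # ... up to, but not including, the stop sentence
        re.MULTILINE | re.DOTALL,
    )
    headings, paragraphs = [], []
    for m in pattern.finditer(text):
        headings.append(m.group(1).strip())
        # Collapse line breaks and repeated spaces left over from the PDF layout
        paragraphs.append(" ".join(m.group(2).split()))
    return headings, paragraphs


# Shortened stand-in for the extracted PDF text (an assumption for illustration)
sample = (
    "0021 Literacy and numeracy\n"
    "Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed \n"
    "to teach fundamental skills in reading, writing and arithmetic.\n"
    "Programmes and qualifications with the following main content are classified here:\n"
    "Literacy\n"
    "Numeracy\n"
    "003 Personal skills\n"
    "0031 Personal skills\n"
    "Personal skills are defined by reference to the effects on the individual's capacity.\n"
    "Programmes and qualifications with the following main content are classified here:\n"
)

first_list, second_list = extract_sections(sample)
print(first_list)
print(second_list)
```

The three-digit line "003 Personal skills" is skipped because the pattern requires exactly four digits followed by a word boundary.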
