How do I extract specific parts of a PDF?

0021 Literacy and numeracy Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field. Programmes and qualifications with the following main content are classified here: Basic remedial programmes for youth or adults Literacy Numeracy 003 Personal skills 0031 Personal skills Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic Programmes and qualifications with the following main content are classified here:

first_list = ['0021 Literacy and numeracy', '0031 Personal skills']
second_list = ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field.', 'Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic']

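import re
import PyPDF2

# Legacy PyPDF2 (< 3.0) API, as used in the original attempt.
with open('f.pdf', 'rb') as f:
    pdf_reader = PyPDF2.PdfFileReader(f)
    num_pages = pdf_reader.numPages
    count = 0
    text = ''
    # Concatenate the extracted text of every page.
    while count < num_pages:
        pageObj = pdf_reader.getPage(count)
        count += 1
        text += pageObj.extractText()

text_before = re.findall('Programmes and qualifications with the following main content are classified here', text)
# Renamed from 4_digit: a Python name cannot start with a digit.
four_digit = re.findall(r'\d\d\d\d', text)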

1 answer

User

#1 · Posted 2024-06-07 06:50:03

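You can try using the nltk library to split the text into sentences: find each line that starts with a four-digit number, then take the first sentence that follows it. Here is a snippet: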

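import re
import nltk

# sent_tokenize needs NLTK's Punkt data, installed once with e.g.
# nltk.download('punkt') (or 'punkt_tab' on newer NLTK releases).
# `text` is the string extracted from the PDF in the question.
lines = text.split('\n')
res = [[], []]
for idx, sent in enumerate(lines):
    ints = re.findall(r'(\d+)', sent)
    # A heading line starts with a four-digit code.
    if sent[:4] in ints:
        res[0].append(sent)
        # First sentence of everything that follows the heading.
        res[1].append(nltk.sent_tokenize(' '.join(lines[idx+1:]))[0])
res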

Output:

[['0021 Literacy and numeracy', '0031 Personal skills'],
 ['Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed  to teach fundamental skills in reading, writing and arithmetic.',
  'Personal skills are defined by reference to the effects on the individual’s capacity (mental,  social etc.).']]
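
If installing nltk and its tokenizer data is not an option, the same idea can be sketched with the standard library alone. This is only a rough sketch, not the method from the answer above: the names split_headings and first_sentence, both regular expressions, and the line breaks in the sample text are assumptions made for illustration. The regex sentence splitter is much cruder than nltk's and will cut too early on abbreviations followed by a space, such as "e.g. ".

```python
import re

# Assumed heading layout: four digits, whitespace, then a capitalised title.
# Requiring a capital letter skips wrapped body lines such as "0011 ‘Basic ...".
HEADING = re.compile(r"^\d{4}\s+[A-Z]")
# Crude sentence end: '.', '!' or '?' followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def first_sentence(text):
    """Return text up to and including the first sentence terminator."""
    match = SENTENCE_END.search(text)
    return text[:match.end()].strip() if match else text.strip()


def split_headings(text):
    """Collect heading lines and the first sentence that follows each one."""
    lines = [line.strip() for line in text.splitlines()]
    headings, descriptions = [], []
    for idx, line in enumerate(lines):
        if HEADING.match(line):
            rest = " ".join(l for l in lines[idx + 1:] if l)
            headings.append(line)
            descriptions.append(first_sentence(rest))
    return headings, descriptions


# Sample taken from the question; the line breaks are assumed, since the
# real breaks depend on what the PDF extractor returns.
sample = """0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed
to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field
0011 ‘Basic programmes and qualifications’ and this detailed field.
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.).
"""

headings, descriptions = split_headings(sample)
print(headings)  # ['0021 Literacy and numeracy', '0031 Personal skills']
for description in descriptions:
    print(description)
```

Both this sketch and the nltk snippet depend on the extracted text keeping each heading on its own line. If the extractor returns one unbroken string, neither will find the headings.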
