I have a PDF file and I want to extract different parts of its text.
For example, suppose I have the following page:
0021 Literacy and numeracy
Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed
to teach fundamental skills in reading, writing and arithmetic. The typical age range of
participants can be used to distinguish between detailed field 0011 ‘Basic programmes and
qualifications’ and this detailed field.
Programmes and qualifications with the following main content are classified here:
Basic remedial programmes for youth or adults
Literacy
Numeracy
003 Personal skills
0031 Personal skills
Personal skills are defined by reference to the effects on the individual’s capacity (mental,
social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic
Programmes and qualifications with the following main content are classified here:
I want all the lines that contain a 4-digit number, together with the paragraph that follows each of them, up to this sentence:
Programmes and qualifications with the following main content are classified here:
So the result would be:
First_list = ["0021 Literacy and numeracy", "0031 Personal skills"]
second_list = ["Literacy and numeracy are programmes or qualifications arranged mainly for adults, designed to teach fundamental skills in reading, writing and arithmetic. The typical age range of participants can be used to distinguish between detailed field 0011 ‘Basic programmes and qualifications’ and this detailed field.", "Personal skills are defined by reference to the effects on the individual’s capacity (mental, social etc.). This detailed field covers personal skills programmes not included in 0011 ‘Basic"]
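I tried to do this but could not finish it.
I tried to get the text of the PDF and find the keywords that come before, or on the same line as, the text I want: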
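import re
import PyPDF2  # legacy PyPDF2 (< 3.0) API, as in the original attempt

f = open('f.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)
num_pages = pdf_reader.numPages
count = 0
text = ''
# Concatenate the extracted text of every page
while count < num_pages:
    pageObj = pdf_reader.getPage(count)
    count += 1
    text += pageObj.extractText()
# Every occurrence of the sentence that ends each paragraph
text_before = re.findall('Programmes and qualifications with the following main content are classified here', text)
# Every run of four digits (this matches only the digits, not the whole line)
four_digit = re.findall(r'\d\d\d\d', text)
So I think text_before is exactly the line whose preceding paragraph I need, and four_digit holds the numbers whose whole lines I want.
Do you know how I can complete this code?
Note: the 4-digit number is at the start of the line.
I should also mention that text_before = re.search('Programmes and qualifications with the following main content are classified here', text) gives me the start and end of that sentence. So I know where to stop collecting text, but how do I find the starting point?
As for four_digit = re.search(r'\d\d\d\d', text), I would need to extend its span to the end of the line.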
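The span idea described above can be sketched as follows. This is only a minimal sketch, assuming the extracted text keeps its line breaks and that every heading starts its line with exactly four digits; `split_by_spans` and `STOP` are hypothetical names introduced for illustration. With `re.MULTILINE`, `^` matches at the start of every line, so each match already covers the whole heading line, and its `end()` is the starting point of the paragraph.

```python
import re

# The sentence that marks the end of each paragraph (without the trailing colon)
STOP = "Programmes and qualifications with the following main content are classified here"


def split_by_spans(text):
    """Return (headings, paragraphs) located through regex match spans."""
    headings, paragraphs = [], []
    # re.MULTILINE lets ^ match at the start of each line, so a match is one
    # whole line that begins with exactly four digits.
    matches = list(re.finditer(r"^\d{4}\b[^\n]*", text, flags=re.MULTILINE))
    for i, m in enumerate(matches):
        headings.append(m.group().strip())
        start = m.end()  # the paragraph begins right after the heading line
        # Never read past the next heading, even if the stop sentence is missing
        limit = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        stop = text.find(STOP, start, limit)
        if stop == -1:
            stop = limit
        # Collapse the line breaks inside the paragraph into single spaces
        paragraphs.append(" ".join(text[start:stop].split()))
    return headings, paragraphs
```

Calling `split_by_spans(text)` on the text collected from the PDF returns the two lists in matching order, so `headings[i]` belongs to `paragraphs[i]`. A line such as `003 Personal skills` is skipped because it has only three leading digits.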
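Here is my answer to the question above.
You can try using the nltk library to split the text into sentences, and then select the sentence that follows each line meeting the integer condition. I have added a snippet and its output here: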
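The snippet the answer refers to is not shown here. As a stand-in, the following is a minimal sketch of the same "take what follows the integer line" idea using only the standard library instead of nltk, so it walks the text line by line rather than sentence by sentence. The names `split_by_lines` and `STOP_PREFIX` are hypothetical, and the sketch assumes the extracted text keeps one heading per line.

```python
import re

# Start of the sentence that closes each paragraph (an assumed, shortened marker)
STOP_PREFIX = "Programmes and qualifications with the following main content"


def split_by_lines(text):
    """Return (headings, paragraphs) by scanning the text one line at a time."""
    headings, paragraphs = [], []
    current = None  # lines of the paragraph being collected; None while idle
    for raw in text.splitlines():
        line = raw.strip()
        if re.match(r"\d{4}\b", line):
            # A new heading: close any paragraph still open, then start a new one
            if current is not None:
                paragraphs.append(" ".join(current))
            headings.append(line)
            current = []
        elif line.startswith(STOP_PREFIX):
            # The stop sentence ends the paragraph being collected
            if current is not None:
                paragraphs.append(" ".join(current))
                current = None
        elif current is not None and line:
            current.append(line)
    # A heading at the very end may have no stop sentence after it
    if current is not None:
        paragraphs.append(" ".join(current))
    return headings, paragraphs
```

One limitation of this sketch: a body line that happens to begin with a four-digit number, such as a year, would be treated as a heading.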