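I am trying to extract specific sections of text from an MS Word document (link), as in the sample below. Basically, I need to write all the text enclosed by the markers -- ASN1START and -- ASN1STOP to a file, excluding the markers themselves.
Sample text: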
-- ASN1START
CounterCheck ::= SEQUENCE {
rrc-TransactionIdentifier RRC-TransactionIdentifier,
criticalExtensions CHOICE {
c1 CHOICE {
counterCheck-r8 CounterCheck-r8-IEs,
spare3 NULL, spare2 NULL, spare1 NULL
},
criticalExtensionsFuture SEQUENCE {}
}
}
CounterCheck-r8-IEs ::= SEQUENCE {
drb-CountMSB-InfoList DRB-CountMSB-InfoList,
nonCriticalExtension CounterCheck-v8a0-IEs OPTIONAL
}
CounterCheck-v8a0-IEs ::= SEQUENCE {
lateNonCriticalExtension OCTET STRING OPTIONAL,
nonCriticalExtension CounterCheck-v1530-IEs OPTIONAL
}
CounterCheck-v1530-IEs ::= SEQUENCE {
drb-CountMSB-InfoListExt-r15 DRB-CountMSB-InfoListExt-r15 OPTIONAL, -- Need ON
nonCriticalExtension SEQUENCE {} OPTIONAL
}
DRB-CountMSB-InfoList ::= SEQUENCE (SIZE (1..maxDRB)) OF DRB-CountMSB-Info
DRB-CountMSB-InfoListExt-r15 ::= SEQUENCE (SIZE (1..maxDRBExt-r15)) OF DRB-CountMSB-Info
DRB-CountMSB-Info ::= SEQUENCE {
drb-Identity DRB-Identity,
countMSB-Uplink INTEGER(0..33554431),
countMSB-Downlink INTEGER(0..33554431)
}
-- ASN1STOP
I tried using docx:
import re
from docx import Document

fileName = './data/36331-f80.docx'
document = Document(fileName)
startText = re.compile(r'-- ASN1START')
for para in document.paragraphs:
    # look at each paragraph in turn
    text = para.text
    print(text)
    # if startText.match(para.text):
    #     print(text)
It seems that each line between the markers above is a separate paragraph. I need help extracting the text between the markers.
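You can try first reading all of the document's paragraph text into a single string, and then using re.findall to find every stretch of text between the target markers. Note that the regular expression uses DOTALL mode, so that .* can match content between the markers that spans line breaks.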
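The code that accompanied this answer is not preserved here, so the following is only a minimal sketch of the approach it describes, not the original answer's code. It assumes that joining the paragraph texts with newlines is acceptable and that every -- ASN1START has a matching -- ASN1STOP. The names extract_asn1_blocks, read_docx_text and extract_to_file are hypothetical helpers introduced for illustration. The pattern uses the non-greedy .*? so that each match stops at the nearest -- ASN1STOP instead of running on to the last one in the document.

```python
import re

# DOTALL lets "." match newlines; ".*?" is non-greedy so each match
# ends at the first "-- ASN1STOP" after its "-- ASN1START".
ASN1_BLOCK = re.compile(r"-- ASN1START(.*?)-- ASN1STOP", re.DOTALL)


def extract_asn1_blocks(text):
    """Return the text between each marker pair, without the markers."""
    return [block.strip() for block in ASN1_BLOCK.findall(text)]


def read_docx_text(path):
    """Join all body paragraph texts with newlines (requires python-docx)."""
    # Imported here so the regex part can be used without python-docx.
    from docx import Document
    return "\n".join(para.text for para in Document(path).paragraphs)


def extract_to_file(docx_path, out_path):
    """Hypothetical helper: write every extracted block to out_path."""
    blocks = extract_asn1_blocks(read_docx_text(docx_path))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n\n".join(blocks))
```

With the path from the question, the call might look like extract_to_file('./data/36331-f80.docx', 'asn1.txt'), where the output file name is an assumption. Note that document.paragraphs only covers body paragraphs, so any marked text sitting inside tables would need to be collected separately.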