Here is one of the XML documents I am working with:
<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
<extent>
<charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
</extent>
</span><span type="sentence">
<extent>
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
I loop over a set of XML documents to retrieve every sentence that begins with a space. I can easily catch all of the errors (leading spaces):
>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}
>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'], ending=u'.xml')  # my function to grab all XML files
>>> for docAddr in xmlAddresses:
...     parser = etree.XMLParser(encoding=u'utf-8')
...     tree = etree.parse(docAddr, parser=parser)
...     sentences = getTokenTextFeature(docAddr, tree, sentences)
...
>>> rgxLeadingSpace = re.compile(r'^"? .')  # optional opening quote, then a space
>>> for sent in sentences.keys():
...     text = sentences[sent]['sentence']
...     if rgxLeadingSpace.findall(text):
...         print text  # the second sentence is from the above XML doc
...
" It rallied on ideas the market was oversold , " a trader said .
" The result of the second year-half is expected to improve on the early part of the year , " Atria said .
" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow .
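What I need to do, once the errors are found, is go through all of the XML files that contain them and adjust their START attributes. For example, this is a sentence with a leading space from the XML document above: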
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
It should look like this:
<charseq START="207" END="310">The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
I think I have provided all the necessary code. If anyone can help me, I will create 1 million StackOverflow accounts and upvote you 1 million times! :) Thanks!
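Rather than extracting the matching sentences, I would check each sentence element against your pattern while traversing the DOM nodes. That way, when you find a match, you already hold the element object you are visiting, so you can modify its START attribute directly and then write the modified DOM to a new (or replacement) XML file.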
I don't know what getTokenTextFeature does, but here is a program that modifies the XML in the way you asked.
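A minimal sketch of this approach might look like the following. It assumes each sentence span holds its text in a direct extent/charseq child, as in the sample document. The names fix_leading_space and fix_file are illustrative, not taken from the question, and the pattern mirrors the question's regex: an optional opening quote followed by one space. The matched prefix is removed from the text and START is advanced by its length, which reproduces the 205 to 207 change in the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the prefix to strip is an optional opening quote plus one space,
# mirroring the question's pattern.
LEADING = re.compile(r'^"? ')


def fix_leading_space(root):
    """Strip the leading prefix from each sentence's charseq text and advance
    START by the number of characters removed. Returns how many elements changed."""
    changed = 0
    for span in root.iter('span'):
        if span.get('type') != 'sentence':
            continue
        # Assumption: the sentence text sits in a direct extent/charseq child.
        charseq = span.find('extent/charseq')
        if charseq is None:
            continue
        text = charseq.text or ''
        match = LEADING.match(text)
        if match is None:
            continue
        removed = match.end()  # length of the stripped prefix
        charseq.text = text[removed:]
        charseq.set('START', str(int(charseq.get('START')) + removed))
        changed += 1
    return changed


def fix_file(src_path, dst_path):
    """Parse src_path, fix it in memory, and write dst_path only if something changed."""
    tree = ET.parse(src_path)
    if fix_leading_space(tree.getroot()):
        tree.write(dst_path, encoding='utf-8', xml_declaration=True)
```

To process a whole folder, call fix_file(path, path) for every path in your list of XML files to overwrite in place, or pass a different destination to keep the originals. END is left untouched because the end of the sentence does not move.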