Editing attributes across multiple XML documents

<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>

>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}
>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'],ending=u'.xml') # my function to grab all XML files
>>> for docAddr in xmlAddresses:
...     parser = etree.XMLParser(encoding=u'utf-8')
...     tree = etree.parse(docAddr, parser=parser)
...     sentences = getTokenTextFeature(docAddr,tree,sentences)
...
>>> rgxLeadingSpace = re.compile('^\"? .')
>>> for sent in sentences.keys():
...     text = sentences[sent]['sentence']
...     if rgxLeadingSpace.findall(text):
...         print(text) # the second sentence is from the above XML doc
...
" It rallied on ideas the market was oversold , " a trader said .
" The result of the second year-half is expected to improve on the early part of the year , " Atria said .
" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow .

2 Answers

User

#1 · Edited 2024-06-01 03:00:08

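Rather than extracting the matching sentences, I would check each sentence element against your pattern while walking the DOM nodes. That way, when you find a match, you can work directly with the element object you are visiting and modify its START attribute, then write the modified DOM to a new (or replacement) XML file.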
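
The approach described above could look roughly like the sketch below. The function name strip_leading_space, the LEADING pattern (an optional opening quote followed by spaces), and the choice to remove the quote along with the spaces are all assumptions made for illustration, not something the question specifies.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: an optional opening quote followed by one or more spaces.
LEADING = re.compile(r'^("? +)(.*)', re.DOTALL)

def strip_leading_space(in_path, out_path):
    """Trim the leading prefix of each charseq and shift START to match.

    Returns the number of charseq elements that were changed.
    """
    tree = ET.parse(in_path)
    changed = 0
    # Walk the tree and edit each matching element object in place.
    for charseq in tree.iter("charseq"):
        match = LEADING.match(charseq.text or "")
        if match and charseq.get("START") is not None:
            prefix, rest = match.groups()
            charseq.set("START", str(int(charseq.get("START")) + len(prefix)))
            charseq.text = rest
            changed += 1
    # Write the modified tree to a new file (or pass in_path to overwrite).
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return changed
```

To process a whole folder, call the function once per file, for example over the paths returned by glob.glob on the XML directory.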

User

#2 · Edited 2024-06-01 03:00:08

I don't know what getTokenTextFeature does, but here is a program that modifies the XML the way you asked.

xml='''<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
</extent></span></document>
'''

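import re
import xml.etree.ElementTree as etree

root = etree.XML(xml)
# Visit every sentence-level charseq that carries a START attribute.
for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
  # Group 1: optional opening quote plus leading spaces; group 2: the rest.
  match = re.match(r'^("? +)(.*)', charseq.text)
  if match:
    space,text = match.groups()
    # Shift START by the number of characters removed.
    charseq.set('START', str(int(charseq.get('START')) + len(space)))
    charseq.text = text
# encoding='unicode' makes tostring return str rather than bytes (Python 3).
print(etree.tostring(root, encoding='unicode'))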
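
After shifting START it may be worth checking that the offsets still agree with the text. In the sample, the first sentence is 31 characters long with START="0" and END="30", which suggests END is the index of the last character (inclusive). That is an inference from the data, not something the question states, and the helper name inconsistent_offsets and the second sample element are made up for illustration.

```python
import xml.etree.ElementTree as ET

def inconsistent_offsets(root):
    """Return charseq elements whose text length disagrees with START/END."""
    bad = []
    for charseq in root.iter("charseq"):
        start = int(charseq.get("START"))
        end = int(charseq.get("END"))
        # Assumption: END is inclusive, so the length is end - start + 1.
        if len(charseq.text or "") != end - start + 1:
            bad.append(charseq)
    return bad

sample = ET.XML(
    '<document>'
    '<charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>'
    '<charseq START="5" END="9">too long for its range</charseq>'
    '</document>'
)
# Print the START values of the elements that fail the check.
print([c.get("START") for c in inconsistent_offsets(sample)])
```

Because the fix removes len(space) characters from the front and adds the same amount to START, END should not need to change under this convention.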
