Extract each XML node into a separate text file
I have an XML file like this:
<root>
  <article>
    <article_taxonomy></article_taxonomy>
    <article_place>Somewhere</article_place>
    <article_number>1</article_number>
    <article_date>2001</article_date>
    <article_body>Blah blah balh</article_body>
  </article>
  <article>
    <article_taxonomy></article_taxonomy>
    <article_place>Somewhere</article_place>
    <article_number>2</article_number>
    <article_date>2001</article_date>
    <article_body>Blah blah balh</article_body>
  </article>
  ...
  ...
  more nodes
</root>
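What I want to do is extract each node (everything from <article> to </article>, inclusive) and write it to a separate txt or xml file. I also want to keep the tags.
Is there a way to do this without regular expressions? Any suggestions?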
2 Answers
0
Try something like this:
from xml.dom import minidom
xmlfile = minidom.parse('yourfile.xml')
# for example, for 'article_body'
article_body = xmlfile.getElementsByTagName('article_body')
Or:
import xml.etree.ElementTree as ET
xmlfile = ET.parse('yourfile.xml')
root_tag = xmlfile.getroot()
for each_article in root_tag.findall('article'):
    # .text is None for an empty element such as <article_taxonomy></article_taxonomy>
    article_taxonomy = each_article.find('article_taxonomy').text
    article_place = each_article.find('article_place').text
    # etc etc
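If the goal is to keep the tags rather than pull out the text, minidom can also serialize a whole element back to markup with toxml(). A minimal sketch, assuming the document is small enough to parse in memory; the helper name article_xml_strings and the inline sample are made up for illustration and are not part of the original answer:

```python
from xml.dom import minidom


def article_xml_strings(xml_text):
    """Return the serialized markup of every <article> element, tags included."""
    doc = minidom.parseString(xml_text)
    return [node.toxml() for node in doc.getElementsByTagName('article')]


# Hypothetical sample input, shortened from the question's file
sample = (
    "<root>"
    "<article><article_number>1</article_number></article>"
    "<article><article_number>2</article_number></article>"
    "</root>"
)

for fragment in article_xml_strings(sample):
    print(fragment)
```

Each returned string could then be written to its own file. For a file on disk, minidom.parse('yourfile.xml') would take the place of parseString.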
1
Here is one way to do it using ElementTree:
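import xml.etree.ElementTree as ElementTree


def main():
    # Binary mode lets the parser honor the encoding declared in the XML itself
    with open('data.xml', 'rb') as f:
        et = ElementTree.parse(f)
    for article in et.findall('article'):
        # tostring() keeps the <article> tags; in Python 3 it returns bytes by default
        xml_string = ElementTree.tostring(article)
        # Now you can write xml_string to a new file
        # Take care to name the files sequentially


if __name__ == '__main__':
    main()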
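The file-writing step left as a comment above could be filled in roughly as follows. This is only a sketch under a few assumptions: the function name split_articles, the output directory argument, and the article_N.xml naming scheme are all hypothetical choices, and the <article> elements are assumed to be direct children of the root, as in the question.

```python
import os
import xml.etree.ElementTree as ET


def split_articles(xml_path, out_dir):
    """Write each <article> child of the root to its own file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    tree = ET.parse(xml_path)
    written = []
    # enumerate() supplies the sequential number used in each file name
    for index, article in enumerate(tree.getroot().findall('article'), start=1):
        out_path = os.path.join(out_dir, 'article_{}.xml'.format(index))
        # Wrapping the element in its own ElementTree lets write() emit it,
        # tags included, as a standalone XML document
        ET.ElementTree(article).write(out_path, encoding='utf-8', xml_declaration=True)
        written.append(out_path)
    return written
```

Calling split_articles('data.xml', 'articles') would then produce articles/article_1.xml, articles/article_2.xml, and so on. Naming the files after the article_number element instead of a counter would also work, provided those numbers are unique.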