使用Python从XML中提取文本
我有一个这样的示例xml文件:
<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>
我想提取标题标签和内容标签里的内容。
用什么方法提取数据比较好呢?是用模式匹配,还是用xml模块?或者有没有更好的方法来提取数据呢?
6 个回答
3
代码:
from xml.etree import cElementTree as ET
tree = ET.parse("test.xml")
root = tree.getroot()
for page in root.findall('page'):
print("Title: ", page.find('title').text)
print("Content: ", page.find('content').text)
输出:
Title: Chapter 1
Content: Welcome to Chapter 1
Title: Chapter 2
Content: Welcome to Chapter 2
3
你也可以试试这段代码来提取文本:
from bs4 import BeautifulSoup
import csv
data ="""<page>
<title>Chapter 1</title>
<content>Welcome to Chapter 1</content>
</page>
<page>
<title>Chapter 2</title>
<content>Welcome to Chapter 2</content>
</page>"""
soup = BeautifulSoup(data, "html.parser")
########### Title #############
required0 = soup.find_all("title")
title = []
for i in required0:
title.append(i.get_text())
########### Content #############
required0 = soup.find_all("content")
content = []
for i in required0:
content.append(i.get_text())
doc1 = list(zip(title, content))
for i in doc1:
print(i)
输出结果:
('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
26
Python里已经有一个内置的XML库,特别是叫做 ElementTree
的这个库。举个例子:
>>> from xml.etree import cElementTree as ET
>>> xmlstr = """
... <root>
... <page>
... <title>Chapter 1</title>
... <content>Welcome to Chapter 1</content>
... </page>
... <page>
... <title>Chapter 2</title>
... <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
... title = page.find('title').text
... content = page.find('content').text
... print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2