Extracting text from XML using Python

18 votes
6 answers
75134 views
Asked 2025-04-17 03:53

I have a sample XML file like this:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

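I want to extract the text inside the title and content tags.

What is a good way to extract this data: pattern matching, or an XML module? Or is there a better approach?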

6 Answers

3 votes

Code:

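# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
from xml.etree import ElementTree as ET

# test.xml must have a single root element wrapping the <page> elements
tree = ET.parse("test.xml")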
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
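
One caveat: the loop above assumes test.xml has a single root element around the page elements. The sample in the question has none, so ElementTree rejects it with a ParseError. Here is a minimal sketch of one workaround, assuming the fragment is small enough to hold in memory. The wrapper tag name `root` and the helper name `parse_fragment` are my own choices for illustration, not part of any library.

```python
from xml.etree import ElementTree as ET


def parse_fragment(fragment: str) -> ET.Element:
    """Wrap a rootless XML fragment in a synthetic <root> element and parse it."""
    # Assumption: the fragment does not start with an XML declaration (<?xml ...?>),
    # since a declaration would no longer be at the start of the wrapped string.
    return ET.fromstring("<root>" + fragment + "</root>")


fragment = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>"""

root = parse_fragment(fragment)

# Pair up the title and content text of each <page>
pages = [(p.findtext("title"), p.findtext("content")) for p in root.findall("page")]
print(pages)
```

If the file is yours to change, adding a real root element to it is probably the cleaner fix.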

3 votes

You can also try this code to extract the text:

from bs4 import BeautifulSoup

data = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

# html.parser is lenient, so the missing root element is not a problem here
soup = BeautifulSoup(data, "html.parser")

# Collect the text of every <title> and <content> tag, in document order
titles = [tag.get_text() for tag in soup.find_all("title")]
contents = [tag.get_text() for tag in soup.find_all("content")]

# Pair each title with its content by position
for pair in zip(titles, contents):
    print(pair)

Output:

('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')

26 votes

Python already ships with built-in XML support in the standard library, in particular the ElementTree module. For example:

>>> from xml.etree import ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
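
If some pages might be missing a title or content element, `page.find(...)` returns None and the `.text` lookup raises AttributeError. A small sketch of a more forgiving variant using `findtext` with a default value; the second page below deliberately omits its content element to show the fallback, and the list-of-dicts layout is only my assumption about what is convenient downstream.

```python
from xml.etree import ElementTree as ET

xmlstr = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
</page>
</root>
"""

root = ET.fromstring(xmlstr)

# findtext returns the default instead of failing when the child is absent
records = [
    {
        "title": page.findtext("title", default=""),
        "content": page.findtext("content", default=""),
    }
    for page in root.iter("page")  # iter() also finds <page> elements nested deeper
]
print(records)
```

This is also the main argument against pattern matching here: a parser copes with missing or reordered elements, while a regex written for one layout tends to break on the next one.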
