Scraping XML with Python's BeautifulSoup module: need specific tags in the tree

Asked 2025-04-17 21:33

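I'm writing a Python script that should extract the "duration" and "distance" tags located directly under the "leg" tag. The problem is that the "step" tag, which is itself a child of "leg", also has "duration" and "distance" children, and those are returned as well when I scrape the data. Here is the relevant XML: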

<DirectionsResponse>
    <route>
        <leg>
            <step>...</step>
            <step>
                <start_location>
                    <lat>38.9096855</lat>
                    <lng>-77.0435397</lng>
                </start_location>
                <duration>
                    <text>1 min</text>
                </duration>
                <distance>
                    <text>39 ft</text>
                </distance>
            </step>
            <duration>
                <text>2 hours 19 mins</text>
            </duration>
            <distance>
                <text>7.1 mi</text>
            </distance>
        </leg>
    </route>
</DirectionsResponse>

This is the Python script I'm currently using:

import urllib
from BeautifulSoup import BeautifulSoup

url = 'https://www.somexmlgenerator.com/directions/xml?somejscript'
res = urllib.urlopen(url)
html = res.read()

soup = BeautifulSoup(html)
soup.prettify()
leg = soup.findAll('leg')

for eachleg in leg:
    another_duration = eachleg('duration')
    print eachleg

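As I mentioned, I've been stuck on this for a while. I also tried the lxml library, but because the XML is dynamically generated, I had trouble scraping it that way. Right now I'm scraping the XML as if it were HTML, but I'm open to other suggestions, since I'm still a beginner!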

Tags: XML, lxml, dynamically generated data, data extraction, beautifulsoup, scraping, beginner, tag parsing

1 Answer

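With BeautifulSoup (version 4, i.e. bs4, is recommended), pass recursive=False to the search calls so that only direct children are matched. Otherwise the duration and distance nested inside step are picked up instead of the ones that belong to leg: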

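from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed
soup = BeautifulSoup(..., 'xml')

for leg in soup.route.find_all('leg', recursive=False):
    # leg.duration would return the first <duration> in document order,
    # i.e. the one inside <step>; recursive=False looks at direct children only
    duration = leg.find('duration', recursive=False).text.strip()
    distance = leg.find('distance', recursive=False).text.strip()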
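
One thing to watch: leg.duration is shorthand for leg.find('duration'), which searches all descendants and returns the first match in document order. With the XML from the question, that is the duration inside step. Here is a minimal sketch that shows the difference, assuming a trimmed copy of the sample XML and the built-in html.parser so that it runs without lxml (the 'xml' parser should behave the same on this sample). The name leg_totals is a hypothetical helper, not part of bs4:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the XML from the question (assumption: same structure)
SAMPLE = """
<DirectionsResponse>
  <route>
    <leg>
      <step>
        <duration><text>1 min</text></duration>
        <distance><text>39 ft</text></distance>
      </step>
      <duration><text>2 hours 19 mins</text></duration>
      <distance><text>7.1 mi</text></distance>
    </leg>
  </route>
</DirectionsResponse>
"""


def leg_totals(markup, parser="html.parser"):
    """Return (duration, distance) text per <leg>, ignoring <step> children."""
    soup = BeautifulSoup(markup, parser)
    totals = []
    for leg in soup.find_all("leg"):
        # recursive=False limits the search to direct children of <leg>
        duration = leg.find("duration", recursive=False)
        distance = leg.find("distance", recursive=False)
        totals.append(
            (duration.get_text(strip=True), distance.get_text(strip=True))
        )
    return totals


soup = BeautifulSoup(SAMPLE, "html.parser")
# Attribute access searches all descendants and takes the first hit
nested_first = soup.leg.duration.get_text(strip=True)

print(nested_first)        # the value nested inside <step>
print(leg_totals(SAMPLE))  # the values that belong to <leg> itself
```

The helper assumes every leg has both a direct duration and a direct distance child. If either may be missing, check for None before calling get_text.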

Alternatively, you can use a CSS selector:

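for leg in soup.select('route > leg'):
    # Look only at direct children so the <step> values are skipped
    duration = leg.find('duration', recursive=False).text.strip()
    distance = leg.find('distance', recursive=False).text.strip()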
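
The child combinator can also be carried all the way down to the text element, which avoids the per-leg lookup. A rough sketch, assuming a trimmed copy of the question's XML, the built-in html.parser, and a bs4 version whose select() is backed by soupsieve (4.7 or later):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the XML from the question (assumption: same structure)
SAMPLE = """
<DirectionsResponse>
  <route>
    <leg>
      <step>
        <duration><text>1 min</text></duration>
        <distance><text>39 ft</text></distance>
      </step>
      <duration><text>2 hours 19 mins</text></duration>
      <distance><text>7.1 mi</text></distance>
    </leg>
  </route>
</DirectionsResponse>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# "a > b" matches b only when it is a direct child of a, so the
# <duration>/<distance> under <step> never match "leg > duration"
durations = [
    t.get_text(strip=True) for t in soup.select("route > leg > duration > text")
]
distances = [
    t.get_text(strip=True) for t in soup.select("route > leg > distance > text")
]

print(durations)
print(distances)
```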

With lxml, an XPath expression is all you need:

durations = root.xpath('/DirectionsResponse/route/leg/duration/text/text()')
distances = root.xpath('/DirectionsResponse/route/leg/distance/text/text()')
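
The snippet above leaves root undefined. One way to obtain it, sketched here with a trimmed copy of the question's XML standing in for the downloaded response (an assumption), is etree.fromstring:

```python
from lxml import etree

# Trimmed copy of the XML from the question (assumption: same structure);
# in the real script this would be the bytes read from the HTTP response
SAMPLE = b"""
<DirectionsResponse>
  <route>
    <leg>
      <step>
        <duration><text>1 min</text></duration>
        <distance><text>39 ft</text></distance>
      </step>
      <duration><text>2 hours 19 mins</text></duration>
      <distance><text>7.1 mi</text></distance>
    </leg>
  </route>
</DirectionsResponse>
"""

root = etree.fromstring(SAMPLE)

# Each "/" step descends exactly one level, so only the <duration> and
# <distance> that are direct children of <leg> are selected
durations = root.xpath("/DirectionsResponse/route/leg/duration/text/text()")
distances = root.xpath("/DirectionsResponse/route/leg/distance/text/text()")

for duration, distance in zip(durations, distances):
    print(duration, distance)
```

Because the path spells out every level from the document root, the values under step cannot match. A path with "//" would bring them back.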

Answered 2025-04-17 by Python大师
