BeautifulSoup - scraping differences between the lxml and html5lib parsers

7 votes

2 answers

14924 views

Asked 2025-04-18 00:12

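I am using BeautifulSoup 4 with Python 2.7. I want to extract some specific elements (quantities, example below) from a website. For some reason, the lxml parser only lets me extract the first three of the elements I want from the page. I would like to try the html5lib parser to see whether it can extract all of them.

The page contains multiple items, each with a price and a quantity. The part of the markup that holds the information for each item looks like this: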

<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

Consider the following three cases:

Case 1 - Data:

#! /usr/bin/python
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""                
soup = BeautifulSoup(data)
print soup.td.span.text

Output:

453 grams

Case 2 - LXML:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "lxml")
print soup.find('td', {'class': 'size-price'}).span.text

Output:

453 grams

Case 3 - HTML5LIB:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "html5lib")
print soup.find('td', {'class': 'size-price'}).span.text

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Dom\Python-Code\src\Testing-Code.py", line 6, in <module>
    print soup.find('td', {'class': 'size-price'}).span.text
AttributeError: 'NoneType' object has no attribute 'span'

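How can I adjust my code so that I can extract the information I want with the html5lib parser? When I simply print soup to the console I can see all the information I want, so I think it should be able to give me what I need. That is not the case with the lxml parser, so I am also curious why the lxml parser does not seem to extract all the quantities when I use: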

print [td.span.text for td in soup.find_all('td', {'class': 'size-price'})]

Tags: error-handling, lxml, parser, data-parsing, web-scraping, information-extraction, beautifulsoup, html5lib

2 Answers

-1

Try the following code:

    from bs4 import BeautifulSoup
    data = """
    <td class="size-price last first" colspan="4">
                <span>453 grams </span>
        <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                </span>
            </td>"""                
    soup = BeautifulSoup(data)
    text = soup.get_text(strip=True)
    print text
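
A note on the AttributeError in the question: find returns None when no element matches, so chaining .span onto the result fails. The sketch below is one hedged way to check how each parser sees the same markup and to skip cells that have no span. It is not the asker's code: SAMPLE_HTML, PARSERS and quantities_by_parser are names made up for illustration, the second table row is invented test data, and any parser library that is not installed is skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative markup: the first cell is from the question, the second is invented.
SAMPLE_HTML = """
<table>
  <tr><td class="size-price last first" colspan="4">
    <span>453 grams </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span></span>
  </td></tr>
  <tr><td class="size-price" colspan="4">
    <span>1 kg </span>
    <span> <span class="price">$900.00</span></span>
  </td></tr>
</table>
"""

# Hypothetical list of parser names to compare.
PARSERS = ("html.parser", "lxml", "html5lib")


def quantities_by_parser(html, parsers=PARSERS):
    """Map each available parser name to the quantities it finds."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # parser library not installed; skip it
        quantities = []
        for td in soup.find_all("td", class_="size-price"):
            span = td.find("span")
            if span is not None:  # avoid AttributeError on cells without a span
                quantities.append(span.get_text(strip=True))
        results[name] = quantities
    return results


if __name__ == "__main__":
    for name, found in sorted(quantities_by_parser(SAMPLE_HTML).items()):
        print("%s: %r" % (name, found))
```

If the lists differ between parsers on the real page, the parsers are building different trees from the same markup, which tends to happen when the HTML is malformed.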

Answered 2025-04-18 by Python大师


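from lxml import etree

html = 'your html'  # placeholder: the page source as a string
tree = etree.HTML(html)
# @class="..." is an exact match on the full class string
tds = tree.xpath('.//td[@class="size-price last first"]')
for td in tds:
    price = td.xpath('.//span[@class="price"]')[0].text
    strike = td.xpath('.//span[@class="strike"]')[0].text
    spans = td.xpath('.//span')
    # i.text is None for a span with no leading text, so check it first
    quantity = [i.text for i in spans if i.text and 'grams' in i.text][0].strip(' ')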
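
The XPath above compares the whole class attribute, so a cell marked only size-price would be skipped. A minimal sketch of token-based class matching with lxml follows. It assumes the other cells on the page share the size-price class, which the question suggests but does not show. SAMPLE_HTML, SIZE_PRICE_CELLS, first_text and extract_items are hypothetical names, and the second row is invented data.

```python
from lxml import etree

# Illustrative markup: the first cell is from the question, the second is invented.
SAMPLE_HTML = """
<table>
  <tr><td class="size-price last first" colspan="4">
    <span>453 grams </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span></span>
  </td></tr>
  <tr><td class="size-price" colspan="4">
    <span>1 kg </span>
    <span> <span class="price">$900.00</span></span>
  </td></tr>
</table>
"""

# Matches any <td> whose class attribute contains the token "size-price".
SIZE_PRICE_CELLS = (
    '//td[contains(concat(" ", normalize-space(@class), " "), " size-price ")]'
)


def first_text(node, path):
    """Return the stripped text of the first match for path, or None."""
    matches = node.xpath(path)
    if not matches:
        return None
    return "".join(matches[0].itertext()).strip()


def extract_items(html):
    """Collect quantity, price and strike price for every size-price cell."""
    tree = etree.HTML(html)
    items = []
    for td in tree.xpath(SIZE_PRICE_CELLS):
        items.append({
            "quantity": first_text(td, "./span[1]"),
            "price": first_text(td, './/span[@class="price"]'),
            "strike": first_text(td, './/span[@class="strike"]'),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item)
```

Returning None for a missing span, rather than indexing [0] directly, keeps one incomplete cell from aborting the whole loop.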

Answered 2025-04-18 by Python大师
