Finding a specific class with BeautifulSoup

<html>
<body>
<div class=" status-icon-row for-sale-row home-summary-row">
</div>
<div class=" home-summary-row">
<span class=""> $1,342,144 </span>
</div>
</body>
</html>

from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")
results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)

3 Answers

User

Answer 1 · edited 2024-06-12 08:38:57

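Your HTML is not well-formed, and in that case choosing the right parser is crucial. BeautifulSoup currently offers 3 HTML parsers, which work differently and handle broken HTML differently:

html.parser (built in, no extra module needed)
lxml (the fastest, requires lxml to be installed)
html5lib (the most lenient, requires html5lib to be installed)

The "Differences between parsers" documentation page describes these differences in more detail. To demonstrate the difference in your case: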

>>> from bs4 import BeautifulSoup
>>> import requests
>>> 
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>> 
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3

As you can see, in your case both html.parser and lxml do the job, but html5lib does not.
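The comparison above depends on the live page, which may have changed since. As a rough offline sketch, the same check can be run against the sample markup from the question. The helper name count_matches is made up for illustration, and parsers that are not installed are reported as None rather than assumed to be present. Results on this small, well-formed sample need not match the counts shown for the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Sample markup copied from the question (not the live Zillow page).
SAMPLE_HTML = (
    '<html><body>'
    '<div class=" status-icon-row for-sale-row home-summary-row"></div>'
    '<div class=" home-summary-row"><span class=""> $1,342,144 </span></div>'
    '</body></html>'
)


def count_matches(html, parser):
    """Count div elements carrying the home-summary-row class.

    Returns None when the requested parser is not installed.
    (Hypothetical helper, not part of BeautifulSoup.)
    """
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    return len(soup.find_all("div", attrs={"class": "home-summary-row"}))


# Try each parser named in the answer; missing ones print None.
for name in ("html.parser", "lxml", "html5lib"):
    print(name, count_matches(SAMPLE_HTML, name))
```

Running this against the real response.content instead of SAMPLE_HTML is a quick way to see which parser copes with the page as it is served today.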

User

Answer 2 · edited 2024-06-12 08:38:57

import requests
from bs4 import BeautifulSoup

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

g_data = soup.find_all("div", {"class": "home-summary-row"})

# Print the text of the second matching div (print is a function in Python 3)
print(g_data[1].text)

# for item in g_data:
#     print(item("span")[0].text)
#     print('\n')

I worked this out too, but it looks like someone beat me to it.

Posting it anyway.
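Indexing g_data[1] assumes the price always sits in the second matching div. A slightly more defensive sketch, using the sample markup from the question rather than the live page, walks every matching row and returns the first non-empty span text. The function name extract_price is hypothetical, and html.parser is used here only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (not the live Zillow page).
SAMPLE_HTML = (
    '<html><body>'
    '<div class=" status-icon-row for-sale-row home-summary-row"></div>'
    '<div class=" home-summary-row"><span class=""> $1,342,144 </span></div>'
    '</body></html>'
)


def extract_price(html):
    """Return the first non-empty span text inside a home-summary-row div, or None.

    (Hypothetical helper written for this example.)
    """
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("div", class_="home-summary-row"):
        span = row.find("span")
        if span is None:
            continue  # this row has no span, e.g. the status-icon row
        text = span.get_text(strip=True)
        if text:
            return text
    return None


print(extract_price(SAMPLE_HTML))
```

Returning None instead of raising IndexError makes it easier to notice when the page layout changes and no price row is found.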

User

Answer 3 · edited 2024-06-12 08:38:57

According to the W3.org Validator, the HTML has many problems, such as stray end tags and tags split across multiple lines. For example:

<a 
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

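Markup like this can make it harder for BeautifulSoup to parse the HTML.

You may want to try cleaning up the HTML first, for example by removing the line breaks and trailing whitespace at the end of each line. BeautifulSoup can also clean up the HTML tree for you: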

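from bs4 import BeautifulSoup  # bs4 replaces the old BeautifulSoup 3 package

# bad_html is the malformed markup string to clean up
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()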
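As a small illustration of that cleanup step, the sketch below feeds a made-up fragment through prettify() and re-parses the result. The anchor tag is taken from the example above; the surrounding unclosed div, the link text, and the stray closing span are assumptions added to mimic the kinds of problems the validator reported.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed fragment: a tag split across lines, an unclosed
# <div>, and a stray </span> with no matching opening tag.
bad_html = (
    '<div><a \n'
    'href="http://www.zillow.com/danville-ca-94526/sold/"  '
    'title="Recent home sales" class=""  >'
    'Recent sales</a></span>'
)

# Parse the broken markup, then serialize the repaired tree.
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()

# Re-parse the cleaned markup and look up the link again.
link = BeautifulSoup(good_html, "html.parser").find("a")

print(good_html)
print(link["href"])
```

How the tree is repaired depends on the parser, so lxml or html5lib may produce a different good_html from the same input.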
