Python web scraping of a messy website with Beautiful Soup

数据工程师

Asked 2025-04-17 14:28

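I want to scrape three data points from this website: %verified, the FAR value, and the POD value. I'd like to do this with BeautifulSoup, but I'm not familiar with how the site is structured, so I don't know where these elements actually live in the page.

Is there a simple way to do this?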

Tags: data extraction, web scraping, HTML parsing, data analysis, Beautiful Soup, web crawler, website structure, data points

3 Answers

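As That1Guy said, you need to analyze the structure of the source page. In this case you're in luck: the numbers you're after are highlighted in red using <span> tags.

This code will do it: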

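>>> from urllib.request import urlopen  # Python 3 replacement for urllib2
>>> import lxml.html.soupparser  # the submodule must be imported explicitly
>>> url = ... # put your URL here
>>> html = urlopen(url)
>>> soup = lxml.html.soupparser.fromstring(html)
>>> elements = soup.xpath('//th/span')  # every <span> directly inside a <th>
>>> print(float(elements[0].text)) # FAR
0.67
>>> print(float(elements[1].text)) # POD
0.58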

Note that lxml.html.soupparser is essentially the same as the BeautifulSoup parser (I don't have BeautifulSoup at hand right now).
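
For readers who would rather stay with BeautifulSoup itself, the same selection can be sketched with bs4's CSS selectors. The HTML below is a made-up stand-in for the real page (only the <span>-inside-<th> nesting is taken from this answer), and the variable names are illustrative, so treat this as a sketch rather than a drop-in solution.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the real page's layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>FAR: <span style="color: red;">0.67</span></th></tr>
  <tr><th>POD: <span style="color: red;">0.58</span></th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'th > span' matches <span> elements that are direct children of a <th>,
# which mirrors the XPath expression '//th/span'.
spans = soup.select("th > span")
far = float(spans[0].get_text())
pod = float(spans[1].get_text())
print(far, pod)
```

On the live page you would pass the downloaded HTML instead of SAMPLE_HTML and confirm that the first two matches really are FAR and POD.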

Answered 2025-04-17 by Python大师


If you haven't already, install Firebug. It's a Firefox add-on that lets you inspect a page's HTML source.

After that, you can use urllib and BeautifulSoup to fetch and parse the HTML. Here's a simple example:

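from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup  # BeautifulSoup 4 package

url = 'http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype[]=TO&hail=1.00&lsrbuffer=15&ltype[]=T&wind=58'
fp = urlopen(url).read()  # raw HTML bytes of the page
soup = BeautifulSoup(fp, 'html.parser')

# Dump the parsed document so you can look through its structure
print(soup)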

From here, the links I provided should help you get started on retrieving the page elements you're interested in.
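
One way to work out where the values sit once the page is parsed is to list every candidate element together with its position. This is only a sketch: `list_elements` and the sample markup are hypothetical names and data, not something taken from the page being scraped.

```python
from bs4 import BeautifulSoup

def list_elements(html, tag):
    """Return (index, text) pairs for every <tag> element in the document."""
    soup = BeautifulSoup(html, "html.parser")
    return [(i, el.get_text(strip=True)) for i, el in enumerate(soup.find_all(tag))]

# Hypothetical markup standing in for the downloaded page.
sample = "<table><tr><th>Verified: 45%</th><th>FAR: <span>0.67</span></th></tr></table>"

for index, text in list_elements(sample, "th"):
    print(index, text)
```

Running the same helper over the real HTML with "th" or "span" shows which index holds each value, which is less error-prone than counting tags by eye.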

Answered 2025-04-17 by Python大师


I ended up solving this myself. I used an approach similar to isedev's, but I'm still hoping to find a better way to get the "verified" figure:

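from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

def main():
    # WFO codes, one per line, skipping blank lines.
    # Not used yet: the URL below still hard-codes wfo=ABQ.
    with open(r'C:\Python27\wfo.txt') as f:
        wfo = [line.strip() for line in f if line.strip()]
    soup = BeautifulSoup(urlopen('http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype%5B%5D=TO&hail=1.00&lsrbuffer=15&ltype%5B%5D=T&wind=58').read())
    elements = soup.find_all("span")
    find_verify = soup.find_all('th')

    # These positions depend on the page's current layout.
    far = float(elements[1].text)
    pod = float(elements[2].text)
    verified = find_verify[13].text[:-1]  # drop the last character (presumably the '%' sign)
    return far, pod, verified

if __name__ == '__main__':
    print(main())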
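
A possibly sturdier way to get the "verified" figure is to look the cell up by its label instead of by position. The sketch below assumes the header cell contains the word "Verified" followed by a percentage; that wording, the sample markup, and the helper name `find_percent_by_label` are all assumptions for illustration and have not been checked against the real page.

```python
import re
from bs4 import BeautifulSoup

def find_percent_by_label(soup, label):
    """Return the first percentage found in a <th> whose text contains label."""
    for th in soup.find_all("th"):
        text = th.get_text(" ", strip=True)
        if label.lower() in text.lower():
            match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
            if match:
                return float(match.group(1))
    return None  # label not found, or no percentage in that cell

# Hypothetical markup; the real cell text and layout are assumptions.
sample = "<table><tr><th>Warnings</th><th>% Verified: 45%</th></tr></table>"
soup = BeautifulSoup(sample, "html.parser")
verified = find_percent_by_label(soup, "verified")
print(verified)
```

Unlike find_verify[13], this keeps working if cells are added or reordered, as long as the label text stays the same.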

Answered 2025-04-17 by Python大师
