Web scraping with Python and BeautifulSoup: how to store the output

0 votes

2 answers

513 views

Asked 2025-04-18 07:35

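I'm trying to use web scraping to get temperature and precipitation data from www.wunderground.com (they do have an API, but for my project I have to use web scraping).

My problem is that I don't know how to store the scraped data.

Here is a sample of my code: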

import urllib2  
from bs4 import BeautifulSoup
url = "http://www.wunderground.com/history/airport/KBUF/2014/5/25/DailyHistory.html"
soup = BeautifulSoup(urllib2.urlopen(url).read())

#Mean Temperature Values
mean_temp_row = soup.findAll('table')[0].findAll('tr')[2]
for tds in mean_temp_row.findAll('td'):
    print tds.text

The output I get is:

Mean Temperature

15 °C


16 °C

How can I get a result like this instead: station = {"Temp_Mean":[15 , 16]}

Tags: data storage, data extraction, web scraping, web crawler, beautifulsoup, temperature data, precipitation data

2 Answers

After considering TurpIF's answer, here is my code:

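import urllib2
from bs4 import BeautifulSoup

def collect_data(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Fetch the rows of the first table once and reuse them below.
    rows = soup.findAll('table')[0].findAll('tr')
    # Mean temperature is the third row; the number is in its second cell.
    temp = rows[2].findAll('td')[1].text.split()[0].encode('utf8')
    # The precipitation row has no fixed position, so look it up by its label.
    preci_line = None
    for num, row in enumerate(rows):
        if "Precipitation" in row.text:
            preci_line = num  # keeps the last matching row
    if preci_line is None:
        raise ValueError("no Precipitation row found")
    preci = rows[preci_line].findAll('td')[1].text.split()[0].encode('utf8')
    return temp, preci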

So the call looks like this:

url = "http://www.wunderground.com/history/airport/KBUF/2014/5/25/DailyHistory.html"
temp, preci = collect_data(url)
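
To actually store the values once they are extracted, one option is to append each reading to a dict of lists and, if needed, dump it as CSV. The following is only a sketch in Python 3 using the standard library. The helper names `add_reading` and `to_csv`, the `"Precipitation"` key, and the sample strings are assumptions made for illustration; the strings stand in for what `collect_data(url)` would return and are not real measurements.

```python
import csv
import io

def add_reading(station, temp, preci):
    """Append one day's values to the per-measurement lists in `station`."""
    station.setdefault("Temp_Mean", []).append(float(temp))
    station.setdefault("Precipitation", []).append(float(preci))
    return station

def to_csv(station):
    """Render the dict of lists as CSV text, one column per measurement."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    keys = sorted(station)
    writer.writerow(keys)
    # zip(*columns) turns the column lists into rows, one row per day.
    writer.writerows(zip(*(station[k] for k in keys)))
    return out.getvalue()

station = {}
# Placeholder strings standing in for values returned by collect_data(url).
for temp, preci in [("15", "0.0"), ("16", "2.5")]:
    add_reading(station, temp, preci)

print(station)
print(to_csv(station))
```

Keeping one list per measurement matches the `station = {"Temp_Mean": [...]}` shape asked for in the question, and the CSV step is optional.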

Answered 2025-04-18 by Python大师


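Is the output format always the same? If so, the name of the measurement is in the first cell (td) of the row. It is followed by an empty cell, then the minimum value, then two more empty cells, and finally the maximum value.

So you can do this: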

def celsius2float(celsius):
    # u'\xb0' is the degree sign; the cell text looks like u'15 \xb0C'.
    return float(celsius.split(u'\xb0')[0].strip())

# mean_temp_row is the row selected in the question's code.
cells = mean_temp_row.findAll('td')
name = cells[0].text
min_temp = celsius2float(cells[2].text)
max_temp = celsius2float(cells[5].text)

# Then you can do whatever you want with this stuff:
station = {name: [min_temp, max_temp]}
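
If the number of empty cells turns out to vary, fixed indices such as `cells[2]` and `cells[5]` become fragile. A possible alternative, sketched here in Python 3 with only the standard library, is to pull the first number out of every non-empty cell with a regular expression. The function name `row_to_entry` and the sample cell texts are assumptions for illustration; the sample is modelled on the output shown in the question and the code works on plain strings, so in practice you would pass it `[td.text for td in cells]`.

```python
import re

# Matches an optionally signed integer or decimal, e.g. "15" or "-3.5".
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def row_to_entry(cell_texts):
    """Turn the cell texts of one table row into (name, [values])."""
    name = cell_texts[0].strip()
    values = []
    for text in cell_texts[1:]:
        match = NUMBER.search(text)
        if match:  # cells without a number (e.g. empty ones) are skipped
            values.append(float(match.group()))
    return name, values

# Sample cell texts modelled on the output in the question.
cells = ["Mean Temperature", "", "15 \u00b0C", "", "", "16 \u00b0C"]
name, values = row_to_entry(cells)
station = {name: values}
print(station)
```

Because empty cells are skipped rather than counted, the same function should also handle rows with a different number of value cells.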

Answered 2025-04-18 by Python大师

