Web scraping with BeautifulSoup or LXML

Asked 2025-04-16 14:46 by 数据工程师

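I have watched a few webinars and now need help implementing this. I have been using lxml.html, but Yahoo recently changed the structure of their site.

The target page is:

http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true

Using the inspector in Chrome, I can see that the data is in

 //*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table

followed by some more markup.

How do I extract this data into a list?
How do I change the stock from "LLY" to "Msft"?
How do I switch between dates and get the data for all months?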

4 Answers

Here is a simple example that extracts all the data from the stock tables:

import urllib.request
import lxml.html

# Python 3: urlopen lives in urllib.request
html = urllib.request.urlopen('http://finance.yahoo.com/q/op?s=lly&m=2014-11-15').read()
doc = lxml.html.fromstring(html)
# scrape figures from each stock table
for table in doc.xpath('//table[@class="details-table quote-table Fz-m"]'):
    rows = []
    for tr in table.xpath('./tbody/tr'):
        # one list of stripped cell texts per table row
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(row)
    print(rows)
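
One thing worth checking if the loop above prints empty lists: Chrome's inspector shows a tbody element even when the raw HTML has none, and as far as I know lxml's HTML parser does not add one. An XPath copied from the inspector, or one that goes through `./tbody/tr`, can then match nothing. Here is a minimal offline sketch. The HTML snippet is made up and `table_rows` is a hypothetical helper, not something from lxml:

```python
import lxml.html

# Made-up sample: note there is no <tbody> in the source markup.
SAMPLE_HTML = """
<html><body>
<table class="quotes">
  <tr><th>Strike</th><th>Bid</th></tr>
  <tr><td>17.50</td><td> 17.05 </td></tr>
  <tr><td>20.00</td><td> 14.55 </td></tr>
</table>
</body></html>
"""


def table_rows(doc, table_xpath):
    """Return the td texts of every row in the tables matched by table_xpath."""
    rows = []
    for table in doc.xpath(table_xpath):
        # './/tr' finds rows whether or not a tbody sits in between
        for tr in table.xpath('.//tr'):
            cells = [td.text_content().strip() for td in tr.xpath('./td')]
            if cells:  # skip header rows, which only contain th cells
                rows.append(cells)
    return rows


doc = lxml.html.fromstring(SAMPLE_HTML)
rows = table_rows(doc, '//table[@class="quotes"]')
tbody_rows = doc.xpath('//table/tbody/tr')  # the inspector-style path
print(rows)
print(len(tbody_rows))
```

Using `.//tr` also picks up rows of nested tables, so on a page with tables inside tables you may want a stricter path.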

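If you want to extract data for a different stock or date, you need to change the URL. For example, here is the link for Microsoft (Msft) for the previous day: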

http://finance.yahoo.com/q/op?s=msft&m=2014-11-14
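
To avoid editing the URL by hand, the two parts that vary can be filled in from a symbol and a date. A small sketch, assuming the old `q/op?s=...&m=...` URL pattern used in this answer is the one you are targeting; `old_options_url` is just an illustrative name:

```python
from datetime import date
from urllib.parse import urlencode

BASE = 'http://finance.yahoo.com/q/op'


def old_options_url(symbol, expiry):
    """Build the old-style options URL for a ticker and a datetime.date."""
    # s = ticker in lower case, m = date as YYYY-MM-DD, as in the links above
    query = urlencode({'s': symbol.lower(), 'm': expiry.isoformat()})
    return f'{BASE}?{query}'


# every combination of two tickers and two dates
urls = [
    old_options_url(symbol, day)
    for symbol in ('LLY', 'MSFT')
    for day in (date(2014, 11, 14), date(2014, 11, 15))
]
for url in urls:
    print(url)
```

Each URL can then be passed to the scraping loop in place of the hard-coded one.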

Answered 2025-04-16 by Python大师

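I know you said you can't use lxml.html. I still want to show you how to use it, because it is a really nice library. So the code below uses lxml.html; I no longer use BeautifulSoup, which in my experience was unmaintained, slow, and had an awkward API.

The following code parses the page and writes the results to a CSV file.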

import csv
import lxml.html

doc = lxml.html.parse('http://finance.yahoo.com/q/os?s=lly&m=2011-04-15')
# find the first table containing any tr with a td with class yfnc_tabledata1
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]

# Python 3: open in text mode with newline='' so the csv module controls line endings
with open('results.csv', 'w', newline='') as f:
    cf = csv.writer(f)
    # find all trs inside that table:
    for tr in table.xpath('./tr'):
        # add the text of all tds inside each tr to a list
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        # write the list to the csv file:
        cf.writerow(row)

That's it! lxml.html is simple and pleasant to use. Too bad you can't use it.

Here are a few lines from the generated results.csv file:

LLY110416C00017500,N/A,0.00,17.05,18.45,0,0,17.50,LLY110416P00017500,0.01,0.00,N/A,0.03,0,182
LLY110416C00020000,15.70,0.00,14.55,15.85,0,0,20.00,LLY110416P00020000,0.06,0.00,N/A,0.03,0,439
LLY110416C00022500,N/A,0.00,12.15,12.80,0,0,22.50,LLY110416P00022500,0.01,0.00,N/A,0.03,2,50
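
The first and ninth columns look like standard OCC-style option symbols: ticker, expiry as YYMMDD, C or P, then the strike multiplied by 1000 and padded to eight digits. That reading is my assumption, but it agrees with the strike column (17.50, 20.00, 22.50) in the rows above. A sketch that unpacks such a symbol; `parse_option_symbol` is a hypothetical helper:

```python
import re
from datetime import date, datetime

# ticker letters, 6-digit expiry, C or P, 8-digit strike (in thousandths)
_SYMBOL_RE = re.compile(r'^([A-Z]+)(\d{6})([CP])(\d{8})$')


def parse_option_symbol(symbol):
    """Split e.g. 'LLY110416C00017500' into (ticker, expiry, kind, strike)."""
    match = _SYMBOL_RE.match(symbol)
    if match is None:
        raise ValueError(f'not an option symbol: {symbol!r}')
    root, yymmdd, kind, strike = match.groups()
    expiry = datetime.strptime(yymmdd, '%y%m%d').date()
    return root, expiry, 'call' if kind == 'C' else 'put', int(strike) / 1000


print(parse_option_symbol('LLY110416C00017500'))
print(parse_option_symbol('LLY110416P00022500'))
```

This makes it possible to sort or filter the rows by expiry and strike without relying on column positions.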

Answered 2025-04-16 by Python大师

This answer builds on @hoju's:

import calendar
from datetime import datetime

import lxml.html

exDate  = "2014-11-22"
symbol  = "LLY"
dt      = datetime.strptime(exDate, '%Y-%m-%d')
# ym holds the expiry date as a Unix timestamp (seconds since 1970, UTC)
ym      = calendar.timegm(dt.utctimetuple())

url     = 'http://finance.yahoo.com/q/op?s=%s&date=%s' % (symbol, ym)
doc     = lxml.html.parse(url)
# despite the name, this is the list of tr elements inside the matching tables
table   = doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr')

rows    = []
for tr in table:
    # strip whitespace and thousands separators from each cell
    d = [td.text_content().strip().replace(',', '') for td in tr.xpath('./td')]
    rows.append(d)

print(rows)
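
The `date=` value in these URLs is a Unix timestamp for midnight UTC on the expiry date, which is what `calendar.timegm` produces above. To walk through several expiries, a sketch along these lines might help. It assumes the newer `quote/<symbol>/options` URL form from the question, and the second expiry date in the list is made up, since the real ones would have to be read from the page's date selector:

```python
import calendar
from datetime import date, datetime, timezone


def expiry_to_epoch(expiry):
    """Midnight UTC of a datetime.date as a Unix timestamp."""
    return calendar.timegm(expiry.timetuple())


def epoch_to_expiry(epoch):
    """Inverse of expiry_to_epoch: Unix timestamp back to a datetime.date."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).date()


def options_url(symbol, expiry):
    """URL in the form shown in the question (assumed to be still valid)."""
    return ('http://finance.yahoo.com/quote/%s/options?date=%d&straddle=true'
            % (symbol.upper(), expiry_to_epoch(expiry)))


# 2016-07-29 is the date from the question's URL; 2016-08-05 is illustrative
expiries = [date(2016, 7, 29), date(2016, 8, 5)]
for symbol in ('IBM', 'MSFT'):
    for expiry in expiries:
        print(options_url(symbol, expiry))
```

Decoding the question's own value with `epoch_to_expiry(1469750400)` gives 2016-07-29, which is a quick way to check that a timestamp points at the expiry you expect.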

Answered 2025-04-16 by Python大师
