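<p>Consider a web-scraping approach: parse the page with Python's <a href="http://lxml.de/" rel="nofollow">lxml</a> module (its <code>etree.HTML()</code> parser), extract the HTML table data, and then migrate it to a pandas DataFrame. There are more automated options, such as <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html" rel="nofollow">pandas.read_html()</a>, but this approach gives you more control over nuances in the HTML content, such as the <em>Feb. 4 colspan row</em>. The code below uses an XPath expression that selects each <code>&lt;td&gt;</code> by its position in the table row, using brackets <code>[]</code>:</p>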
<pre><code>import requests
import pandas as pd
from lxml import etree
# READ IN AND PARSE WEB DATA
url = "https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices"
rq = requests.get(url)
htmlpage = etree.HTML(rq.content)
# INITIALIZE LISTS
dates = []
openstock = []
highstock = []
lowstock = []
closestock = []
volume = []
adjclose = []
# ITERATE THROUGH SEVEN COLUMNS OF TABLE
for i in range(1,8):
    htmltable = htmlpage.xpath("//tr[td/@class='yfnc_tabledata1']/td[{}]".format(i))
    # APPEND COLUMN DATA TO CORRESPONDING LIST
    for row in htmltable:
        if i == 1: dates.append(row.text)
        if i == 2: openstock.append(row.text)
        if i == 3: highstock.append(row.text)
        if i == 4: lowstock.append(row.text)
        if i == 5: closestock.append(row.text)
        if i == 6: volume.append(row.text)
        if i == 7: adjclose.append(row.text)
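# CLEAN UP COLSPAN VALUE (AT FEB. 4)
# THE COLSPAN ROW LEAVES ONE EXTRA ENTRY IN dates AND openstock;
# INDEX 7 IS SPECIFIC TO THE PAGE AS SCRAPED HERE
dates = [d for d in dates if len(d.strip()) > 3]
del dates[7]
del openstock[7]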
# MIGRATE LISTS TO DATA FRAME
df = pd.DataFrame({'Dates':dates,
                   'Open':openstock,
                   'High':highstock,
                   'Low':lowstock,
                   'Close':closestock,
                   'Volume':volume,
                   'AdjClose':adjclose})
#    AdjClose   Close         Dates    High     Low    Open      Volume
#0      93.99   93.99  Feb 12, 2016   94.50   93.01   94.19  40,121,700
#1      93.70   93.70  Feb 11, 2016   94.72   92.59   93.79  49,686,200
#2      94.27   94.27  Feb 10, 2016   96.35   94.10   95.92  42,245,000
#3      94.99   94.99   Feb 9, 2016   95.94   93.93   94.29  44,331,200
#4      95.01   95.01   Feb 8, 2016   95.70   93.04   93.13  54,021,400
#5      94.02   94.02   Feb 5, 2016   96.92   93.69   96.52  46,418,100
#...
#61    111.73  112.34  Nov 13, 2015  115.57  112.27  115.20  45,812,400
#62    115.10  115.72  Nov 12, 2015  116.82  115.65  116.26  32,525,600
#63    115.48  116.11  Nov 11, 2015  117.42  115.21  116.37  45,218,000
#64    116.14  116.77  Nov 10, 2015  118.07  116.06  116.90  59,127,900
#65    119.92  120.57   Nov 9, 2015  121.81  120.05  120.96  33,871,400
</code></pre>
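<p>As a variation on the cleanup step above, the colspan row can also be skipped inside the XPath expression itself, so no hard-coded list index is needed. The following is only a minimal sketch under stated assumptions: <code>SAMPLE_HTML</code> is a made-up three-column fragment that imitates the table's markup (the Feb. 4 values are illustrative, not real quotes), and <code>table_to_frame</code> and <code>COLUMNS</code> are hypothetical names introduced for this example. It runs offline against the inline string rather than the live page.</p>

```python
import pandas as pd
from lxml import etree

# Hypothetical sample markup: three data rows, one of which is a colspan row.
# The Feb. 4 numbers are illustrative placeholders, not real quotes.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Open</th><th>Close</th></tr>
  <tr><td class="yfnc_tabledata1">Feb 5, 2016</td><td class="yfnc_tabledata1">96.52</td><td class="yfnc_tabledata1">94.02</td></tr>
  <tr><td class="yfnc_tabledata1">Feb 4, 2016</td><td class="yfnc_tabledata1" colspan="2">0.52 Dividend</td></tr>
  <tr><td class="yfnc_tabledata1">Feb 4, 2016</td><td class="yfnc_tabledata1">95.00</td><td class="yfnc_tabledata1">96.00</td></tr>
</table>
"""

COLUMNS = ["Dates", "Open", "Close"]  # hypothetical column names for the sample


def table_to_frame(html_text):
    """Parse the table row by row, skipping any row that holds a colspan cell."""
    tree = etree.HTML(html_text)
    # Select data rows, then drop rows where any <td> carries a colspan attribute.
    rows = tree.xpath("//tr[td/@class='yfnc_tabledata1'][not(td/@colspan)]")
    records = [[(td.text or "").strip() for td in row.xpath("./td")] for row in rows]
    # Keep only rows with the expected number of cells.
    records = [r for r in records if len(r) == len(COLUMNS)]
    return pd.DataFrame(records, columns=COLUMNS)


df = table_to_frame(SAMPLE_HTML)
print(df)
```

<p>Reading whole rows instead of whole columns keeps each record aligned, so an irregular row cannot shift the values of one column relative to the others. The cell values are still strings at this point, as in the code above.</p>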