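<p>Using the <code>Network</code> tab in the Chrome/Firefox <code>DevTools</code>, I can see every request the browser sends to the server. When I click "Get Data", I see a URL that carries the values of the dropdown fields as query parameters, for example:</p>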
<p><a href="https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount=" rel="nofollow noreferrer">https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount=</a></p>
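<p>To see which dropdown field maps to which query parameter, the URL can be split with the standard library. This is only a sketch: the names <code>params</code>, <code>base_url</code> and <code>rebuilt</code> are mine, not part of the page.</p>

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

url = ("https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp"
       "?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select"
       "&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount=")

parts = urlsplit(url)

# keep_blank_values=True preserves empty fields such as "strikePrice="
params = dict(parse_qsl(parts.query, keep_blank_values=True))
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

for name, value in params.items():
    print(f"{name} = {value!r}")

# rebuilding the query string from the dict gives back the same URL
rebuilt = base_url + "?" + urlencode(params)
print(rebuilt == url)
```

<p>A dict like this can also be passed as <code>params=</code> to <code>requests.get()</code> or <code>Session.get()</code> together with <code>base_url</code>, which makes it easier to change a single field than editing the long URL by hand.</p>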
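<p>Normally I could pass the URL directly to <code>pd.read_html("https://...")</code> to get all the tables in the HTML, and then use <code>[0]</code> to get the first table as a DataFrame.</p>
<p>Because that gave me an error, I use the <code>requests</code> module to fetch the HTML and then <code>pd.read_html("string_with_html")</code> to convert all the tables in the HTML to DataFrames.</p>
<p>It gives me a <code>DataFrame</code> with a multi-level column index and three unknown columns, which I remove.</p>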
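<p>The column clean-up can be tried on its own without hitting the server. The sketch below uses a small synthetic DataFrame as a stand-in for the parsed table (the header names and the row are made up), and it drops every column whose name starts with <code>Unnamed</code> instead of listing the names one by one, which is my variation rather than what the code in this answer does.</p>

```python
import pandas as pd

# Synthetic stand-in for the parsed table: a two-level header in which
# the trailing columns have no real name, so they carry "Unnamed: ..." labels.
columns = pd.MultiIndex.from_tuples([
    ("Historical data", "Symbol"),
    ("Historical data", "Date"),
    ("Historical data", "Close"),
    ("Historical data", "Unnamed: 3_level_1"),
    ("Historical data", "Unnamed: 4_level_1"),
])
df = pd.DataFrame([["NIFTY", "01-Jan-2000", 100.0, None, None]], columns=columns)

# drop the top level of the header, keeping the inner column names
df.columns = df.columns.droplevel(0)

# drop every column whose name starts with "Unnamed"
unnamed = [c for c in df.columns if str(c).startswith("Unnamed")]
df = df.drop(columns=unnamed)

print(list(df.columns))
```

<p>Matching on the prefix keeps working if the number of empty columns changes, at the cost of also dropping any real column that happens to be named that way.</p>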
<p>More information is in the code comments.</p>
<pre><code>import requests
import pandas as pd
# create session to get and keep cookies
s = requests.Session()
# get page and cookies
url = 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
s.get(url)
# get HTML with tables
url = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount="
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
'X-Requested-With': 'XMLHttpRequest',
'Referer': 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
}
# get HTML from url, reusing the session so its cookies are sent
r = s.get(url, headers=headers)
print('status:', r.status_code)
#print(r.text)
# use pandas to parse tables in HTML to DataFrames
all_tables = pd.read_html(r.text)
print('tables:', len(all_tables))
# get first DataFrame
df = all_tables[0]
#print(df.columns)
# drop multilevel column index
df.columns = df.columns.droplevel()
#print(df.columns)
# drop unknown columns
df = df.drop(columns=['Unnamed: 14_level_1', 'Unnamed: 15_level_1', 'Unnamed: 16_level_1'])
print(df.columns)
</code></pre>
<p>Result:</p>
^{pr2}$