How to read data from a web link with dropdown fields using Python

Posted 2024-04-29 05:29:56


I want to read the data from the link below into a pandas DataFrame using Python.

url='https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'

It has several dropdown fields, such as Select Instrument, Select Symbol, Select Year, Select Expiry, Select Option Type, Enter Strike Price, Select Time Period, etc.

[Screenshot: NSE Page]

I want to send the output to a pandas DataFrame for further processing.


2 Answers
import requests
import pandas as pd

#############################################
pd.set_option('display.max_rows', 500000)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 50000)
#############################################

# create session to get and keep cookies
s = requests.Session()

# get page and cookies
url = 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
s.get(url)

# get HTML with tables
symbol = ['SBIN']

dates = ['17-May-2019']

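# the query parameters below mirror the page's dropdown fields
# (instrument type, symbol, expiry date, option type, strike price, date range)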
url = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=OPTSTK&symbol=" + symbol[0] + "&expiryDate=select&optionType=CE&strikePrice=&dateRange=day&fromDate=" + dates[0] + "&toDate=" + dates[0] + "&segmentLink=9&symbolCount="
# print(url)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
}

# get HTML from url (use the session so the saved cookies are sent)
r = s.get(url, headers=headers)
# print('status:', r.status_code)
# print(r.text)

# use pandas to parse tables in HTML to DataFrames
all_tables = pd.read_html(r.text)
# print('tables:', len(all_tables))


# get first DataFrame
df = all_tables[0]
# print(df.columns)

# use the second row as the column names, then drop the two header rows
df = df.rename(columns=df.iloc[1]).drop(df.index[0])
df = df.iloc[1:].reset_index(drop=True)

df = df[['Symbol','Date','Expiry','Optiontype','Strike Price','Close','LTP','No. of contracts','Open Int','Change in OI','Underlying Value']]
print(df)
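As a side note, instead of concatenating strings, the same request can be built by letting requests encode the dropdown values from a dict. A minimal sketch under the same assumptions (parameter names and values taken from the URL above):

import requests
import pandas as pd

s = requests.Session()
# visit the page first to collect the cookies
s.get('https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm')

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
}

# every key mirrors one dropdown field on the page
params = {
    'instrumentType': 'OPTSTK',
    'symbol': 'SBIN',
    'expiryDate': 'select',
    'optionType': 'CE',
    'strikePrice': '',
    'dateRange': 'day',
    'fromDate': '17-May-2019',
    'toDate': '17-May-2019',
    'segmentLink': '9',
    'symbolCount': '',
}

# requests builds the same query string as the hand-concatenated url above
r = s.get('https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp',
          params=params, headers=headers)
df = pd.read_html(r.text)[0]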

Using the "Network" tab in Chrome/Firefox DevTools, I can see all the requests sent from the browser to the server. When I click "Get Data", I see a URL that carries the dropdown-field options, for example

https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount=

Normally I can use the URL directly in pd.read_html("https://...") to get all the tables from the HTML, and then use [0] to take the first table as a DataFrame.
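For reference, that direct approach is just the following (using the URL captured above; on this page it raises an error, presumably because the server rejects requests that do not carry browser-like headers and cookies):

import pandas as pd

url = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount="

# parse all tables in the page and take the first one as a DataFrame
df = pd.read_html(url)[0]  # fails for this NSE url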

Because that gave me an error, I used the requests module to fetch the HTML and then pd.read_html("string_with_html") to convert all the tables in the HTML into DataFrames.

That gave me a DataFrame with a multi-level column index and 3 unknown columns, which I dropped.

More details are in the comments in the code:

import requests
import pandas as pd

# create session to get and keep cookies
s = requests.Session()

# get page and cookies 
url = 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
s.get(url)

# get HTML with tables
url = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?instrumentType=FUTIDX&symbol=NIFTY&expiryDate=select&optionType=select&strikePrice=&dateRange=day&fromDate=&toDate=&segmentLink=9&symbolCount="
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.nseindia.com/products/content/derivatives/equities/historical_fo.htm'
}

# get HTML from url (use the session so the saved cookies are sent)
r = s.get(url, headers=headers)
print('status:', r.status_code)
#print(r.text)

# use pandas to parse tables in HTML to DataFrames
all_tables = pd.read_html(r.text)
print('tables:', len(all_tables))


# get first DataFrame
df = all_tables[0]
#print(df.columns)

# drop multilevel column index
df.columns = df.columns.droplevel()
#print(df.columns)

# drop unknown columns
df = df.drop(columns=['Unnamed: 14_level_1', 'Unnamed: 15_level_1', 'Unnamed: 16_level_1'])
print(df.columns)
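To prepare the DataFrame for the "further processing" the question asks about, a small hedged sketch (the column names are the ones selected in the first snippet; adjust them to your df.columns, and note that pd.read_html may already return numeric dtypes for some columns):

# convert the numeric columns so they can be used in calculations
for col in ['Close', 'LTP', 'No. of contracts', 'Open Int', 'Change in OI']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# save a copy for later use (the filename is arbitrary)
df.to_csv('nse_historical_fo.csv', index=False)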

Result:

