Extracting data from a website with multiple tabs

from bs4 import BeautifulSoup, SoupStrainer
import requests
import pandas as pd

# For establishing connection
proxies = {'http': 'http:...'}
url = 'http://yit.maya-tour.co.il/yit-pass/Drop_Report.aspx?client_code=2660&coordinator_code=2669'
page = requests.get(url, proxies=proxies)
data = page.text
soup = BeautifulSoup(data, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

html = requests.get(url, proxies=proxies).text
df_list = pd.read_html(html)
df = df_list[1]
df.to_csv('my data.csv')

2 answers

User

Answer 1 · edited 2024-04-19 07:56:45

Every 15 seconds, the page calls the following JavaScript code:

function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

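Open the browser's developer tools and set a breakpoint in this function. Once you understand which parameters the code submits, submit the same form from your Python code using requests (or another HTTP client).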
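The form submission described above can be sketched roughly as follows. This is a minimal sketch, not a tested solution for this site: it assumes the page is a typical ASP.NET Web Forms page whose hidden inputs (usually fields such as __VIEWSTATE) must be sent back along with __EVENTTARGET and __EVENTARGUMENT. The helper names `build_postback_payload` and `fetch_tab` are hypothetical, and the real event target value has to be read from the breakpoint in the browser.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_payload(html, event_target, event_argument=""):
    """Copy the page's hidden form fields and set the two values
    that __doPostBack writes before calling theForm.submit()."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


def fetch_tab(url, event_target, event_argument="", proxies=None):
    """Hypothetical helper: load the page, then replay the postback.
    Assumes the form posts back to the same URL."""
    import requests  # imported here so the payload helper works without it

    with requests.Session() as session:
        first = session.get(url, proxies=proxies)
        payload = build_postback_payload(first.text, event_target, event_argument)
        return session.post(url, data=payload, proxies=proxies).text
```

The HTML returned by `fetch_tab` could then be passed to `pd.read_html` in the same way as in the question. A session is used so that any cookies set by the first request are sent with the postback.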

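User

Answer 2 · edited 2024-04-19 07:56:45

You should extract the hyperlinks from the first page and use them in your code. (If there are no hyperlinks, put the other URLs into the loop below.)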

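import pandas as pd

df_list = []
n = 3  # placeholder, not from the original answer: last page number + 1

# Call each page here. This assumes the page number is passed as a
# "pNumber" query parameter at the end of the main URL.
for p in range(1, n):
    url = 'http://yit.maya-tour.co.il/yit-pass/Drop_Report.aspx?client_code=2660&coordinator_code=2669&pNumber=%d' % p
    # read_html returns a list of tables; take the first one on each page
    df_list.append(pd.read_html(url)[0])

# Combine the per-page tables into a single DataFrame
df = pd.concat(df_list)
print(df)
df.to_csv('my data.csv')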
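The first suggestion in this answer, collecting the hyperlinks from the first page, might look something like the sketch below. It is only an illustration: the helper name `extract_links` is hypothetical, and it assumes the tab links are ordinary href attributes. Links of the form `javascript:__doPostBack(...)` are skipped, since they cannot be fetched as URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute, de-duplicated URLs found in the page."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Skip script-driven links and in-page anchors
        if href.lower().startswith("javascript:") or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)
        if absolute not in links:
            links.append(absolute)
    return links
```

Each URL returned by `extract_links` could then replace the constructed `url` inside the loop above.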
