Can't find the right div to get the data from a table

Posted 2024-05-21 07:37:33

I'm trying to scrape data from https://gmatclub.com/forum/decision-tracker.html. After a lot of trial and error, I still can't figure out how to get the data out of the table.

import requests
from bs4 import BeautifulSoup
url = "https://gmatclub.com/forum/decision-tracker.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.find('div', attrs = {'class' : 'mainPage'})
print(container)

1 answer

Answer #1 · posted 2024-05-21 07:37:33

If you want to give it a try, open Developer Tools -> Network -> XHR in your browser and grab the updates endpoint:

https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all

Then use it to fetch the current data.

Here's how:

import requests

with requests.Session() as connection:
    # Send browser-like headers with every request made through this session.
    connection.headers.update(
        {
            "referer": "https://gmatclub.com/forum/decision-tracker.html",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
        }
    )
    # Visit the page first so the session picks up any cookies it sets.
    _ = connection.get("https://gmatclub.com/forum/decision-tracker.html")
    # Request the JSON endpoint and decode the response body.
    endpoint = connection.get("https://gmatclub.com/api/schools/v1/forum/app-tracker-latest-updates?limit=50&year=all").json()
    # Each entry under "statistics" is one row of the table.
    for item in endpoint["statistics"]:
        print(item)

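This prints the entries of a list of dictionaries, which is effectively your table. You can then access any key in each dictionary. One entry looks like this: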

{'id': '194901', 'user_id': '273781', 'applicant_type': 'regular', 'round_id': '4236', 'status_id': '9', 'school_id': '5', 'school_title': 'Booth', 'program_id': '11', 'program_type': '1', 'date': '2021-05-24 23:56:46', 'seconds_ago': '511', 'country': None, 'state': None, 'gmat_quant': None, 'gmat_verbal': None, 'gmat_total': None, 'gmat_modified': None, 'gre_quant': None, 'gre_verbal': None, 'gre_total': None, 'gre_modified_time': None, 'ea_quant': None, 'ea_verbal': None, 'ea_ir': None, 'ea_total': None, 'ea_modified_time': None, 'cat_india_percentile': None, 'cat_india_total': None, 'cat_india_modified_time': None, 'industry': None, 'we': None, 'gpa': None, 'accepted_via': 'phone', 'scholarship': '1', 'user_colour': '', 'truncate_username': '0', 'user_name': 'binhtbc'}
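As a minimal sketch of accessing keys on one of these dictionaries, the snippet below uses a record copied from the output above and trimmed to a few fields. The helper name `describe` is hypothetical and not part of the original answer; note that many fields come back as `None`, so it is worth handling that case.

```python
# One record from the output above, trimmed to a few fields for brevity.
record = {
    "id": "194901",
    "school_title": "Booth",
    "date": "2021-05-24 23:56:46",
    "gmat_total": None,
    "accepted_via": "phone",
    "user_name": "binhtbc",
}


def describe(item):
    """Build a one-line summary from a single record (hypothetical helper)."""
    # .get() returns None instead of raising KeyError when a key is absent.
    gmat = item.get("gmat_total")
    gmat_text = gmat if gmat is not None else "n/a"
    return f'{item["user_name"]} | {item["school_title"]} | {item["date"]} | GMAT: {gmat_text}'


print(describe(record))
```

In the real script you would call the helper inside the `for item in endpoint["statistics"]` loop instead of on a hard-coded record.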

Alternatively, you can load the response into a pandas DataFrame. For example:

import pandas as pd

# `endpoint` is the decoded JSON from the request above.
df = pd.DataFrame(endpoint["statistics"])
print(df.head(10))

Output:

       id  user_id applicant_type  ... user_colour truncate_username    user_name
0  194901   273781        regular  ...                             0      binhtbc
1  183152   643532        regular  ...                             0         AG23
2  194061     None        regular  ...      2a2a2a                 0      private
3  192923  1034549        regular  ...                             0  RicardoLima
4  193383  1034549        regular  ...                             0  RicardoLima
5  194900  1130431        regular  ...                          None          VFA
6  177937   876400        regular  ...      F87431                 0   icanhazmba
7  194899  1128750        regular  ...                          None     Amanda29
8  194898  1128002        regular  ...                          None      Raydiaz
9  193974  1021516        regular  ...                             0    Kurathore

If you like, save it as a .csv file:

df.to_csv("your_table_data.csv", index=False)
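If pandas is not installed, the same list of dictionaries could probably be written out with the standard library's `csv` module instead. The following is only a sketch under that assumption: the helper name `rows_to_csv` is hypothetical, the rows are copied from the output above and trimmed to four columns, and an in-memory buffer stands in for a real file so the result can be inspected.

```python
import csv
import io

# Three records shaped like the API output above, trimmed to a few fields.
rows = [
    {"id": "194901", "user_id": "273781", "applicant_type": "regular", "user_name": "binhtbc"},
    {"id": "183152", "user_id": "643532", "applicant_type": "regular", "user_name": "AG23"},
    {"id": "194061", "user_id": None, "applicant_type": "regular", "user_name": "private"},
]


def rows_to_csv(records, handle):
    """Write a list of dicts to an open text handle as CSV (hypothetical helper)."""
    # Use the keys of the first record as the column order.
    fieldnames = list(records[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    # None values are written as empty fields.
    writer.writerows(records)


buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open("your_table_data.csv", "w", newline="", encoding="utf-8")` in place of the buffer.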
