Trying to scrape a dynamic website in Python with requests_html

Asked 2025-04-12 13:57

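When I try to scrape this website I run into a problem, and I can't tell what is going wrong. I first tried HTMLSession, but Python told me to use AsyncHTMLSession instead, because the former cannot be used inside an already running event loop. With AsyncHTMLSession I keep hitting the error shown further down.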

url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
session = AsyncHTMLSession()
response = session.get(url)
await response.html.arender()
await session.close() 

print(response.html)
print(response.html.html)

This is the error message I get:

AttributeError                            Traceback (most recent call last)
Cell In [12], line 4
      2 session = AsyncHTMLSession()
      3 response = session.get(url)
----> 4 await response.html.arender()
      5 await session.close() 
      7 print(response.html)

AttributeError: '_asyncio.Future' object has no attribute 'html'

Any help would be much appreciated.

I added await to the render call, tried passing a sleep time to the render call, and also added await asession.close(), but I still get the same error.

1 Answer


Use a different URL to load the HTML (the direct document, not the Ajax-driven viewer), for example:

from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup

# original_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm'
new_url = "https://www.sec.gov/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
}
soup = BeautifulSoup(requests.get(new_url, headers=headers).content, "html.parser")

balance_sheets = soup.select_one("#balance_sheets ~ table")

# for example, load the table into dataframe:
df = pd.read_html(StringIO(str(balance_sheets)))[0].fillna("")
print(df)

This prints:

                                                                                           0 1     2       3  4 5     6       7  8 
0                                                                                                                                 
1                                                                              (In millions)                                      
2                                                                                                                                 
3                                                                                                                                  
4                                                                                   June 30,    2023    2023       2022    2022    
5                                                                                                                                 
6                                                                                     Assets                                       
7                                                                            Current assets:                                       
8                                                                  Cash and cash equivalents       $   34704          $   13931   
9                                                                     Short-term investments           76558              90826    
10                                                                                                                                 
11                                                                                                                                 
12                                  Total cash, cash equivalents, and short-term investments          111262             104757   
13              Accounts receivable, net of allowance for doubtful accounts of $650 and $633           48688              44261   
14                                                                               Inventories            2500               3742   
15                                                                      Other current assets           21807              16924   
16                                                                                                                                 
17                                                                                                                                 
18                                                                      Total current assets          184257             169684   
19            Property and equipment, net of accumulated depreciation of $68,251 and $59,660           95641              74398   
20                                                       Operating lease right-of-use assets           14346              13148   
21                                                                        Equity investments            9879               6891   
22                                                                                  Goodwill           67886              67524   
23                                                                    Intangible assets, net            9366              11298   
24                                                                    Other long-term assets           30601              21897   
25                                                                                                                                 
26                                                                                                                                 
27                                                                              Total assets       $  411976          $  364840   

...
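As for the AttributeError in the question: with AsyncHTMLSession, `session.get(url)` does not return the response directly. It returns a future that has to be awaited first, so the line would need to read `response = await session.get(url)` before `response.html` exists. The sketch below reproduces that mechanism with plain asyncio only. `FakeResponse` and `blocking_get` are hypothetical stand-ins invented for illustration and are not part of requests_html, and whether `arender()` then succeeds on this particular SEC page is not something this sketch checks.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for a response object (not part of requests_html)."""
    html = "<html>rendered page</html>"


def blocking_get(url):
    """Hypothetical blocking fetch that stands in for the real HTTP request."""
    return FakeResponse()


async def fetch(url):
    loop = asyncio.get_running_loop()
    # Running a blocking call in an executor yields a future, not the result.
    pending = loop.run_in_executor(None, blocking_get, url)
    # Accessing pending.html here would raise the AttributeError from the question.
    missing_before_await = not hasattr(pending, "html")
    # Awaiting the future produces the actual response object.
    response = await pending
    return missing_before_await, response.html


# In a script, asyncio.run() drives the coroutine. In a notebook cell,
# where a loop is already running, use `await fetch(...)` instead.
missing_before_await, html = asyncio.run(fetch("https://example.com"))
print(missing_before_await)  # True
print(html)                  # <html>rendered page</html>
```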

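The `new_url` above is simply the path carried in the `doc` query parameter of the viewer URL, resolved against the same host. A minimal sketch of that conversion follows, assuming other viewer links follow the same `ix?doc=<path>` pattern. The helper name `direct_document_url` is made up for illustration.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def direct_document_url(viewer_url):
    """Turn an 'ix?doc=<path>' viewer URL into the direct document URL.

    Hypothetical helper: returns the input unchanged when there is no
    'doc' query parameter.
    """
    parts = urlsplit(viewer_url)
    doc_paths = parse_qs(parts.query).get("doc")
    if not doc_paths:
        return viewer_url
    # Resolve the document path against the scheme and host of the viewer URL.
    return urljoin(f"{parts.scheme}://{parts.netloc}", doc_paths[0])


viewer = (
    "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/"
    "000095017023035122/msft-20230630.htm"
)
print(direct_document_url(viewer))
```

For the URL in the question this yields the same address as `new_url`, which can then be fetched with plain requests as shown in the answer.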