Trying to scrape a dynamic website in Python with requests_html
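When I try to scrape this website I run into a problem, and I can't tell what is going wrong. I first tried HTMLSession, but Python told me to use AsyncHTMLSession instead, because the former cannot be used inside an event loop that is already running. With AsyncHTMLSession I keep getting the error below.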
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
session = AsyncHTMLSession()
response = session.get(url)
await response.html.arender()
await session.close()
print(response.html)
print(response.html.html)
This is the error message I get:
AttributeError Traceback (most recent call last)
Cell In [12], line 4
2 session = AsyncHTMLSession()
3 response = session.get(url)
----> 4 await response.html.arender()
5 await session.close()
7 print(response.html)
AttributeError: '_asyncio.Future' object has no attribute 'html'
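Any help would be greatly appreciated.
I added await to the render call, tried adding a sleep time to the render call, and also added await asession.close(), but I still get the same error.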
1 Answer
Load the HTML from a different URL, one that serves the document directly instead of loading it via Ajax. For example:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# original_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm'
new_url = "https://www.sec.gov/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
}
soup = BeautifulSoup(requests.get(new_url, headers=headers).content, "html.parser")
balance_sheets = soup.select_one("#balance_sheets ~ table")
# for example, load the table into dataframe:
df = pd.read_html(StringIO(str(balance_sheets)))[0].fillna("")
print(df)
The output is:
0 1 2 3 4 5 6 7 8
0
1 (In millions)
2
3
4 June 30, 2023 2023 2022 2022
5
6 Assets
7 Current assets:
8 Cash and cash equivalents $ 34704 $ 13931
9 Short-term investments 76558 90826
10
11
12 Total cash, cash equivalents, and short-term investments 111262 104757
13 Accounts receivable, net of allowance for doubtful accounts of $650 and $633 48688 44261
14 Inventories 2500 3742
15 Other current assets 21807 16924
16
17
18 Total current assets 184257 169684
19 Property and equipment, net of accumulated depreciation of $68,251 and $59,660 95641 74398
20 Operating lease right-of-use assets 14346 13148
21 Equity investments 9879 6891
22 Goodwill 67886 67524
23 Intangible assets, net 9366 11298
24 Other long-term assets 30601 21897
25
26
27 Total assets $ 411976 $ 364840
...
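
As for the AttributeError in the question: as far as I can tell, AsyncHTMLSession.get() does not return the response itself but a future that has to be awaited, so the call would likely need to be `response = await session.get(url)` before `response.html` exists. The sketch below illustrates that mechanism with a hypothetical stand-in class (`FakeAsyncSession`, which is not part of requests_html) and a placeholder URL. It makes no network request and is only an illustration under those assumptions, not the library's actual implementation.

```python
import asyncio
from types import SimpleNamespace


class FakeAsyncSession:
    """Hypothetical stand-in for AsyncHTMLSession (not the real class)."""

    def get(self, url):
        # Hand the blocking work to a thread pool and return a future
        # immediately, without waiting for the result.
        loop = asyncio.get_running_loop()
        return loop.run_in_executor(None, self._blocking_get, url)

    @staticmethod
    def _blocking_get(url):
        # Placeholder response object; no network request is made.
        return SimpleNamespace(url=url, html="<html>placeholder</html>")


async def main():
    session = FakeAsyncSession()
    url = "https://example.com/"  # placeholder URL, assumption for the demo

    # Without await, the return value is a future, which has no .html attribute.
    pending = session.get(url)
    has_html_before = hasattr(pending, "html")

    # Awaiting the future yields the response object, which does have .html.
    response = await pending
    has_html_after = hasattr(response, "html")

    return has_html_before, has_html_after


result = asyncio.run(main())
print(result)
```

In a notebook, where an event loop is already running, `await main()` would replace `asyncio.run(main())`. Even with the await in place, rendering the viewer page may still fail for other reasons, which is why fetching the direct document URL as shown above is the simpler route.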