Getting href links from the first table of a headless browser page (playwright._impl._errors.Error: Event loop is closed! Is Playwright already stopped?)
I'm trying to get the links from the first table on a headless browser page, but the error message I get isn't helping me; it's just followed by a bunch of "^" symbols.
I have to use a headless browser because I was scraping empty tables and wanted to figure out how this site's HTML works, though I admit I don't really understand what's going on with it.
I also want to complete the links for later use, which is what the last three lines of the code below are for:
from playwright.sync_api import sync_playwright

# headless browser to scrape
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://fbref.com/en/comps/9/Premier-League-Stats")

# open the file up
with open("path", 'r') as f:
    file = f.read()

years = list(range(2024, 2022, -1))
all_matches = []

standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

for year in years:
    standings_table = page.locator("table.stats_table").first
    link_locators = standings_table.get_by_role("link").all()

    for l in link_locators:
        l.get_attribute("href")

    print(link_locators)
    link_locators = [l for l in links if "/squads/" in l]
    team_urls = [f"https://fbref.com{l}" for l in link_locators]
    print(team_urls)

browser.close()
The only traceback I get is:
Traceback (most recent call last):
  File "path", line 27, in <module>
    link_locators = standings_table.get_by_role("link").all()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\sync_api\_generated.py", line 15936, in all
    return mapping.from_impl_list(self._sync(self._impl_obj.all()))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\_impl\_sync_base.py", line 102, in _sync
    raise Error("Event loop is closed! Is Playwright already stopped?")
playwright._impl._errors.Error: Event loop is closed! Is Playwright already stopped?

Process finished with exit code 1
My code is only 33 lines long (this is just the start of a loop), so I'm not sure what the last two errors are referring to.
I just can't extract the links; maybe it has something to do with .first.
I tried the solution I found in "Get links using python playwright", but it didn't work.
1 Answer
The page and browser are closed when the context manager (with) block ends, so you can't use them outside of that block. Here's a minimal reproduction of the error:
from playwright.sync_api import sync_playwright  # 1.40.0

with sync_playwright() as p:
    browser = p.chromium.launch()

browser.close()  # fails: the with block has ended, so Playwright has stopped
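Moving the call back inside the block fixes the reproduction; a minimal sketch (the rewrite below applies the same idea to the full scraping code):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    browser.close()  # still inside the with block, so the connection is alive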
Here's a suggested rewrite:
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    url = "<Your URL>"
    page.goto(url, wait_until="domcontentloaded")
    team_urls = []

    for year in range(2024, 2022, -1):
        standings_table = page.locator("table.stats_table").first

        for x in standings_table.get_by_role("link").all():
            href = x.get_attribute("href")

            if "/squads/" in href:
                team_urls.append(f'https://www.fbref.com{href}')

    print(team_urls)
    browser.close()
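One thing to note (my observation, not part of the rewrite above): the loop body never uses year, so both iterations scrape the same page and every URL ends up in team_urls twice. If only the current table is wanted, the loop can be dropped entirely; a sketch of the body under that assumption:

team_urls = []
standings_table = page.locator("table.stats_table").first

for x in standings_table.get_by_role("link").all():
    href = x.get_attribute("href")

    # same filter as above, but run once, so no duplicate URLs
    if "/squads/" in href:
        team_urls.append(f'https://www.fbref.com{href}')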
Blocking resources can help speed things up a bit:
# ...
def handle(route, request):
    block = "image", "script", "xhr", "fetch"

    if request.resource_type in block:
        return route.abort()

    route.continue_()

page.route("**", handle)
page.goto(url, wait_until="domcontentloaded")
# ...
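If you want to shave off a little more, the tuple can be extended; "stylesheet" and "font" are also valid Playwright resource_type values (my variation; the handler above blocks only the four types listed):

def handle(route, request):
    # also drop stylesheets and fonts; the table data doesn't need them
    block = "image", "script", "xhr", "fetch", "stylesheet", "font"

    if request.resource_type in block:
        return route.abort()

    route.continue_()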
But you can do this more simply and efficiently without Playwright, since the data is available in the static HTML:
import requests  # 2.25.1
from bs4 import BeautifulSoup  # 4.10.0

url = "<Your URL>"
soup = BeautifulSoup(requests.get(url).text, "lxml")
team_urls = []

for year in range(2024, 2022, -1):
    standings_table = soup.select_one("table.stats_table")

    for x in standings_table.select("a"):
        href = x["href"]

        if "/squads/" in href:
            team_urls.append(f'https://www.fbref.com{href}')

print(team_urls)
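As a side note (my variation, not part of the answer above): the filtering can be pushed into the CSS selector itself, since BeautifulSoup's select supports attribute substring matches:

import requests
from bs4 import BeautifulSoup

url = "<Your URL>"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# a[href*="/squads/"] matches only anchors whose href contains "/squads/"
team_urls = [
    f'https://www.fbref.com{a["href"]}'
    for a in soup.select_one("table.stats_table").select('a[href*="/squads/"]')
]
print(team_urls)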
Benchmark:

With Playwright (and resource blocking):

real    0m4.875s
user    0m1.331s
sys     0m0.250s

With Requests/BeautifulSoup:

real    0m0.517s
user    0m0.376s
sys     0m0.029s