Getting href links from the first table of a headless browser page (playwright._impl._errors.Error: Event loop is closed! Is Playwright already stopped?)
I'm trying to get the links from the first table on a headless browser page, but the error message I get isn't helping me; it's just followed by a bunch of "^" symbols.
I have to use a headless browser because I was scraping empty tables and wanted to figure out how this site's HTML works, though I admit I don't really understand what's going on with it.
I also want to complete the links for later use, which is what the last three lines of the code below are for:
from playwright.sync_api import sync_playwright

# headless browser to scrape
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://fbref.com/en/comps/9/Premier-League-Stats")

# open the file up
with open("path", 'r') as f:
    file = f.read()

years = list(range(2024, 2022, -1))
all_matches = []

standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

for year in years:
    standings_table = page.locator("table.stats_table").first
    link_locators = standings_table.get_by_role("link").all()

    for l in link_locators:
        l.get_attribute("href")

    print(link_locators)
    link_locators = [l for l in links if "/squads/" in l]
    team_urls = [f"https://fbref.com{l}" for l in link_locators]
    print(team_urls)

browser.close()
The only traceback I get is:
Traceback (most recent call last):
  File "path", line 27, in <module>
    link_locators = standings_table.get_by_role("link").all()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\sync_api\_generated.py", line 15936, in all
    return mapping.from_impl_list(self._sync(self._impl_obj.all()))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\_impl\_sync_base.py", line 102, in _sync
    raise Error("Event loop is closed! Is Playwright already stopped?")
playwright._impl._errors.Error: Event loop is closed! Is Playwright already stopped?

Process finished with exit code 1
My code is only 33 lines long (this is just the start of a loop), so I'm not sure what the last two errors are referring to.
I just can't extract the links; maybe it has something to do with .first.
I tried the solution I found in "Get links using python playwright", but it didn't work.
1 Answer
The page and browser are closed when the context manager (with) block ends, so you can't use them outside of that block. Here's a minimal reproduction of the error:
from playwright.sync_api import sync_playwright  # 1.40.0

with sync_playwright() as p:
    browser = p.chromium.launch()

browser.close()  # fails: the with block has ended, so Playwright has stopped
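Moving the call back inside the block fixes the reproduction; a minimal sketch (the rewrite below applies the same idea to the full scraping code):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    browser.close()  # still inside the with block, so the connection is alive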
Here's a suggested rewrite:
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    url = "<Your URL>"
    page.goto(url, wait_until="domcontentloaded")
    team_urls = []

    for year in range(2024, 2022, -1):
        standings_table = page.locator("table.stats_table").first

        for x in standings_table.get_by_role("link").all():
            href = x.get_attribute("href")

            if "/squads/" in href:
                team_urls.append(f'https://www.fbref.com{href}')

    print(team_urls)
    browser.close()
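One thing to note (my observation, not part of the rewrite above): the loop body never uses year, so both iterations scrape the same page and every URL ends up in team_urls twice. If only the current table is wanted, the loop can be dropped entirely; a sketch of the body under that assumption:

team_urls = []
standings_table = page.locator("table.stats_table").first

for x in standings_table.get_by_role("link").all():
    href = x.get_attribute("href")

    # same filter as above, but run once, so no duplicate URLs
    if "/squads/" in href:
        team_urls.append(f'https://www.fbref.com{href}')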
Blocking resources can help speed things up a bit:
# ...
def handle(route, request):
    block = "image", "script", "xhr", "fetch"

    if request.resource_type in block:
        return route.abort()

    route.continue_()

page.route("**", handle)
page.goto(url, wait_until="domcontentloaded")
# ...
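If you want to shave off a little more, the tuple can be extended; "stylesheet" and "font" are also valid Playwright resource_type values (my variation; the handler above blocks only the four types listed):

def handle(route, request):
    # also drop stylesheets and fonts; the table data doesn't need them
    block = "image", "script", "xhr", "fetch", "stylesheet", "font"

    if request.resource_type in block:
        return route.abort()

    route.continue_()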
But you can do this more simply and efficiently without Playwright, since the data is available in the static HTML:
import requests  # 2.25.1
from bs4 import BeautifulSoup  # 4.10.0

url = "<Your URL>"
soup = BeautifulSoup(requests.get(url).text, "lxml")
team_urls = []

for year in range(2024, 2022, -1):
    standings_table = soup.select_one("table.stats_table")

    for x in standings_table.select("a"):
        href = x["href"]

        if "/squads/" in href:
            team_urls.append(f'https://www.fbref.com{href}')

print(team_urls)
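As a side note (my variation, not part of the answer above): the filtering can be pushed into the CSS selector itself, since BeautifulSoup's select supports attribute substring matches:

import requests
from bs4 import BeautifulSoup

url = "<Your URL>"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# a[href*="/squads/"] matches only anchors whose href contains "/squads/"
team_urls = [
    f'https://www.fbref.com{a["href"]}'
    for a in soup.select_one("table.stats_table").select('a[href*="/squads/"]')
]
print(team_urls)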
Benchmark:

With Playwright (and resource blocking):

real    0m4.875s
user    0m1.331s
sys     0m0.250s

With Requests/BeautifulSoup:

real    0m0.517s
user    0m0.376s
sys     0m0.029s