Python web scraping

Posted 2022-05-21 07:16:17


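I'm new to Python and am trying to pull data from the site below. This code works on other sites, but I can't get it to work on Next Gen Stats. Does anyone know why? My code and the error I get are below.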

import pandas as pd
import numpy as np
import html5lib

urlwk1 = 'https://nextgenstats.nfl.com/stats/receiving/2020/1'
urlwk2 = 'https://nextgenstats.nfl.com/stats/receiving/2020/2'

df11 = pd.read_html(urlwk1)
df11[0].to_csv('NFL_Receiving_Page1.csv', index=False)  # index=False drops the row-index column that would otherwise appear first in the CSV

Here is the error I get:

df11 = pd.read_html(urlwk1)
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1101, in read_html
    displayed_only=displayed_only,
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

df11[0].to_csv ('NFL_Receiving_Page1.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
Traceback (most recent call last):
  File "", line 1, in
NameError: name 'df11' is not defined


2 answers
User
Answer 1

pandas.read_html cannot parse HTML tables that are loaded dynamically.

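The page fetches the table data through an API call, so the table is not present in the HTML that the server first returns.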
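One way to see why read_html reports "No tables found" is to count the table tags in the raw HTML before any JavaScript runs. The sketch below uses only the standard library; TableCounter, count_tables, and the two sample HTML strings are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in a piece of HTML (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == 'table':
            self.tables += 1


def count_tables(html_text):
    """Return how many <table> elements the raw HTML contains."""
    counter = TableCounter()
    counter.feed(html_text)
    return counter.tables


# A page whose table is part of the served HTML.
static_page = '<html><body><table><tr><td>1</td></tr></table></body></html>'
# A page that only ships an empty container and a script (hypothetical markup).
dynamic_shell = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'

print(count_tables(static_page))    # 1
print(count_tables(dynamic_shell))  # 0
```

Passing the text of a plain HTTP response for a page to count_tables and getting 0 suggests that the table is filled in later by JavaScript, which is the case read_html cannot handle.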

You can use the code below to fetch and parse the API response.

from io import StringIO

import pandas as pd
import requests

# Browser-like headers sent with the API request
headers = {
    'accept': 'application/json, text/plain, */*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'referer': 'https://nextgenstats.nfl.com/',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}

response = requests.get('https://appapi.ngs.nfl.com/statboard/receiving?season=2020&seasonType=REG&week=2', headers=headers)
response.raise_for_status()  # stop early on an HTTP error

# Wrap the JSON text in StringIO; newer pandas versions deprecate passing literal JSON
df = pd.read_json(StringIO(response.text))
df.to_csv('NFL_Receiving_Page1.csv', index=False)

See it in action here.
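The question lists one page per week, so the same API call would need to be repeated with a different week value. The following is a minimal sketch of building those URLs; build_api_url is a hypothetical helper, the endpoint and parameter names are copied from the request above, and it is only an assumption that other weeks respond in the same way.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer above; the parameters mirror its query string.
API_BASE = 'https://appapi.ngs.nfl.com/statboard/receiving'


def build_api_url(season, week, season_type='REG'):
    """Return the statboard URL for one season and week (hypothetical helper)."""
    query = urlencode({'season': season, 'seasonType': season_type, 'week': week})
    return f'{API_BASE}?{query}'


# One URL per week; weeks 1 and 2 correspond to the two pages in the question.
urls = {week: build_api_url(2020, week) for week in (1, 2)}
for week, url in urls.items():
    print(week, url)
```

Each URL could then be passed to requests.get with the same headers and written to its own CSV file, for example one file per week.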

User
Answer 2

Read HTML using Selenium Driver and read_html

I think the page you mention is loaded dynamically. Please refer to the post above, then try the code below.

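from io import StringIO
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chromedriver_path = '/home/user/chromedriver'  # adjust to your chromedriver location

# Selenium 4 style; Selenium 3 used webdriver.Chrome(chromedriver_path, chrome_options=chrome_options)
d = webdriver.Chrome(service=Service(chromedriver_path), options=chrome_options)
d.get('https://nextgenstats.nfl.com/stats/receiving/2020/1')
time.sleep(3)  # give the page's JavaScript time to render the table
html = d.page_source
d.quit()  # close the browser once the HTML has been captured
df = pd.read_html(StringIO(html))  # list of DataFrames, one per <table> found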

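Once the Chrome driver is installed correctly, this code works on any system. Adjust the time.sleep() duration to your internet speed, and set the chromedriver path to match your system.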
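Since the right sleep duration depends on connection speed, an alternative is to poll until the table shows up instead of waiting a fixed time. The following is a minimal pure-Python sketch of that idea; wait_until and its parameters are hypothetical names, and Selenium also ships its own explicit-wait mechanism (WebDriverWait) that serves the same purpose.

```python
import time


def wait_until(predicate, timeout=15.0, interval=0.5):
    """Call predicate repeatedly until it returns True or timeout seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check.
checks = {'count': 0}


def fake_page_ready():
    checks['count'] += 1
    return checks['count'] >= 3


print(wait_until(fake_page_ready, timeout=2.0, interval=0.01))  # True
```

With the driver from the code above, the call could look like wait_until(lambda: '<table' in d.page_source) in place of time.sleep(3), assuming the rendered table appears as a regular table element in the page source.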