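BeautifulSoup scraper sometimes works well and sometimes fails: does it need more exception handling?
For some reason this clutch.co scraper works well on one page but not on another.
- a. https://clutch.co/us/web-developers - the US category: works very well.
- b. https://clutch.co/il/web-developers - the Israel category: does not work.
When I run the code, it only fetches information from the first page and then closes itself. I added a wait so the page can load, but it did not help. Watching the browser, I can see it scroll to the bottom of the page, and then it closes.
The program below works for me, but only for the US page. On other pages, such as the Israel one, it stops and returns a lot of errors.
It looks like finding elements with the class name 'provider-info' sometimes fails. My guess is that this is caused by a change in the site structure or by a timing issue. I think the possible exceptions should be handled; this works for me: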
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time

website = "https://clutch.co/us/web-developers"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)
wait = WebDriverWait(driver, 10)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except:
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []
current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break
    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)
            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)
            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)
            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)
            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue
    current_page += 1
    if not navigate_to_next_page():
        break

driver.close()
data = {'Company_Name': company_names, 'Tagline': taglines, 'location': locations, 'Ticket_Price': costs, 'Rating': ratings}
df = pd.DataFrame(data)
df.to_csv('companies_test1.csv', index=False)
print(df)
It returned the following:
Timeout Exception occurred while waiting for company elements.
Company_Name ... Rating
0 Hyperlink InfoSystem ... 4.9
1 Plego Technologies ... 5.0
2 Azuro Digital ... 4.9
3 Savas Labs ... 5.0
4 The Gnar Company ... 4.8
5 Sunrise Integration ... 5.0
6 Baytech Consulting ... 5.0
7 Inventive Works ... 4.9
8 Utility ... 4.8
9 Busy Human ... 5.0
10 Rootstrap ... 4.8
11 micro1 ... 4.9
12 ChopDawg.com ... 4.8
13 Emergent Software ... 4.9
14 Beehive Software Inc. ... 5.0
15 3 Media Web ... 4.9
16 Webstacks ... 5.0
17 Mutually Human ... 5.0
18 AnyforSoft ... 4.8
19 NL Softworks ... 5.0
20 OpenSource Technologies Inc. ... 4.8
21 Marcel Digital ... 4.8
22 Twin Sun ... 5.0
23 SPARK Business Works ... 4.9
24 Darwin ... 4.9
25 Perrill ... 5.0
26 Nimi ... 4.9
27 Scopic ... 4.9
28 Interactive Strategies ... 4.9
29 Unleashed Technologies ... 4.9
30 Oyova ... 4.9
31 BrandExtract ... 4.9
32 The Brick Factory ... 4.9
33 My Web Programmer ... 5.0
34 PureLogics LLC ... 4.9
35 Social Driver ... 4.9
36 Calibrate Software ... 4.9
37 VisualFizz ... 5.0
38 Camber Creative ... 4.9
39 Susco Solutions ... 4.9
40 Lunarbyte.io ... 5.0
41 thoughtbot ... 4.9
42 CR Software Solutions ... 5.0
43 Solwey Consulting ... 5.0
44 Ambaum ... 4.9
45 Pacific Codeline LLC ... 5.0
46 PERC ... 5.0
47 Beesoul LLC ... 4.9
48 Novalab Tech ... 5.0
49 Dragon Army ... 5.0
[50 rows x 5 columns]
Process finished with exit code 0
and the following data was stored:
Company_Name,Tagline,Location,Ticket_Price,Rating,Website_Name,URL
Hyperlink InfoSystem,"#1 Mobile App, Web, & Software Development Company","Jersey City, NJ","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Plego Technologies,Shaping the Future of Technology,"Downers Grove, IL","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Azuro Digital,"Award-Winning Web Design, Development & SEO","New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
App Makers USA,Top US Mobile & Web App Development Agency,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
ChopDawg.com,Dreams Delivered Since 2009. Let's Make It App'n!®,"Philadelphia, PA","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Savas Labs,Designing and developing elegant web products.,"Raleigh, NC","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Gnar Company,Solving Gnarly Software Problems. Faster.,"Boston, MA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Sunrise Integration,Enterprise Solutions & Ecommerce Apps,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Baytech Consulting,TRANSLATING YOUR VISION INTO SOFTWARE,"Irvine, CA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Inventive Works,Custom Software Product Development,"Manor, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Utility,AWARD-WINNING MOBILE DESIGN & DEVELOPMENT AGENCY,"New York, NY","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Busy Human,Making life more user-friendly,"Orem, UT","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Rootstrap,Outcome-driven development. At any scale.,"Beverly Hills, CA","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
micro1,"World-class software engineers, powered by AI","Los Angeles, CA","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Emergent Software,Your Full-Stack Technology Partner,"Saint Paul, MN","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
3 Media Web,Award-Winning Digital Experience Agency ,"Marlborough, MA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Beehive Software Inc.,Software reinvented,"Los Gatos, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Webstacks,"The website is a product, not a project.","San Diego, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Mutually Human,Custom Software Development and Design,"Ada, MI","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
AnyforSoft,Amplify digital excellence with AnyforSoft,"Sarasota, FL","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
NL Softworks,Website Design & Development Made to Convert,"Boston, MA","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
OpenSource Technologies Inc.,Web & Mobile APP | Digital Marketing | Cloud,"Lansdale, PA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Twin Sun,Trustworthy partners that deliver results,"Nashville, TN","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Marcel Digital,Changing the Idea of What an Agency Is And Can Be,"Chicago, IL","$5,000+",4.7,Top Web Developers in the United States,https://clutch.co/us/web-developers
Darwin,We create incredible digital experiences,"Reston, VA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
SPARK Business Works,Award-winning custom software dev & web design,"Kalamazoo, MI","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Nimi,"Bring your product ideas to life, to Grow Today.","Oakland, CA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Scopic,"Your Cross-continental, Digital Innovation Partner","Rutland, MA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Interactive Strategies,"Full Service Digital Design, Dev & Marketing","Washington, DC","$100,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Unleashed Technologies,Unleash Your Potential®,"Ellicott City, MD","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Social Driver,Experience digital with us.,"Washington, DC","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Oyova,More Business For Your Business Is Our Business.™,"Jacksonville Beach, FL","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Brick Factory,A DC-based digital agency.,"Washington, DC","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
My Web Programmer,→Top-Quality Custom Software & Web Development Co.,"Atlanta, GA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
PureLogics LLC,No Magic. Just Logic.,"New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
BrandExtract,"We inspire people to create, transform, and grow.","Houston, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Calibrate Software,We craft digital experiences that spark joy ,"Chicago, IL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Camber Creative,Things worth building are worth building well.,"Orlando, FL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
VisualFizz,Impactful Marketing for Industry-Leading Brands,"Chicago, IL","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Susco Solutions,Solve Together | Developing Intuitive Software,"Harvey, LA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Lunarbyte.io,Launching big ideas with startups & enterprises,"Seattle, WA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CR Software Solutions,Innovative Digital Solutions For Your Business,"Canton, MI","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Ambaum,Ambaum is your Shopify Plus Agency,"Burien, WA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Solwey Consulting,Custom software solutions to elevate your business,"Austin, TX","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Pacific Codeline LLC,"Reliable, Experienced, 100% U.S. based.","San Clemente, CA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Novalab Tech,Your Trusted IT Partner,"San Francisco, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Dragon Army,A purpose-driven digital engagement company.,"Atlanta, GA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CodigoDelSur,Rockstar coders for rockstar companies,"Montevideo, Uruguay","$75,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Brainhub,Top 1.36% engineering team - onboarding in 10 days,"Gliwice, Poland","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Curotec,Your digital product engineering department,"Philadelphia, PA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
TekRevol,Creative Web | App | Software Development Company,"Houston, TX","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
XWP,Building a better web at enterprise scale,"New York, NY","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Five Jars,⭐️⭐️⭐️⭐️⭐️ OUTSTANDING WEB DESIGN & DEVELOPMENT,"Brooklyn, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Hmm, but wait: if we choose a different base URL, it no longer works: https://clutch.co/il/web-developers
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
Traceback (most recent call last):
File "/home/ubuntu/.config/JetBrains/PyCharmCE2023.3/scratches/scratch.py", line 74, in <module>
df = pd.DataFrame(data)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 767, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
Process finished with exit code 1
I think this has to do with some exceptions:
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
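I think there may be several problems:
First, "Element not found while extracting company details": for some companies, certain elements cannot be found. This could be due to a change in the site structure or layout. I think we can deal with it by adding more error handling or by refining our XPath expressions.
Across several attempts there was also "Timeout Exception occurred while waiting for company elements", which means the script timed out while waiting for elements on the page to load.
Finally, I also hit "ValueError: All arrays must be of the same length". This happens because the arrays used to build the DataFrame have different lengths, which usually means one or more data points were not collected.
See the code I used below: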
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time

website = "https://clutch.co/il/it-services"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)
wait = WebDriverWait(driver, 20)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except:
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []
websites = []
current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break
    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)
            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)
            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)
            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)
            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
            # Extracting website URL
            website_element = company_element.find_element(By.XPATH, './/a[@class="website-link"]')
            website_url = website_element.get_attribute('href')
            websites.append(website_url)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue
    current_page += 1
    if not navigate_to_next_page():
        break

driver.close()

# Ensure all arrays have the same length
min_length = min(len(company_names), len(taglines), len(locations), len(costs), len(ratings), len(websites))
company_names = company_names[:min_length]
taglines = taglines[:min_length]
locations = locations[:min_length]
costs = costs[:min_length]
ratings = ratings[:min_length]
websites = websites[:min_length]
data = {'Company_Name': company_names, 'Tagline': taglines, 'Location': locations, 'Ticket_Price': costs, 'Rating': ratings, 'Website': websites}
df = pd.DataFrame(data)

# Check if DataFrame is empty
if not df.empty:
    df.to_csv('companies_test10.csv', index=False)
    print(df)
else:
    print("DataFrame is empty. No data to save.")
2 Answers
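Disclaimer: I assume we are all adults here and know that scraping data is illegal in some situations.
Edit:
I made some small changes to the code and added a seleniumbase initialization that uses undetectable Chrome mode (uc=True). That part is out of scope here, but it is worth reading up on if you want to avoid detection. I particularly like seleniumbase because it automatically manages downloading chromedriver and matching it to the installed Chrome version.
I suggest wrapping the lookup of each individual element in try / except. That way every list always ends up with the same number of results (afterwards you can check which ones are None). More importantly, it avoids bad practices such as slicing the results, which can give unreliable output when the data varies.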
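As a small illustration of why slicing is unreliable, here is a sketch that needs no browser. The company names and ratings are made up; the point is only to compare "append what was found, then slice" with "always append, using None as a placeholder".

```python
# Three hypothetical companies on one page; the second has no rating element.
scraped = [
    {"name": "Alpha", "rating": "4.9"},
    {"name": "Beta"},  # rating missing
    {"name": "Gamma", "rating": "5.0"},
]

# Approach 1: append only what was found, then slice to the shortest list.
names = [company["name"] for company in scraped]
ratings = [company["rating"] for company in scraped if "rating" in company]
shortest = min(len(names), len(ratings))
sliced_rows = list(zip(names[:shortest], ratings[:shortest]))

# Approach 2: always append one value per company, None when it is missing.
padded_rows = [(company["name"], company.get("rating")) for company in scraped]

print(sliced_rows)  # Beta ends up with Gamma's rating and Gamma is dropped
print(padded_rows)  # every company keeps its own value
```

With slicing, the lists have equal length again, but the rows no longer describe the same company.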
Take a look at this snippet:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from seleniumbase import Driver
import pandas as pd
import time

website = "https://clutch.co/il/it-services"
driver = Driver(uc=True)
driver.get(website)
wait = WebDriverWait(driver, 20)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(
            By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]'
        )
        np = next_page.get_attribute("href")
        driver.get(np)
        time.sleep(6)
        return True
    except Exception:
        return False

# Return the child element, or None if it cannot be located
def find_in_company_element(element, element_str, element_type=By.XPATH):
    try:
        return element.find_element(element_type, element_str)
    except Exception:
        return None

data = dict(
    company_name=[],
    tagline=[],
    location=[],
    cost=[],
    rating=[],
    website=[],
)
current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        try:
            company_elements = wait.until(
                EC.presence_of_all_elements_located((By.CLASS_NAME, "provider-info"))
            )
        except TimeoutException:
            print("Timeout Exception occurred while waiting for company elements.")
            break
        for company_element in company_elements:
            # getattr(..., "text", None) yields None when the element is missing
            data["company_name"].append(
                getattr(
                    find_in_company_element(
                        company_element, "company_info", element_type=By.CLASS_NAME
                    ),
                    "text",
                    None,
                )
            )
            data["tagline"].append(
                getattr(
                    find_in_company_element(
                        company_element, './/p[@class="company_info__wrap tagline"]'
                    ),
                    "text",
                    None,
                )
            )
            data["rating"].append(
                getattr(
                    find_in_company_element(
                        company_element, './/span[@class="rating sg-rating__number"]'
                    ),
                    "text",
                    None,
                )
            )
            data["location"].append(
                getattr(
                    find_in_company_element(
                        company_element, './/span[@class="locality"]'
                    ),
                    "text",
                    None,
                )
            )
            data["cost"].append(
                getattr(
                    find_in_company_element(
                        company_element,
                        './/div[@class="list-item block_tag custom_popover"]',
                    ),
                    "text",
                    None,
                )
            )
            # A stray ")" at the end of this XPath made the selector invalid
            website_element = find_in_company_element(
                company_element,
                './/a[@class="website-link"]',
            )
            data["website"].append(
                website_element.get_attribute("href")
                if website_element is not None
                else None
            )
        current_page += 1
        if not navigate_to_next_page():
            break
    except KeyboardInterrupt:
        print("KeyboardInterrupt, halting")
        break

driver.close()
df = pd.DataFrame(data)

# Check if DataFrame is empty
if not df.empty:
    df.to_csv("companies_test10.csv", index=False)
    print(df)
else:
    print("DataFrame is empty. No data to save.")
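My guess is that with a plain selenium initialization this code gets blocked by Cloudflare, while it works in undetectable Chrome mode. I also think the current snippet may miss the rating in some cases, and the website column may come out completely empty, so it probably needs another review.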
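To check afterwards which values came back as None, something like the following sketch could be run on the collected dict before building the DataFrame. The helper name count_missing and the sample values are made up for illustration; only the shape (a dict of equal-length lists) matches the data dict above.

```python
def count_missing(data):
    """Return {column: number of None entries} for a dict of lists."""
    return {
        column: sum(value is None for value in values)
        for column, values in data.items()
    }


# Hypothetical sample in the same shape as the scraper's `data` dict.
sample = {
    "company_name": ["Alpha", "Beta", "Gamma"],
    "rating": ["4.9", None, "5.0"],
    "website": [None, None, None],
}

missing = count_missing(sample)
total = len(sample["company_name"])
for column, n_missing in missing.items():
    # A column that is missing everywhere usually points at a bad selector.
    note = " (selector may never match)" if n_missing == total else ""
    print(f"{column}: {n_missing}/{total} missing{note}")
```

A column that is None in every row is more likely a selector problem than genuinely absent data, so that is where I would look first.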
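Unfortunately, I don't think scraping is a practical solution here. I suggest you use the API.
Let's go through your problems one by one:
Element not found while extracting company details
This one is simple. An element could not be found on the page, so you can put something else in its place in the lists you use to collect the data: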
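[...]
except NoSuchElementException:
    print("Element not found while extracting company details.")
    # Note: this keeps the lists aligned only if nothing was appended
    # for this company before the lookup failed.
    company_names.append("")
    taglines.append("")
    ratings.append("")
    locations.append("")
    costs.append("")
    continue
[...]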
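One way to make the placeholder approach safe regardless of which lookup fails is to build one complete row per company and append it once. The sketch below is an assumption-laden illustration, not your scraper: FIELDS, extract_row and the fake card are hypothetical, and the lookup callable stands in for the find_element(...).text calls.

```python
FIELDS = ("company_name", "tagline", "rating", "location", "cost")


def extract_row(lookup):
    """Build one full row; lookup(field) returns text or raises if absent."""
    row = {}
    for field in FIELDS:
        try:
            row[field] = lookup(field)
        except Exception:  # with Selenium this would be NoSuchElementException
            row[field] = ""  # placeholder for this field only
    return row


# Stand-in for a company card whose tagline and cost are missing.
fake_card = {"company_name": "Alpha", "rating": "4.9", "location": "Tel Aviv"}

rows = []
rows.append(extract_row(lambda field: fake_card[field]))
print(rows)
```

Because each company contributes exactly one dict, the columns cannot drift apart, and pd.DataFrame(rows) accepts a list of dicts directly.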
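Timeout Exception occurred while waiting for company elements
This is your main problem. clutch.co uses Cloudflare, and after a certain number of requests it throttles you and redirects you to a CAPTCHA page. They do this precisely to stop automated bots from collecting their data.
When that happens you get a TimeoutException: the data takes so long to load that selenium assumes it will never load and raises the exception. You could increase the timeout, but that is neither practical nor sustainable.
First, you would have to solve a CAPTCHA for every page, which takes time. You could pay a service to solve them for you, but that costs money.
Second, and more importantly, if you keep sending automated requests through Cloudflare, they may blacklist your IP at some point, and then you would have to start using a proxy service. That costs money too.
If you really want to go down this road, you could try Cloudscraper.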
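To tell a slow page apart from a challenge page when the timeout fires, a rough check on the page title can help. This is only a heuristic sketch: the marker strings are my assumption about what such an interstitial is titled, not a documented contract, and the function name is made up. driver.title is the standard Selenium attribute.

```python
# Assumed title fragments of a challenge page; verify them against what
# the blocked browser window actually shows.
CHALLENGE_TITLE_MARKERS = ("just a moment", "attention required")


def looks_like_challenge(title):
    """Return True if the page title resembles an anti-bot interstitial."""
    lowered = (title or "").lower()
    return any(marker in lowered for marker in CHALLENGE_TITLE_MARKERS)


# Intended use inside the existing handler (not executed here):
#     except TimeoutException:
#         if looks_like_challenge(driver.title):
#             print("Blocked by a challenge page, not just slow.")
#         break

print(looks_like_challenge("Just a moment..."))
print(looks_like_challenge("Top Web Developers in Israel"))
```

If the check reports a challenge page, increasing the wait will not help, which matches the point above.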
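ValueError: All arrays must be of the same length
This is a consequence of the previous problems. Pandas expects all the data lists (company_names, taglines, locations, costs and ratings) to have the same length, because together they make up the rows of the DataFrame. When their lengths differ, you get this error.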
So something like this will not work...
df = pd.DataFrame({"a": [1, 2], "b": [1]}) # will raise ValueError
but this will:
df = pd.DataFrame({"a": [1, 2], "b": [1, 3]})
If you fix the problems above and collect all the data, this error goes away too.
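While debugging, it can help to see which list is short before pandas raises. The sketch below uses only the standard library; check_equal_lengths and the sample values are hypothetical and simply mirror the list names from the question.

```python
def check_equal_lengths(data):
    """Raise ValueError naming every column length if they are not all equal."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up lists where one rating was never appended.
collected = {
    "company_names": ["A", "B"],
    "taglines": ["x", "y"],
    "ratings": ["4.9"],
}

try:
    check_equal_lengths(collected)
    message = ""
except ValueError as err:
    message = str(err)
    print(message)
```

Calling this right before pd.DataFrame(data) shows at a glance which field was skipped.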
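Use the API
If the API provides all the data you need, I recommend using it. Even if it is paid, it is a much better option than trying to scrape: it is less error-prone and takes less development time. You may even end up saving money.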