BeautifulSoup scraper sometimes works fine and sometimes fails: does it need more exception handling?

Question

For some reason, this scraper for clutch.co works well on only one of the site's listings.

a. https://clutch.co/us/web-developers - the US listing: works very well.
b. https://clutch.co/il/web-developers - the Israel listing: does not work.

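When I run this code, it only fetches information from the first page and then closes by itself. I added a wait so the page can load, but that did not help. Watching the browser, I can see it scroll to the bottom of the page, and then it shuts down.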

The program below works for me, but only for the US listing and not for others such as the Israel one: a. https://clutch.co/us/web-developers - this runs fine. b. https://clutch.co/il/web-developers - it stops and returns a lot of errors.

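It looks like locating the elements with class name 'provider-info' sometimes fails. I guess this is due to a change in the site's structure or to a timing issue. I think the possible exceptions should be handled; this version works for me: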

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time

website = "https://clutch.co/us/web-developers"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)

driver = webdriver.Chrome(options=options)
driver.get(website)

wait = WebDriverWait(driver, 10)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except Exception:  # no next-page link found, or navigation failed
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []

current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break

    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)

            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)

            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)

            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)

            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue

    current_page += 1

    if not navigate_to_next_page():
        break

driver.close()

data = {'Company_Name': company_names, 'Tagline': taglines, 'location': locations, 'Ticket_Price': costs, 'Rating': ratings}
df = pd.DataFrame(data)
df.to_csv('companies_test1.csv', index=False)
print(df)

It returned the following:

  import pandas as pd
Timeout Exception occurred while waiting for company elements.
                    Company_Name  ... Rating
0           Hyperlink InfoSystem  ...    4.9
1             Plego Technologies  ...    5.0
2                  Azuro Digital  ...    4.9
3                     Savas Labs  ...    5.0
4               The Gnar Company  ...    4.8
5            Sunrise Integration  ...    5.0
6             Baytech Consulting  ...    5.0
7                Inventive Works  ...    4.9
8                        Utility  ...    4.8
9                     Busy Human  ...    5.0
10                     Rootstrap  ...    4.8
11                        micro1  ...    4.9
12                  ChopDawg.com  ...    4.8
13             Emergent Software  ...    4.9
14         Beehive Software Inc.  ...    5.0
15                   3 Media Web  ...    4.9
16                     Webstacks  ...    5.0
17                Mutually Human  ...    5.0
18                    AnyforSoft  ...    4.8
19                  NL Softworks  ...    5.0
20  OpenSource Technologies Inc.  ...    4.8
21                Marcel Digital  ...    4.8
22                      Twin Sun  ...    5.0
23          SPARK Business Works  ...    4.9
24                        Darwin  ...    4.9
25                       Perrill  ...    5.0
26                          Nimi  ...    4.9
27                        Scopic  ...    4.9
28        Interactive Strategies  ...    4.9
29        Unleashed Technologies  ...    4.9
30                         Oyova  ...    4.9
31                  BrandExtract  ...    4.9
32             The Brick Factory  ...    4.9
33             My Web Programmer  ...    5.0
34                PureLogics LLC  ...    4.9
35                 Social Driver  ...    4.9
36            Calibrate Software  ...    4.9
37                    VisualFizz  ...    5.0
38               Camber Creative  ...    4.9
39               Susco Solutions  ...    4.9
40                  Lunarbyte.io  ...    5.0
41                    thoughtbot  ...    4.9
42         CR Software Solutions  ...    5.0
43             Solwey Consulting  ...    5.0
44                        Ambaum  ...    4.9
45          Pacific Codeline LLC  ...    5.0
46                          PERC  ...    5.0
47                   Beesoul LLC  ...    4.9
48                  Novalab Tech  ...    5.0
49                   Dragon Army  ...    5.0

[50 rows x 5 columns]

Process finished with exit code 0

It also stored the following data:

Company_Name,Tagline,Location,Ticket_Price,Rating,Website_Name,URL
Hyperlink InfoSystem,"#1 Mobile App, Web, & Software Development Company","Jersey City, NJ","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Plego Technologies,Shaping the Future of Technology,"Downers Grove, IL","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Azuro Digital,"Award-Winning Web Design, Development & SEO","New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
App Makers USA,Top US Mobile & Web App Development Agency,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
ChopDawg.com,Dreams Delivered Since 2009. Let's Make It App'n!®,"Philadelphia, PA","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Savas Labs,Designing and developing elegant web products.,"Raleigh, NC","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Gnar Company,Solving Gnarly Software Problems. Faster.,"Boston, MA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Sunrise Integration,Enterprise Solutions & Ecommerce Apps,"Los Angeles, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Baytech Consulting,TRANSLATING YOUR VISION INTO SOFTWARE,"Irvine, CA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Inventive Works,Custom Software Product Development,"Manor, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Utility,AWARD-WINNING MOBILE DESIGN & DEVELOPMENT AGENCY,"New York, NY","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Busy Human,Making life more user-friendly,"Orem, UT","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Rootstrap,Outcome-driven development. At any scale.,"Beverly Hills, CA","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
micro1,"World-class software engineers, powered by AI","Los Angeles, CA","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Emergent Software,Your Full-Stack Technology Partner,"Saint Paul, MN","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
3 Media Web,Award-Winning Digital Experience Agency ,"Marlborough, MA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Beehive Software Inc.,Software reinvented,"Los Gatos, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Webstacks,"The website is a product, not a project.","San Diego, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Mutually Human,Custom Software Development and Design,"Ada, MI","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
AnyforSoft,Amplify digital excellence with AnyforSoft,"Sarasota, FL","$50,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
NL Softworks,Website Design & Development Made to Convert,"Boston, MA","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
OpenSource Technologies Inc.,Web & Mobile APP | Digital Marketing | Cloud,"Lansdale, PA","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Twin Sun,Trustworthy partners that deliver results,"Nashville, TN","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Marcel Digital,Changing the Idea of What an Agency Is And Can Be,"Chicago, IL","$5,000+",4.7,Top Web Developers in the United States,https://clutch.co/us/web-developers
Darwin,We create incredible digital experiences,"Reston, VA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
SPARK Business Works,Award-winning custom software dev & web design,"Kalamazoo, MI","$5,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Nimi,"Bring your product ideas to life, to Grow Today.","Oakland, CA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Scopic,"Your Cross-continental, Digital Innovation Partner","Rutland, MA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Interactive Strategies,"Full Service Digital Design, Dev & Marketing","Washington, DC","$100,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Unleashed Technologies,Unleash Your Potential®,"Ellicott City, MD","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Social Driver,Experience digital with us.,"Washington, DC","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Oyova,More Business For Your Business Is Our Business.™,"Jacksonville Beach, FL","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
The Brick Factory,A DC-based digital agency.,"Washington, DC","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
My Web Programmer,→Top-Quality Custom Software & Web Development Co.,"Atlanta, GA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
PureLogics LLC,No Magic. Just Logic.,"New York, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
BrandExtract,"We inspire people to create, transform, and grow.","Houston, TX","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Calibrate Software,We craft digital experiences that spark joy ,"Chicago, IL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Camber Creative,Things worth building are worth building well.,"Orlando, FL","$25,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
VisualFizz,Impactful Marketing for Industry-Leading Brands,"Chicago, IL","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Susco Solutions,Solve Together | Developing Intuitive Software,"Harvey, LA","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Lunarbyte.io,Launching big ideas with startups & enterprises,"Seattle, WA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CR Software Solutions,Innovative Digital Solutions For Your Business,"Canton, MI","$5,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Ambaum,Ambaum is your Shopify Plus Agency,"Burien, WA","$5,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Solwey Consulting,Custom software solutions to elevate your business,"Austin, TX","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Pacific Codeline LLC,"Reliable, Experienced, 100% U.S. based.","San Clemente, CA","$1,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Novalab Tech,Your Trusted IT Partner,"San Francisco, CA","$10,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
Dragon Army,A purpose-driven digital engagement company.,"Atlanta, GA","$25,000+",5.0,Top Web Developers in the United States,https://clutch.co/us/web-developers
CodigoDelSur,Rockstar coders for rockstar companies,"Montevideo, Uruguay","$75,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
Brainhub,Top 1.36% engineering team - onboarding in 10 days,"Gliwice, Poland","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Curotec,Your digital product engineering department,"Philadelphia, PA","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
TekRevol,Creative Web | App | Software Development Company,"Houston, TX","$25,000+",4.8,Top Web Developers in the United States,https://clutch.co/us/web-developers
XWP,Building a better web at enterprise scale,"New York, NY","$50,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers
Five Jars,⭐️⭐️⭐️⭐️⭐️ OUTSTANDING WEB DESIGN & DEVELOPMENT,"Brooklyn, NY","$10,000+",4.9,Top Web Developers in the United States,https://clutch.co/us/web-developers

But wait: if we choose a different base URL, it no longer works: https://clutch.co/il/web-developers

company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.
Traceback (most recent call last):
  File "/home/ubuntu/.config/JetBrains/PyCharmCE2023.3/scratches/scratch.py", line 74, in <module>
    df = pd.DataFrame(data)
         ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/frame.py", line 767, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/lib/python3.11/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Process finished with exit code 1

I think this has to do with some exceptions:

  import pandas as pd
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Element not found while extracting company details.
Timeout Exception occurred while waiting for company elements.

I think there may be several problems:

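First, elements were not found while extracting company details: for some companies, the script cannot find one or more of the elements it looks for. This may be due to a change in the site's structure or layout. I think we can handle this, either by adding more error handling or by refining the XPath expressions.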
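One way to keep a single missing element from discarding a whole company is to wrap each lookup separately. The sketch below is a minimal, browser-free illustration: `optional` is a hypothetical helper (not part of Selenium), and a plain dict with `LookupError` stands in for a company element and `NoSuchElementException`.

```python
def optional(getter, missing=(LookupError,), default=None):
    """Return getter(), or `default` if it raises one of the `missing` exceptions."""
    try:
        return getter()
    except missing:
        return default


# Hypothetical stand-in for one company card that has no rating.
card = {"name": "Example Ltd", "tagline": "We build things"}

record = {
    "name": optional(lambda: card["name"]),
    "tagline": optional(lambda: card["tagline"]),
    "rating": optional(lambda: card["rating"]),  # missing key -> None
}
print(record)
```

With Selenium, the same helper could presumably be called with `missing=(NoSuchElementException,)` and a lambda wrapping `company_element.find_element(...).text`, so that a company without, say, a rating still yields a complete record.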

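Across several attempts, a timeout exception also occurred while waiting for the company elements: the script timed out waiting for elements on the page to load.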
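To make the timeout easier to reason about, here is a simplified stand-in for the polling that an explicit wait performs. It is not Selenium's implementation; `wait_until`, `cards_present`, and the fake page state are made up for illustration. The point is that a timeout only says the condition never became true within the window. It cannot tell a slow page apart from a selector that matches nothing on that page.

```python
import time


def wait_until(condition, timeout=2.0, poll=0.1):
    """Call `condition` repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical page state: the cards "appear" on the third poll.
polls = {"count": 0}


def cards_present():
    polls["count"] += 1
    return ["card-1", "card-2"] if polls["count"] >= 3 else []


found = wait_until(cards_present)

# A condition that never becomes true, e.g. a class name the page does not use.
try:
    wait_until(lambda: [], timeout=0.3)
    timed_out = False
except TimeoutError:
    timed_out = True

print(found, timed_out)
```

If the real script times out on one listing but not on another, it may be worth saving `driver.page_source` at that moment and checking whether the expected class name appears in it at all.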

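Finally, I also got a ValueError: All arrays must be of the same length. This error occurs because the lists used to build the DataFrame have different lengths, which usually happens when one or more data points were not collected correctly.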
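The length mismatch can be reproduced without a browser. In the sketch below, plain dicts stand in for the scraped company cards (the sample records and the `FIELDS` tuple are invented for illustration), and `KeyError` plays the role that `NoSuchElementException` plays in the Selenium loop. It contrasts the parallel-list pattern with building one record per company.

```python
FIELDS = ("name", "tagline", "rating")

# Hypothetical stand-ins for scraped cards; the second one has no rating.
cards = [
    {"name": "A", "tagline": "t1", "rating": "4.9"},
    {"name": "B", "tagline": "t2"},
    {"name": "C", "tagline": "t3", "rating": "5.0"},
]


def collect_parallel(cards):
    """Parallel lists: a failure partway through a card leaves the lists uneven."""
    columns = {field: [] for field in FIELDS}
    for card in cards:
        try:
            for field in FIELDS:
                columns[field].append(card[field])  # KeyError when missing
        except KeyError:
            continue  # earlier fields of this card were already appended
    return columns


def collect_rows(cards):
    """One dict per card: a missing field becomes None and rows stay aligned."""
    return [{field: card.get(field) for field in FIELDS} for card in cards]


parallel = collect_parallel(cards)
rows = collect_rows(cards)

print({field: len(values) for field, values in parallel.items()})
print(parallel["name"][1], parallel["rating"][1])  # B is now paired with C's rating
print(rows[1])
```

Note that trimming every list to the shortest length makes the `ValueError` go away but keeps the shifted values: here company B would end up with company C's rating. A list of dicts such as `rows` can be passed directly to `pd.DataFrame(rows)`, which gives one row per company.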

See the code I used below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time

website = "https://clutch.co/il/it-services"
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)

driver = webdriver.Chrome(options=options)
driver.get(website)

wait = WebDriverWait(driver, 20)

# Function to handle page navigation
def navigate_to_next_page():
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
        np = next_page.get_attribute('href')
        driver.get(np)
        time.sleep(6)
        return True
    except Exception:  # no next-page link found, or navigation failed
        return False

company_names = []
taglines = []
locations = []
costs = []
ratings = []
websites = []

current_page = 1
last_page = 250

while current_page <= last_page:
    try:
        company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    except TimeoutException:
        print("Timeout Exception occurred while waiting for company elements.")
        break

    for company_element in company_elements:
        try:
            company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
            company_names.append(company_name)

            tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
            taglines.append(tagline)

            rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
            ratings.append(rating)

            location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
            locations.append(location)

            cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
            costs.append(cost)

            # Extracting website URL
            website_element = company_element.find_element(By.XPATH, './/a[@class="website-link"]')
            website_url = website_element.get_attribute('href')
            websites.append(website_url)
        except NoSuchElementException:
            print("Element not found while extracting company details.")
            continue

    current_page += 1

    if not navigate_to_next_page():
        break

driver.close()

# Ensure all arrays have the same length
min_length = min(len(company_names), len(taglines), len(locations), len(costs), len(ratings), len(websites))
company_names = company_names[:min_length]
taglines = taglines[:min_length]
locations = locations[:min_length]
costs = costs[:min_length]
ratings = ratings[:min_length]
websites = websites[:min_length]

data = {'Company_Name': company_names, 'Tagline': taglines, 'Location': locations, 'Ticket_Price': costs, 'Rating': ratings, 'Website': websites}
df = pd.DataFrame(data)

# Check if DataFrame is empty
if not df.empty:
    df.to_csv('companies_test10.csv', index=False)
    print(df)
else:
    print("DataFrame is empty. No data to save.")

exception-handling data-extraction xpath dataframe web-scraping website-structure timeout-exception element-lookup


2 Answers

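`Element not found while extracting company details.`

`Timeout Exception occurred while waiting for company elements.`

`ValueError: All arrays must be of the same length`

Use the API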
