BeautifulSoup4 and Pandas return empty DataFrame columns: Update: now using Selenium on Google Colab
I am looking for a public list of banks worldwide.
I don't need branches and full addresses, just the bank's name and website. The data format can be XML, CSV, etc., with the following fields: bank name, country name or country code (ISO two-letter), website; optionally also the city of the bank's headquarters. One record per bank per country is enough. By the way: small banks in particular are also interesting.
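The target layout could be sketched as a tiny CSV writer; the field names and the sample record below are my own assumptions, not taken from the site:

```python
import csv
import io

# Hypothetical target schema: one record per bank per country.
FIELDS = ["bank_name", "country_code", "website", "hq_city"]

def write_banks_csv(records, fh):
    """Write bank records to CSV; hq_city is optional and may stay empty."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)

# Example with a made-up sample record:
buf = io.StringIO()
write_banks_csv([{"bank_name": "Bank of Albania", "country_code": "AL",
                  "website": "https://www.bankofalbania.org"}], buf)
print(buf.getvalue())
```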
I found a very comprehensive page: it lists 9,000 European banks.
Browse from A to Z:
**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla
**B**
https://thebanks.eu/search?bank=&country=Belgium
**U**
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom
See the detail page: https://thebanks.eu/banks/9563
This is the data I need:
Contact: Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51, 071 944 27 52
https://www.bankbiz.ch/
My approach uses bs4, requests and pandas.
By the way: perhaps we could simply count from zero up to 100,000 in order to fetch all the banks stored in the database:
See the detail page: https://thebanks.eu/banks/9563
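A sketch of that id enumeration; the URL pattern is an assumption based on the single detail page above, and actually downloading each page is deliberately left out (it would also need the Cloudflare handling discussed further down):

```python
# Sketch: enumerate numeric detail-page ids; pattern assumed from
# https://thebanks.eu/banks/9563. No requests are made here on purpose.
BASE = "https://thebanks.eu/banks/{}"

def bank_detail_urls(start=0, stop=100_000):
    """Yield candidate detail-page URLs for ids in [start, stop)."""
    for bank_id in range(start, stop):
        yield BASE.format(bank_id)

sample_urls = list(bank_detail_urls(9563, 9566))
print(sample_urls)
```

Note that most ids in such a range will likely be unused, so a real crawl should treat 404 responses as normal and pause between requests.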
I ran this on Colab:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Function to scrape bank data from my URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")
    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)
# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# subsequently we print the DataFrame
print(df)
Here is what it returned:
Empty DataFrame
Columns: []
Index: []
Hmm, I think something is wrong with the scraping. I tried different approaches and checked the elements on the page again and again to make sure I was extracting the right information.
I should also print some additional debugging information to help diagnose the issue.
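Such a diagnostic could be a small stdlib-only helper; the Cloudflare marker strings are assumptions, and the class names are the ones the script above relies on:

```python
def diagnose_response(status_code, html):
    """Collect hints about why a scrape might have come back empty."""
    problems = []
    if status_code != 200:
        problems.append(f"HTTP status {status_code} instead of 200")
    lowered = html.lower()
    # Typical (assumed) markers of a Cloudflare challenge page:
    if "cloudflare" in lowered or "just a moment" in lowered:
        problems.append("response looks like a Cloudflare challenge page")
    # The CSS classes the scraper relies on:
    for marker in ("search-bank", "entry-title"):
        if marker not in html:
            problems.append(f"expected class '{marker}' not found in HTML")
    return problems

# Example with a fake challenge page:
print(diagnose_response(403, "<title>Just a moment...</title>"))
```

Calling `diagnose_response(response.status_code, response.text)` right after each `requests.get` would show immediately whether the page content ever contained the expected elements.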
Update: Good evening, dear @Asish M. and @eternal_white, and thanks for your comments and the ideas you shared; plenty to think about. Regarding Selenium, I think it is a good idea to run it on Google Colab, an approach I picked up from Jacob Padilla.
@Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla, and Selenium for Google Colab: https://github.com/jpjacobpadilla/Google-Colab-Selenium, with these default options:
The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
• --headless: Runs Chrome in headless mode (without a GUI).
• --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment.
• --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers.
• --lang=en: Sets the language to English.
I think this approach is worth considering, and we could proceed like this:
Using Selenium in Google Colab to get past the Cloudflare blocking (which you mentioned, eternal_white) and scrape the required data is a feasible approach. Here are some thoughts on a step-by-step procedure, and how to set it up with Jacob Padilla's google-colab-selenium package:
Install google-colab-selenium:
You can install the google-colab-selenium package using pip:
!pip install google-colab-selenium
We also need to install Selenium:
!pip install selenium
Import Necessary Libraries:
Import the required libraries in your Colab notebook:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time
Then we need to set up the Selenium WebDriver and configure Chrome with the necessary options:
# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
Here we define the scraping function, i.e. a function that scrapes bank data using Selenium:
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
Then we can go and scrape the data using the function defined above:
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))
# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
And here everything in one go:
# first of all we need to install all the required packages - e.g. the packages for Jacob's Selenium approach:
!pip install google-colab-selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222') # Add this option
# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)
# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))
# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
# Close the WebDriver
driver.quit()
Here is what I got on Google Colab:
TypeError Traceback (most recent call last)
<ipython-input-4-76a7abf92dba> in <cell line: 21>()
19
20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
22
23 # Define function to scrape bank data using Selenium
TypeError: WebDriver.__init__() got multiple values for argument 'options'
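This TypeError is a Selenium 4 change: `WebDriver.__init__` no longer accepts the driver path as the first positional argument, so `'chromedriver'` and `options=...` both end up bound to `options`. A sketch of the fix, passing keyword arguments only and wrapping the driver path (here assumed to be `/usr/bin/chromedriver`, as in the copy step above) in a `Service` object; this can only run where Chrome and chromedriver are installed:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4 style: driver path via Service, everything keyword-only.
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)
```

If chromedriver is already on the PATH, `webdriver.Chrome(options=chrome_options)` with no `service` argument should also work.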
Update: by the way, if we want to collect the data for all countries, we can proceed like this:
"http://thebanks.eu/search?bank=&country=Albania"
"http://thebanks.eu/search?bank=&country=Andorra"
"http://thebanks.eu/search?bank=&country=Anguilla"
"http://thebanks.eu/search?bank=&country=Austria"
"http://thebanks.eu/search?bank=&country=Belgium"
"http://thebanks.eu/search?bank=&country=Bermuda"
"http://thebanks.eu/search?bank=&country=Bosnia and Herzegovina"
"http://thebanks.eu/search?bank=&country=British Virgin Islands"
"http://thebanks.eu/search?bank=&country=Bulgaria"
"http://thebanks.eu/search?bank=&country=Cayman Islands"
"http://thebanks.eu/search?bank=&country=Croatia"
"http://thebanks.eu/search?bank=&country=Curacao"
"http://thebanks.eu/search?bank=&country=Cyprus"
"http://thebanks.eu/search?bank=&country=Czech Republic"
"http://thebanks.eu/search?bank=&country=Denmark"
"http://thebanks.eu/search?bank=&country=Estonia"
"http://thebanks.eu/search?bank=&country=Finland"
"http://thebanks.eu/search?bank=&country=France"
"http://thebanks.eu/search?bank=&country=Georgia"
"http://thebanks.eu/search?bank=&country=Germany"
"http://thebanks.eu/search?bank=&country=Gibraltar"
"http://thebanks.eu/search?bank=&country=Greece"
"http://thebanks.eu/search?bank=&country=Guernsey"
"http://thebanks.eu/search?bank=&country=Hungary"
"http://thebanks.eu/search?bank=&country=Iceland"
"http://thebanks.eu/search?bank=&country=Ireland"
"http://thebanks.eu/search?bank=&country=Isle of Man"
"http://thebanks.eu/search?bank=&country=Italy"
"http://thebanks.eu/search?bank=&country=Jersey"
"http://thebanks.eu/search?bank=&country=Latvia"
"http://thebanks.eu/search?bank=&country=Liechtenstein"
"http://thebanks.eu/search?bank=&country=Lithuania"
"http://thebanks.eu/search?bank=&country=Luxembourg"
"http://thebanks.eu/search?bank=&country=Macedonia"
"http://thebanks.eu/search?bank=&country=Malta"
"http://thebanks.eu/search?bank=&country=Monaco"
"http://thebanks.eu/search?bank=&country=Montenegro"
"http://thebanks.eu/search?bank=&country=Netherlands"
"http://thebanks.eu/search?bank=&country=Norway"
"http://thebanks.eu/search?bank=&country=Poland"
"http://thebanks.eu/search?bank=&country=Portugal"
"http://thebanks.eu/search?bank=&country=Romania"
"http://thebanks.eu/search?bank=&country=San Marino"
"http://thebanks.eu/search?bank=&country=Serbia"
"http://thebanks.eu/search?bank=&country=Slovakia"
"http://thebanks.eu/search?bank=&country=Slovenia"
"http://thebanks.eu/search?bank=&country=Spain"
"http://thebanks.eu/search?bank=&country= Sweden"
"http://thebanks.eu/search?bank=&country=Switzerland"
"http://thebanks.eu/search?bank=&country=Turkey"
"http://thebanks.eu/search?bank=&country=Turks and Caicos Islands"
"http://thebanks.eu/search?bank=&country=Ukraine"
"http://thebanks.eu/search?bank=&country=United Kingdom"
1 Answer
The website is protected by Cloudflare, so it is best to use a proxy to bypass this protection.
import requests
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd
from pdb import set_trace
from urllib.parse import urlencode
import json
# Get your own api_key from scrapeops or some other proxy vendor
API_KEY = "api_key"
def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url
# Function to scrape bank data from my URL
def scrape_bank_data(url):
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # the web site link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
        # the email is available inside an 'a' tag but comes back as an email-protection
        # url instead of the address, hence we take it from a json script instead
        if (contact_str and contact_str.count("email") > 0):
            json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
            data_dict = json.loads(json_str)
            contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())
    return ", ".join(contact_details)
# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# subsequently we print the DataFrame
print(df)
Output:
Bank Name Country Website Contacts
0 Alpha Bank - Albania S.A. Albania https://thebanks.eu/banks/19331 Street of Kavaja, G - KAM Business Center, 2 f...
1 American Bank of Investments S.A. Albania https://thebanks.eu/banks/19332 Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2 Bank of Albania Albania https://thebanks.eu/banks/19343 Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3 Banka Kombetare Tregstare SH.A. Albania https://thebanks.eu/banks/19336 Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4 Credins Bank S.A. Albania https://thebanks.eu/banks/19333 Municipal Borough no. 5, street "Vaso Pasha", ...
5 First Investment Bank, Albania S.A. Albania https://thebanks.eu/banks/19334 Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6 Intesa Sanpaolo Bank Albania S.A. Albania https://thebanks.eu/banks/19335 Street “Ismail Qemali”, No. 27, Tirana, Albani...
7 OTP Bank Albania S.A Albania https://thebanks.eu/banks/19337 Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8 Procredit Bank S.A. Albania https://thebanks.eu/banks/19338 Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9 Raiffeisen Bank S.A. Albania https://thebanks.eu/banks/19339 Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10 Tirana Bank S.A. Albania https://thebanks.eu/banks/19340 Street, Tirana, Albania, 2269 616, 2233 417, h...
11 Union Bank S.A. Albania https://thebanks.eu/banks/19341 Blv. "Zogu I", 13 floor building, in front of ...
12 United Bank of Albania S.A. Albania https://thebanks.eu/banks/19342 Municipal Borough nr. 7, street, 1023, Tirana,...
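The question originally asked for the ISO two-letter country code rather than the country name. A minimal hand-rolled mapping (a subset only; the remaining countries would follow the same pattern, or a package such as pycountry could do it more completely) can be applied to the scraped column:

```python
# Minimal name -> ISO 3166-1 alpha-2 mapping (subset; extend as needed).
ISO2 = {
    "Albania": "AL",
    "Andorra": "AD",
    "Belgium": "BE",
    "Switzerland": "CH",
    "United Kingdom": "GB",
}

def to_iso2(country_name):
    # returns "" for countries not yet in the mapping
    return ISO2.get(country_name, "")

print(to_iso2("Albania"))
```

With pandas this could be applied in one step, e.g. `df["Country Code"] = df["Country"].map(ISO2)`.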
If you want to use only Selenium: a headless browser and undetected_chrome are of no use here; both get blocked by Cloudflare. It does work if you run it with a local browser on your own machine.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import json
from lxml import etree
# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222') # Add this option
chrome_options.page_load_strategy = 'eager'
# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    driver.quit()
    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # the web site link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
        # the email is available inside an 'a' tag but comes back as an email-protection
        # url instead of the address, hence we take it from a json script instead
        if (contact_str and contact_str.count("email") > 0):
            json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
            data_dict = json.loads(json_str)
            contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())
    return ", ".join(contact_details)
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    # Create a new instance of the Chrome driver
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    # Close the WebDriver
    driver.quit()
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data_with_selenium(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
    time.sleep(1)
# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
Output:
Bank Name Country Website Contacts
0 Alpha Bank - Albania S.A. Albania https://thebanks.eu/banks/19331 Street of Kavaja, G - KAM Business Center, 2 f...
1 American Bank of Investments S.A. Albania https://thebanks.eu/banks/19332 Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2 Bank of Albania Albania https://thebanks.eu/banks/19343 Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3 Banka Kombetare Tregstare SH.A. Albania https://thebanks.eu/banks/19336 Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4 Credins Bank S.A. Albania https://thebanks.eu/banks/19333 Municipal Borough no. 5, street "Vaso Pasha", ...
5 First Investment Bank, Albania S.A. Albania https://thebanks.eu/banks/19334 Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6 Intesa Sanpaolo Bank Albania S.A. Albania https://thebanks.eu/banks/19335 Street “Ismail Qemali”, No. 27, Tirana, Albani...
7 OTP Bank Albania S.A Albania https://thebanks.eu/banks/19337 Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8 Procredit Bank S.A. Albania https://thebanks.eu/banks/19338 Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9 Raiffeisen Bank S.A. Albania https://thebanks.eu/banks/19339 Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10 Tirana Bank S.A. Albania https://thebanks.eu/banks/19340 Street, Tirana, Albania, 2269 616, 2233 417, h...
11 Union Bank S.A. Albania https://thebanks.eu/banks/19341 Blv. "Zogu I", 13 floor building, in front of ...
12 United Bank of Albania S.A. Albania https://thebanks.eu/banks/19342 Municipal Borough nr. 7, street, 1023, Tirana,...
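Finally, the question asked for exactly one record per bank and country. Once the DataFrame is assembled, duplicates can be dropped and the result written to CSV; a sketch with made-up sample rows (the filename is arbitrary):

```python
import pandas as pd

# Made-up sample rows, including a deliberate duplicate:
bank_data = [
    {"Bank Name": "Tirana Bank S.A.", "Country": "Albania",
     "Website": "https://thebanks.eu/banks/19340"},
    {"Bank Name": "Tirana Bank S.A.", "Country": "Albania",
     "Website": "https://thebanks.eu/banks/19340"},
]
df = pd.DataFrame(bank_data)

# One record per bank per country, as asked for in the question.
df = df.drop_duplicates(subset=["Bank Name", "Country"]).reset_index(drop=True)
print(len(df))
df.to_csv("banks.csv", index=False)
```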