BeautifulSoup4 and Pandas return empty DataFrame columns. Update: now with Selenium on Google Colab

-2 votes
1 answer
251 views
Asked 2025-04-14 17:27

I am looking for a public list of banks worldwide.

I don't need branches and full addresses; the banks' names and websites are enough. The data can come in XML, CSV, or a similar format, with these fields: bank name, country name or ISO two-letter country code, website; optionally also the city of the bank's headquarters. One record per bank per country is sufficient. By the way: the small banks in particular are interesting.

I found a very comprehensive page: it lists 9,000 banks across Europe:

Browse from A to Z:

https://thebanks.eu/search

**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla

**B**
https://thebanks.eu/search?bank=&country=Belgium


**U** 
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom

See a detail page: https://thebanks.eu/banks/9563

This is the data I need:

Contacts
Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51, 071 944 27 52
https://www.bankbiz.ch/

My approach uses bs4, requests, and pandas.

By the way: maybe we could simply count from zero to 100,000 in order to fetch every bank stored in the database:

See the detail page: https://thebanks.eu/banks/9563
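
A rough sketch of that enumeration idea (my assumption: the detail pages follow the /banks/<id> pattern and missing IDs answer with a non-200 status; both unverified), which would of course take hours and still run into the Cloudflare protection discussed further down:

import requests
import time

found_ids = []
for bank_id in range(0, 100_001):
    # assumption: a missing ID returns a non-200 status code
    resp = requests.get(f"https://thebanks.eu/banks/{bank_id}", timeout=10)
    if resp.status_code == 200:
        found_ids.append(bank_id)
    time.sleep(0.5)  # be polite - 100,000 requests is a lot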

I ran this on Colab:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape bank data from my URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    #  we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")

    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)

#  and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)

Here is what came back:

Empty DataFrame
Columns: []
Index: []

Hmm, I think something is going wrong in the scraping. I tried different approaches and checked the elements on the page again and again to make sure I was extracting the right ones.

I should also print some extra debugging information to help diagnose the problem.
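
For instance, a couple of prints like these (a quick sketch reusing requests and BeautifulSoup from above) show whether we get the real page back at all or just a Cloudflare challenge:

response = requests.get("https://thebanks.eu/search?bank=&country=Albania")
print(response.status_code)                # 403 would hint at Cloudflare
print(response.headers.get("server", ""))  # often "cloudflare" for protected sites
print(response.text[:300])                 # challenge pages contain "Just a moment..."
soup = BeautifulSoup(response.content, "html.parser")
print(len(soup.find_all("div", class_="search-bank")))  # 0 means selector or blocking problem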

Update: Good evening dear @Asish M. and @eternal_white, thanks for your comments and the ideas you shared, plenty of food for thought. Regarding Selenium, I think that is a good idea: run it (Selenium) on Google Colab, something I learned from Jacob Padilla.

@Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla, and Selenium for Google Colab: https://github.com/jpjacobpadilla/Google-Colab-Selenium, with these default options:

The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
    • --headless: Runs Chrome in headless mode (without a GUI). 
    • --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment. 
    • --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers. 
    • --lang=en: Sets the language to English.

I think this approach is worth considering. We could proceed as follows:

Using Selenium in Google Colab to get past the Cloudflare blocking (which you mentioned, eternal_white) and scrape the required data looks like a viable approach. Here are some thoughts on a step-by-step setup with Jacob Padilla's google-colab-selenium package:

Install google-colab-selenium:
You can install the google-colab-selenium package using pip:


!pip install google-colab-selenium

We also need to install Selenium itself:


!pip install selenium
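
Note that, going by the project's README, the package wraps the whole driver setup, so instead of configuring webdriver.Chrome by hand one can apparently do this (a sketch based on the documentation linked above):

import google_colab_selenium as gs

driver = gs.Chrome()  # comes preconfigured with the Colab defaults listed above
driver.get("https://thebanks.eu/search")
print(driver.title)
driver.quit()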

Import Necessary Libraries:
Import the required libraries in your Colab notebook:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import pandas as pd  # needed for the DataFrame at the end
import time

Then we need to set up the Selenium WebDriver, configuring Chrome with the necessary options:

# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)

Here we define the scraping function, which uses Selenium to collect the bank data:

def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely
    
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

Then we can scrape the data with the function we just defined:

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

And, all in one go:

# first install all the required packages - e.g. those for Jacob's Selenium approach:
!pip install google-colab-selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option

# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)

# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")

    return {"Bank Name": bank_name, "Country": country, "Website": website}

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

# Close the WebDriver
driver.quit()

Here is what I got on Google Colab:

TypeError                                 Traceback (most recent call last)

<ipython-input-4-76a7abf92dba> in <cell line: 21>()
     19 
     20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
     22 
     23 # Define function to scrape bank data using Selenium

TypeError: WebDriver.__init__() got multiple values for argument 'options'
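
As far as I can tell, this TypeError comes from Selenium 4: WebDriver.__init__ no longer accepts the driver path as the first positional argument, so the positional 'chromedriver' collides with options=. Dropping the positional argument should clear this particular error (Selenium Manager resolves the driver binary on its own); the Cloudflare blocking is a separate problem:

# Selenium 4: pass options as a keyword only; no driver path needed
driver = webdriver.Chrome(options=chrome_options)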

Update: by the way, if we want to collect the data for all the countries, we can work through these:

"http://thebanks.eu/search?bank=&country=Albania"
"http://thebanks.eu/search?bank=&country=Andorra"
"http://thebanks.eu/search?bank=&country=Anguilla"
"http://thebanks.eu/search?bank=&country=Austria"
"http://thebanks.eu/search?bank=&country=Belgium"
"http://thebanks.eu/search?bank=&country=Bermuda"
"http://thebanks.eu/search?bank=&country=Bosnia and Herzegovina"
"http://thebanks.eu/search?bank=&country=British Virgin Islands"
"http://thebanks.eu/search?bank=&country=Bulgaria"
"http://thebanks.eu/search?bank=&country=Cayman Islands"
"http://thebanks.eu/search?bank=&country=Croatia"
"http://thebanks.eu/search?bank=&country=Curacao"
"http://thebanks.eu/search?bank=&country=Cyprus"
"http://thebanks.eu/search?bank=&country=Czech Republic"
"http://thebanks.eu/search?bank=&country=Denmark"
"http://thebanks.eu/search?bank=&country=Estonia"
"http://thebanks.eu/search?bank=&country=Finland"
"http://thebanks.eu/search?bank=&country=France"
"http://thebanks.eu/search?bank=&country=Georgia"
"http://thebanks.eu/search?bank=&country=Germany"
"http://thebanks.eu/search?bank=&country=Gibraltar"
"http://thebanks.eu/search?bank=&country=Greece"
"http://thebanks.eu/search?bank=&country=Guernsey"
"http://thebanks.eu/search?bank=&country=Hungary"
"http://thebanks.eu/search?bank=&country=Iceland"
"http://thebanks.eu/search?bank=&country=Ireland"
"http://thebanks.eu/search?bank=&country=Isle of Man"
"http://thebanks.eu/search?bank=&country=Italy"
"http://thebanks.eu/search?bank=&country=Jersey"
"http://thebanks.eu/search?bank=&country=Latvia"
"http://thebanks.eu/search?bank=&country=Liechtenstein"
"http://thebanks.eu/search?bank=&country=Lithuania"
"http://thebanks.eu/search?bank=&country=Luxembourg"
"http://thebanks.eu/search?bank=&country=Macedonia"
"http://thebanks.eu/search?bank=&country=Malta"
"http://thebanks.eu/search?bank=&country=Monaco"
"http://thebanks.eu/search?bank=&country=Montenegro"
"http://thebanks.eu/search?bank=&country=Netherlands"
"http://thebanks.eu/search?bank=&country=Norway"
"http://thebanks.eu/search?bank=&country=Poland"
"http://thebanks.eu/search?bank=&country=Portugal"
"http://thebanks.eu/search?bank=&country=Romania"
"http://thebanks.eu/search?bank=&country=San Marino"
"http://thebanks.eu/search?bank=&country=Serbia"
"http://thebanks.eu/search?bank=&country=Slovakia"
"http://thebanks.eu/search?bank=&country=Slovenia"
"http://thebanks.eu/search?bank=&country=Spain"
"http://thebanks.eu/search?bank=&country= Sweden"
"http://thebanks.eu/search?bank=&country=Switzerland"
"http://thebanks.eu/search?bank=&country=Turkey"
"http://thebanks.eu/search?bank=&country=Turks and Caicos Islands"
"http://thebanks.eu/search?bank=&country=Ukraine"
"http://thebanks.eu/search?bank=&country=United Kingdom"

1 Answer


This site is protected by Cloudflare, so it is best to use a proxy to get around that protection.

import requests
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd
from pdb import set_trace
from urllib.parse import urlencode
import json

# Get your own api_key from scrapeops or some other proxy vendor
API_KEY = "api_key"
def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

# Function to scrape bank data from my URL
def scrape_bank_data(url):
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))

    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # the website link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
            # the email is available inside the 'a' tag, but as an email-protection URL rather than the address, so we take it from the JSON script instead
            if (contact_str and contact_str.count("email") > 0):
                json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
                data_dict = json.loads(json_str)
                contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())

    return ", ".join(contact_details)

# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    #  we could add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")

    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)

#  and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# subsequently we print the DataFrame
print(df)

Output:

                              Bank Name  Country                          Website                                           Contacts
0             Alpha Bank - Albania S.A.  Albania  https://thebanks.eu/banks/19331  Street of Kavaja, G - KAM Business Center, 2 f...
1     American Bank of Investments S.A.  Albania  https://thebanks.eu/banks/19332  Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2                       Bank of Albania  Albania  https://thebanks.eu/banks/19343  Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3       Banka Kombetare Tregstare SH.A.  Albania  https://thebanks.eu/banks/19336  Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4                     Credins Bank S.A.  Albania  https://thebanks.eu/banks/19333  Municipal Borough no. 5, street "Vaso Pasha", ...
5   First Investment Bank, Albania S.A.  Albania  https://thebanks.eu/banks/19334  Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6     Intesa Sanpaolo Bank Albania S.A.  Albania  https://thebanks.eu/banks/19335  Street “Ismail Qemali”, No. 27, Tirana, Albani...
7                  OTP Bank Albania S.A  Albania  https://thebanks.eu/banks/19337  Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8                   Procredit Bank S.A.  Albania  https://thebanks.eu/banks/19338  Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9                  Raiffeisen Bank S.A.  Albania  https://thebanks.eu/banks/19339  Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10                     Tirana Bank S.A.  Albania  https://thebanks.eu/banks/19340  Street, Tirana, Albania, 2269 616, 2233 417, h...
11                      Union Bank S.A.  Albania  https://thebanks.eu/banks/19341  Blv. "Zogu I", 13 floor building, in front of ...
12          United Bank of Albania S.A.  Albania  https://thebanks.eu/banks/19342  Municipal Borough nr. 7, street, 1023, Tirana,...

If you want to use only Selenium: a headless browser and undetected_chrome are of no use here; both get blocked by Cloudflare. It does work if you run it with a local, non-headless browser on your own machine.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import json  # needed for the email extracted from the JSON script
from lxml import etree

# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222')  # Add this option
chrome_options.page_load_strategy = 'eager'


# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    driver.quit()

    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # the website link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
            # the email is available inside the 'a' tag, but as an email-protection URL rather than the address, so we take it from the JSON script instead
            if (contact_str and contact_str.count("email") > 0):
                json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
                data_dict = json.loads(json_str)
                contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())

    return ", ".join(contact_details)

# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]

# List to store bank data
bank_data = []

# Iterate through the URLs and scrape bank data
for url in urls:
    # Create a new instance of the Chrome driver
    driver = webdriver.Chrome(options=chrome_options)

    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    # Close the WebDriver
    driver.quit()

    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data_with_selenium(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
        time.sleep(1)

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)

# Print the DataFrame
print(df)

Output:

                              Bank Name  Country                          Website                                           Contacts
0             Alpha Bank - Albania S.A.  Albania  https://thebanks.eu/banks/19331  Street of Kavaja, G - KAM Business Center, 2 f...
1     American Bank of Investments S.A.  Albania  https://thebanks.eu/banks/19332  Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2                       Bank of Albania  Albania  https://thebanks.eu/banks/19343  Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3       Banka Kombetare Tregstare SH.A.  Albania  https://thebanks.eu/banks/19336  Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4                     Credins Bank S.A.  Albania  https://thebanks.eu/banks/19333  Municipal Borough no. 5, street "Vaso Pasha", ...
5   First Investment Bank, Albania S.A.  Albania  https://thebanks.eu/banks/19334  Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6     Intesa Sanpaolo Bank Albania S.A.  Albania  https://thebanks.eu/banks/19335  Street “Ismail Qemali”, No. 27, Tirana, Albani...
7                  OTP Bank Albania S.A  Albania  https://thebanks.eu/banks/19337  Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8                   Procredit Bank S.A.  Albania  https://thebanks.eu/banks/19338  Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9                  Raiffeisen Bank S.A.  Albania  https://thebanks.eu/banks/19339  Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10                     Tirana Bank S.A.  Albania  https://thebanks.eu/banks/19340  Street, Tirana, Albania, 2269 616, 2233 417, h...
11                      Union Bank S.A.  Albania  https://thebanks.eu/banks/19341  Blv. "Zogu I", 13 floor building, in front of ...
12          United Bank of Albania S.A.  Albania  https://thebanks.eu/banks/19342  Municipal Borough nr. 7, street, 1023, Tirana,...
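
Since you want CSV in the end, the resulting DataFrame can be written straight to a file (plain pandas, nothing site-specific):

# export the collected rows; XML works similarly via df.to_xml (pandas >= 1.3, requires lxml)
df.to_csv("banks.csv", index=False)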
