BeautifulSoup4 and Pandas return empty DataFrame columns: Update: now using Selenium on Google Colab
I am looking for a public list of banks worldwide.
I don't need branches and full addresses, just the bank's name and website. The data format can be XML, CSV, etc., with the following fields: bank name, country name or country code (ISO two-letter), website; optionally also the city of the bank's headquarters. One record per bank per country is enough. By the way: small banks in particular are also interesting.
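The target layout could be sketched as a tiny CSV writer; the field names and the sample record below are my own assumptions, not taken from the site:

```python
import csv
import io

# Hypothetical target schema: one record per bank per country.
FIELDS = ["bank_name", "country_code", "website", "hq_city"]

def write_banks_csv(records, fh):
    """Write bank records to CSV; hq_city is optional and may stay empty."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)

# Example with a made-up sample record:
buf = io.StringIO()
write_banks_csv([{"bank_name": "Bank of Albania", "country_code": "AL",
                  "website": "https://www.bankofalbania.org"}], buf)
print(buf.getvalue())
```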
I found a very comprehensive page: it lists 9,000 European banks.
Browse from A to Z:
**A**
https://thebanks.eu/search?bank=&country=Albania
https://thebanks.eu/search?bank=&country=Andorra
https://thebanks.eu/search?bank=&country=Anguilla
**B**
https://thebanks.eu/search?bank=&country=Belgium
**U**
https://thebanks.eu/search?bank=&country=Ukraine
https://thebanks.eu/search?bank=&country=United+Kingdom
See the detail page: https://thebanks.eu/banks/9563
This is the data I need:
Contact: Mitteldorfstrasse 48, 9524, Zuzwil SG, Switzerland
071 944 15 51, 071 944 27 52
https://www.bankbiz.ch/
My approach uses bs4, requests and pandas.
By the way: perhaps we could simply count from zero up to 100,000 in order to fetch all the banks stored in the database:
See the detail page: https://thebanks.eu/banks/9563
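A sketch of that id enumeration; the URL pattern is an assumption based on the single detail page above, and actually downloading each page is deliberately left out (it would also need the Cloudflare handling discussed further down):

```python
# Sketch: enumerate numeric detail-page ids; pattern assumed from
# https://thebanks.eu/banks/9563. No requests are made here on purpose.
BASE = "https://thebanks.eu/banks/{}"

def bank_detail_urls(start=0, stop=100_000):
    """Yield candidate detail-page URLs for ids in [start, stop)."""
    for bank_id in range(start, stop):
        yield BASE.format(bank_id)

sample_urls = list(bank_detail_urls(9563, 9566))
print(sample_urls)
```

Note that most ids in such a range will likely be unused, so a real crawl should treat 404 responses as normal and pause between requests.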
I ran this on Colab:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Function to scrape bank data from my URL
def scrape_bank_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # here we try to find bank name, country, and website
    bank_name = soup.find("h1", class_="entry-title").text.strip()
    country = soup.find("span", class_="country-name").text.strip()
    website = soup.find("a", class_="site-url").text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    bank_links = soup.find_all("div", class_="search-bank")
    for bank_link in bank_links:
        bank_url = "https://thebanks.eu" + bank_link.find("a").get("href")
        bank_info = scrape_bank_data(bank_url)
        bank_data.append(bank_info)
# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# subsequently we print the DataFrame
print(df)
Here is what it returned:
Empty DataFrame
Columns: []
Index: []
Hmm, I think something is wrong with the scraping. I tried different approaches and checked the elements on the page again and again to make sure I was extracting the right information.
I should also print some additional debugging information to help diagnose the issue.
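Such a diagnostic could be a small stdlib-only helper; the Cloudflare marker strings are assumptions, and the class names are the ones the script above relies on:

```python
def diagnose_response(status_code, html):
    """Collect hints about why a scrape might have come back empty."""
    problems = []
    if status_code != 200:
        problems.append(f"HTTP status {status_code} instead of 200")
    lowered = html.lower()
    # Typical (assumed) markers of a Cloudflare challenge page:
    if "cloudflare" in lowered or "just a moment" in lowered:
        problems.append("response looks like a Cloudflare challenge page")
    # The CSS classes the scraper relies on:
    for marker in ("search-bank", "entry-title"):
        if marker not in html:
            problems.append(f"expected class '{marker}' not found in HTML")
    return problems

# Example with a fake challenge page:
print(diagnose_response(403, "<title>Just a moment...</title>"))
```

Calling `diagnose_response(response.status_code, response.text)` right after each `requests.get` would show immediately whether the page content ever contained the expected elements.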
Update: Good evening, dear @Asish M. and @eternal_white, and thanks for your comments and the ideas you shared; plenty to think about. Regarding Selenium, I think it is a good idea to run it on Google Colab, an approach I picked up from Jacob Padilla.
@Jacob / @user:21216449 :: see Jacob's page: https://github.com/jpjacobpadilla, and Selenium for Google Colab: https://github.com/jpjacobpadilla/Google-Colab-Selenium, with these default options:
The google-colab-selenium package is preconfigured with a set of default options optimized for Google Colab environments. These defaults include:
• --headless: Runs Chrome in headless mode (without a GUI).
• --no-sandbox: Disables the Chrome sandboxing feature, necessary in the Colab environment.
• --disable-dev-shm-usage: Prevents issues with limited shared memory in Docker containers.
• --lang=en: Sets the language to English.
I think this approach is worth considering, and we could proceed like this:
Using Selenium in Google Colab to get past the Cloudflare blocking (which you mentioned, eternal_white) and scrape the required data is a feasible approach. Here are some thoughts on a step-by-step procedure, and how to set it up with Jacob Padilla's google-colab-selenium package:
Install google-colab-selenium:
You can install the google-colab-selenium package using pip:
!pip install google-colab-selenium
We also need to install Selenium:
!pip install selenium
Import Necessary Libraries:
Import the required libraries in your Colab notebook:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from google.colab import output
import time
Then we need to set up the Selenium WebDriver and configure Chrome with the necessary options:
# Set up options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=options)
Here we define the scraping function, i.e. a function that scrapes bank data using Selenium:
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # first of all - we let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
Then we can go and scrape the data using the function defined above:
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # hmm - we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# now we can iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))
# and now we can convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
And here everything in one go:
# first of all we need to install all the required packages - e.g. the packages for Jacob's Selenium approach:
!pip install google-colab-selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
# and afterwards we need to import all the necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time
# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222') # Add this option
# Create a new instance of the Chrome driver
driver = webdriver.Chrome('chromedriver', options=chrome_options)
# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    bank_name = driver.find_element(By.CLASS_NAME, 'entry-title').text.strip()
    country = driver.find_element(By.CLASS_NAME, 'country-name').text.strip()
    website = driver.find_element(By.CLASS_NAME, 'site-url').text.strip()
    print(f"Scraped: {bank_name}, {country}, {website}")
    return {"Bank Name": bank_name, "Country": country, "Website": website}
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search",
    "https://thebanks.eu/search?bank=&country=Albania",
    "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    bank_data.append(scrape_bank_data_with_selenium(url))
# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
# Close the WebDriver
driver.quit()
Here is what I got on Google Colab:
TypeError Traceback (most recent call last)
<ipython-input-4-76a7abf92dba> in <cell line: 21>()
19
20 # Create a new instance of the Chrome driver
---> 21 driver = webdriver.Chrome('chromedriver', options=chrome_options)
22
23 # Define function to scrape bank data using Selenium
TypeError: WebDriver.__init__() got multiple values for argument 'options'
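This TypeError is a Selenium 4 change: `WebDriver.__init__` no longer accepts the driver path as the first positional argument, so `'chromedriver'` and `options=...` both end up bound to `options`. A sketch of the fix, passing keyword arguments only and wrapping the driver path (here assumed to be `/usr/bin/chromedriver`, as in the copy step above) in a `Service` object; this can only run where Chrome and chromedriver are installed:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4 style: driver path via Service, everything keyword-only.
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)
```

If chromedriver is already on the PATH, `webdriver.Chrome(options=chrome_options)` with no `service` argument should also work.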
Update: by the way, if we want to collect the data for all countries, we can proceed like this:
"http://thebanks.eu/search?bank=&country=Albania"
"http://thebanks.eu/search?bank=&country=Andorra"
"http://thebanks.eu/search?bank=&country=Anguilla"
"http://thebanks.eu/search?bank=&country=Austria"
"http://thebanks.eu/search?bank=&country=Belgium"
"http://thebanks.eu/search?bank=&country=Bermuda"
"http://thebanks.eu/search?bank=&country=Bosnia and Herzegovina"
"http://thebanks.eu/search?bank=&country=British Virgin Islands"
"http://thebanks.eu/search?bank=&country=Bulgaria"
"http://thebanks.eu/search?bank=&country=Cayman Islands"
"http://thebanks.eu/search?bank=&country=Croatia"
"http://thebanks.eu/search?bank=&country=Curacao"
"http://thebanks.eu/search?bank=&country=Cyprus"
"http://thebanks.eu/search?bank=&country=Czech Republic"
"http://thebanks.eu/search?bank=&country=Denmark"
"http://thebanks.eu/search?bank=&country=Estonia"
"http://thebanks.eu/search?bank=&country=Finland"
"http://thebanks.eu/search?bank=&country=France"
"http://thebanks.eu/search?bank=&country=Georgia"
"http://thebanks.eu/search?bank=&country=Germany"
"http://thebanks.eu/search?bank=&country=Gibraltar"
"http://thebanks.eu/search?bank=&country=Greece"
"http://thebanks.eu/search?bank=&country=Guernsey"
"http://thebanks.eu/search?bank=&country=Hungary"
"http://thebanks.eu/search?bank=&country=Iceland"
"http://thebanks.eu/search?bank=&country=Ireland"
"http://thebanks.eu/search?bank=&country=Isle of Man"
"http://thebanks.eu/search?bank=&country=Italy"
"http://thebanks.eu/search?bank=&country=Jersey"
"http://thebanks.eu/search?bank=&country=Latvia"
"http://thebanks.eu/search?bank=&country=Liechtenstein"
"http://thebanks.eu/search?bank=&country=Lithuania"
"http://thebanks.eu/search?bank=&country=Luxembourg"
"http://thebanks.eu/search?bank=&country=Macedonia"
"http://thebanks.eu/search?bank=&country=Malta"
"http://thebanks.eu/search?bank=&country=Monaco"
"http://thebanks.eu/search?bank=&country=Montenegro"
"http://thebanks.eu/search?bank=&country=Netherlands"
"http://thebanks.eu/search?bank=&country=Norway"
"http://thebanks.eu/search?bank=&country=Poland"
"http://thebanks.eu/search?bank=&country=Portugal"
"http://thebanks.eu/search?bank=&country=Romania"
"http://thebanks.eu/search?bank=&country=San Marino"
"http://thebanks.eu/search?bank=&country=Serbia"
"http://thebanks.eu/search?bank=&country=Slovakia"
"http://thebanks.eu/search?bank=&country=Slovenia"
"http://thebanks.eu/search?bank=&country=Spain"
"http://thebanks.eu/search?bank=&country= Sweden"
"http://thebanks.eu/search?bank=&country=Switzerland"
"http://thebanks.eu/search?bank=&country=Turkey"
"http://thebanks.eu/search?bank=&country=Turks and Caicos Islands"
"http://thebanks.eu/search?bank=&country=Ukraine"
"http://thebanks.eu/search?bank=&country=United Kingdom"
1 Answer
The website is protected by Cloudflare, so it is best to use a proxy to bypass this protection.
import requests
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd
from pdb import set_trace
from urllib.parse import urlencode
import json
# Get your own api_key from scrapeops or some other proxy vendor
API_KEY = "api_key"
def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url
# Function to scrape bank data from my URL
def scrape_bank_data(url):
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # the web site link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
        # the email is available inside an 'a' tag but comes back as an email-protection
        # url instead of the address, hence we take it from a json script instead
        if (contact_str and contact_str.count("email") > 0):
            json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
            data_dict = json.loads(json_str)
            contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())
    return ", ".join(contact_details)
# the list of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # we could add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    proxy_url = get_scrapeops_url(url)
    response = requests.get(proxy_url)
    soup = BeautifulSoup(response.content, "html.parser")
    dom = etree.HTML(str(soup))
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
# and now we convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# subsequently we print the DataFrame
print(df)
Output:
Bank Name Country Website Contacts
0 Alpha Bank - Albania S.A. Albania https://thebanks.eu/banks/19331 Street of Kavaja, G - KAM Business Center, 2 f...
1 American Bank of Investments S.A. Albania https://thebanks.eu/banks/19332 Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2 Bank of Albania Albania https://thebanks.eu/banks/19343 Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3 Banka Kombetare Tregstare SH.A. Albania https://thebanks.eu/banks/19336 Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4 Credins Bank S.A. Albania https://thebanks.eu/banks/19333 Municipal Borough no. 5, street "Vaso Pasha", ...
5 First Investment Bank, Albania S.A. Albania https://thebanks.eu/banks/19334 Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6 Intesa Sanpaolo Bank Albania S.A. Albania https://thebanks.eu/banks/19335 Street “Ismail Qemali”, No. 27, Tirana, Albani...
7 OTP Bank Albania S.A Albania https://thebanks.eu/banks/19337 Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8 Procredit Bank S.A. Albania https://thebanks.eu/banks/19338 Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9 Raiffeisen Bank S.A. Albania https://thebanks.eu/banks/19339 Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10 Tirana Bank S.A. Albania https://thebanks.eu/banks/19340 Street, Tirana, Albania, 2269 616, 2233 417, h...
11 Union Bank S.A. Albania https://thebanks.eu/banks/19341 Blv. "Zogu I", 13 floor building, in front of ...
12 United Bank of Albania S.A. Albania https://thebanks.eu/banks/19342 Municipal Borough nr. 7, street, 1023, Tirana,...
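The question originally asked for the ISO two-letter country code rather than the country name. A minimal hand-rolled mapping (a subset only; the remaining countries would follow the same pattern, or a package such as pycountry could do it more completely) can be applied to the scraped column:

```python
# Minimal name -> ISO 3166-1 alpha-2 mapping (subset; extend as needed).
ISO2 = {
    "Albania": "AL",
    "Andorra": "AD",
    "Belgium": "BE",
    "Switzerland": "CH",
    "United Kingdom": "GB",
}

def to_iso2(country_name):
    # returns "" for countries not yet in the mapping
    return ISO2.get(country_name, "")

print(to_iso2("Albania"))
```

With pandas this could be applied in one step, e.g. `df["Country Code"] = df["Country"].map(ISO2)`.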
If you want to use only Selenium: a headless browser and undetected_chrome are of no use here; both get blocked by Cloudflare. It does work if you run it with a local browser on your own machine.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import json
from lxml import etree
# Set up options for Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--remote-debugging-port=9222') # Add this option
chrome_options.page_load_strategy = 'eager'
# Define function to scrape bank data using Selenium
def scrape_bank_data_with_selenium(url):
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    driver.quit()
    # here we try to extract contact details
    contact_details = []
    contacts_nodes = dom.xpath("//img[contains(@src,'/contacts/')]/following-sibling::span")
    for contact in contacts_nodes:
        contact_str = contact.text
        # the web site link is inside an 'a' tag, hence these conditions
        if not contact_str:
            contact_str = contact.xpath(".//a/@href")[0]
        # the email is available inside an 'a' tag but comes back as an email-protection
        # url instead of the address, hence we take it from a json script instead
        if (contact_str and contact_str.count("email") > 0):
            json_str = dom.xpath("//script[contains(@type,'application') and contains(text(),'BankOrCreditUnion')]")[0].text
            data_dict = json.loads(json_str)
            contact_str = data_dict["email"]
        contact_details.append(contact_str.strip())
    return ", ".join(contact_details)
# List of URLs for scraping bank data by country
urls = [
    "https://thebanks.eu/search?bank=&country=Albania",
    # "https://thebanks.eu/search?bank=&country=Andorra",
    # Add more URLs for other countries as needed
]
# List to store bank data
bank_data = []
# Iterate through the URLs and scrape bank data
for url in urls:
    # Create a new instance of the Chrome driver
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    time.sleep(5)  # Let the page load completely
    html = driver.page_source
    dom = etree.HTML(str(html))
    # Close the WebDriver
    driver.quit()
    bank_details = dom.xpath("//div[contains(@class,'products')]/div[contains(@class,'product')]")
    for bank in bank_details:
        bank_info = {}
        bank_url = bank.xpath(".//div[contains(@class,'title')]/a/@href")[0].strip()
        bank_name = bank.xpath(".//div[contains(@class,'title')]/a")[0].text.strip()
        country = bank.xpath(".//span[contains(text(),'Country')]/following::div/text()")[0].strip()
        bank_info = {"Bank Name": bank_name, "Country": country, "Website": bank_url}
        contacts = scrape_bank_data_with_selenium(bank_url)
        bank_info["Contacts"] = contacts
        print(bank_info)
        bank_data.append(bank_info)
    time.sleep(1)
# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(bank_data)
# Print the DataFrame
print(df)
Output:
Bank Name Country Website Contacts
0 Alpha Bank - Albania S.A. Albania https://thebanks.eu/banks/19331 Street of Kavaja, G - KAM Business Center, 2 f...
1 American Bank of Investments S.A. Albania https://thebanks.eu/banks/19332 Street of Kavaja, Nr. 59, Tirana Tower, Tirana...
2 Bank of Albania Albania https://thebanks.eu/banks/19343 Sheshi “Skënderbej“, No. 1, Tirana, Albania, +...
3 Banka Kombetare Tregstare SH.A. Albania https://thebanks.eu/banks/19336 Rruga e Vilave, Lundër 1, 1045, Tirana, Albani...
4 Credins Bank S.A. Albania https://thebanks.eu/banks/19333 Municipal Borough no. 5, street "Vaso Pasha", ...
5 First Investment Bank, Albania S.A. Albania https://thebanks.eu/banks/19334 Blv., Tirana, Albania, +355 4 2276 702, +355 4...
6 Intesa Sanpaolo Bank Albania S.A. Albania https://thebanks.eu/banks/19335 Street “Ismail Qemali”, No. 27, Tirana, Albani...
7 OTP Bank Albania S.A Albania https://thebanks.eu/banks/19337 Boulevard "Dëshmorët e Kombit", Twin Towers, B...
8 Procredit Bank S.A. Albania https://thebanks.eu/banks/19338 Street "Dritan Hoxha", Nd. 92, H. 15, Municipa...
9 Raiffeisen Bank S.A. Albania https://thebanks.eu/banks/19339 Blv., Tirana, Albania, +355 4 2274 910, +355 4...
10 Tirana Bank S.A. Albania https://thebanks.eu/banks/19340 Street, Tirana, Albania, 2269 616, 2233 417, h...
11 Union Bank S.A. Albania https://thebanks.eu/banks/19341 Blv. "Zogu I", 13 floor building, in front of ...
12 United Bank of Albania S.A. Albania https://thebanks.eu/banks/19342 Municipal Borough nr. 7, street, 1023, Tirana,...
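Finally, the question asked for exactly one record per bank and country. Once the DataFrame is assembled, duplicates can be dropped and the result written to CSV; a sketch with made-up sample rows (the filename is arbitrary):

```python
import pandas as pd

# Made-up sample rows, including a deliberate duplicate:
bank_data = [
    {"Bank Name": "Tirana Bank S.A.", "Country": "Albania",
     "Website": "https://thebanks.eu/banks/19340"},
    {"Bank Name": "Tirana Bank S.A.", "Country": "Albania",
     "Website": "https://thebanks.eu/banks/19340"},
]
df = pd.DataFrame(bank_data)

# One record per bank per country, as asked for in the question.
df = df.drop_duplicates(subset=["Bank Name", "Country"]).reset_index(drop=True)
print(len(df))
df.to_csv("banks.csv", index=False)
```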