Scraping with Selenium and BS4

Asked 2025-04-14 18:09

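As an exercise, I'm trying to scrape a table from this site: https://stats.paj.gr.jp/en/pub/current_en_n2.html

My problem is that I can't print the whole table. Right now only a single cell of the table is returned. I'd be very grateful for any guidance.

My code is below: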

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC 
from bs4 import BeautifulSoup
import pandas as pd

path = "C:\\Users\\Jun Hui\\Desktop\\Quant Trading\\chromedriver-win32\\chromedriver.exe"

service = webdriver.chrome.service.Service(path)
service.start()

url = "https://stats.paj.gr.jp/en/pub/current_en_n2.html"
driver = webdriver.Chrome(service=service)
driver.get(url)

link_element1 = driver.find_element(By.XPATH,"//a[@href='index.html']")
link_element1.click()
link_element2 = driver.find_element(By.XPATH,"//a[@href='./current_en_n2.html']")
link_element2.click()

page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    for cell in cells:
        print(cell.text.strip(), end="\t")
    print()

2 Answers

0

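Sometimes the table has not finished loading when you read the page source. In that case, use an explicit wait so that this happens less often.

For example, first wait until the table element is visible (meaning the table has loaded), then search inside that element for tr tags to get a list of table row elements. After that, just loop over the list and extract the information you need.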

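from selenium.webdriver.support.ui import WebDriverWait

# driver, By and EC come from the question's code; "table-class-name" is a placeholder
wait = WebDriverWait(driver, 10)
table = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "table-class-name")))

table_rows = table.find_elements(By.TAG_NAME, "tr")

for row in table_rows:
    # loop through the row's cells to extract data
    print([cell.text for cell in row.find_elements(By.TAG_NAME, "td")])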

This is only a rough outline to help you understand how to implement the process.
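
Combining the outline above with the BeautifulSoup loop from the question, a minimal sketch might look like the following. It assumes Selenium 4 with a Chrome driver that Selenium can locate on its own, and it waits for any table element because the page's real class names are not given in this thread. The helper names rows_from_html and fetch_rendered_html are hypothetical. Unlike the question's loop, it also collects th cells, and it skips layout rows that merely wrap another table.

```python
from bs4 import BeautifulSoup

URL = "https://stats.paj.gr.jp/en/pub/current_en_n2.html"


def rows_from_html(html):
    """Return each innermost table row as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # A row that contains another table is a layout wrapper; skip it.
        if tr.find("table"):
            continue
        cells = tr.find_all(["th", "td"])
        if cells:
            rows.append([c.get_text(" ", strip=True) for c in cells])
    return rows


def fetch_rendered_html(url, timeout=10):
    """Open the page in Chrome and return its HTML once a table is visible."""
    # Imported here so rows_from_html works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: waiting for the first <table> is enough for this page.
        WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling rows_from_html(fetch_rendered_html(URL)) and printing "\t".join(row) for each row would then reproduce the question's tab-separated output. Whether a wait alone fixes the single-cell result depends on how the page loads, so treat this as a starting point.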

1

Here is an example that gets all the tables from a URL and turns them into pandas DataFrames:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://stats.paj.gr.jp/en/pub/current_en_n2.html"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0",
    "Referer": "https://stats.paj.gr.jp/en/pub/index.html",
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

for table in soup.select("table:not(:has(table))"):
    for td in table.tr.select("td"):
        td.name = "th"
    df = pd.read_html(StringIO(str(table)))[0]
    if not df.empty:
        u = table.find_previous("u").text
        print(u)
        print()
        print(df)
    print("-" * 80)

Output:

--------------------------------------------------------------------------------                                                                                                                                   
1.Refinery Operations                                                                                    
                                                    
             Unnamed: 0                     Unnamed: 1 Current Week 18/Feb/2024-24/Feb Last Week 11/Feb/2024-17/Feb  Change from Last Week  Unnamed: 5  Unnamed: 6
0       Crude Input(KL)                Crude Input(KL)                         2765740                      2655588               110152.0         NaN         NaN
1  Topper Unit Capacity           Topper Unit Capacity                             NaN                          NaN                    NaN         NaN         NaN
2                   NaN  Weekly Average Capacity(BPSD)                         2839300                      2731443               107857.0         NaN         NaN
3                   NaN        Util. Rate against BPSD                           87.5%                        87.4%                    NaN         NaN         NaN
4                   NaN        Designed Capacity(BPCD)                         3230400                      3230400                    0.0         NaN         NaN
5                   NaN        Util. Rate against BPCD                           76.9%                        73.9%                    NaN         NaN         NaN
6                   NaN                            NaN                             NaN                          NaN                    NaN         NaN         NaN
7                   NaN                            NaN                             NaN                          NaN                    NaN         NaN         NaN
--------------------------------------------------------------------------------                                                                                                                                   
2.Products Stocks(kl)                                                                                                                                                                                            
                                                                                                                                                                                                                   
         Unnamed: 0      Unnamed: 1  Current Week 18/Feb/2024-24/Feb  Last Week 11/Feb/2024-17/Feb  Change from Last Week
0          Gasoline             NaN                          1803404                       1802295                   1109
1           Naphtha             NaN                          1241808                       1202555                  39253
2               Jet             NaN                           750427                        755151                  -4724
3          Kerosene             NaN                          1672932                       1602863                  70069
4   Gas Oil(Diesel)             NaN                          1576961                       1552293                  24668
5               LSA  Sul under 0.1%                           335027                        298286                  36741
6               HSA   Sul over 0.1%                           421710                        409817                  11893
7               AFO             NaN                           756737                        708103                  48634
8               LSC  Sul under 0.5%                           676670                        688519                 -11849
9               HSC   Sul over 0.5%                          1167900                       1169686                  -1786
10              CFO             NaN                          1844570                       1858205                 -13635
11            Total             NaN                          9646839                       9481465                 165374
--------------------------------------------------------------------------------                                                                                                                                   

...
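
Two steps in this answer's code are easy to miss: the selector table:not(:has(table)) keeps only tables that contain no other table, and renaming the first row's td tags to th marks that row as a header, which pd.read_html should then use for column names. The sketch below shows both steps on a small made-up HTML snippet (the ids and values are invented for illustration and are not taken from the real page).

```python
from bs4 import BeautifulSoup

# Made-up markup: an outer layout table wrapping one data table.
html = """
<table id="layout"><tr><td>
  <table id="data">
    <tr><td>Product</td><td>Stock</td></tr>
    <tr><td>Gasoline</td><td>100</td></tr>
  </table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# :has(table) matches tables that contain another table,
# so :not(...) keeps only the innermost ones.
innermost = soup.select("table:not(:has(table))")
print([t["id"] for t in innermost])  # ['data']

# Turn the first row's cells into header cells, as the answer does.
data = innermost[0]
for td in data.tr.select("td"):
    td.name = "th"

print([th.get_text() for th in data.tr.find_all("th")])  # ['Product', 'Stock']
print(len(data.find_all("td")))  # 2 (only the body row's cells remain td)
```

Selecting only innermost tables avoids parsing a layout wrapper together with the data tables nested inside it.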
