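Scraping with Selenium and BS4
I'm trying to scrape a table from this website as practice: https://stats.paj.gr.jp/en/pub/current_en_n2.html
The problem is that I can't print the full table. Right now only a single cell of the table is returned. I'd really appreciate any guidance.
My code is below: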
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd

path = "C:\\Users\\Jun Hui\\Desktop\\Quant Trading\\chromedriver-win32\\chromedriver.exe"
service = webdriver.chrome.service.Service(path)
service.start()
url = "https://stats.paj.gr.jp/en/pub/current_en_n2.html"
driver = webdriver.Chrome(service=service)
driver.get(url)
link_element1 = driver.find_element(By.XPATH, "//a[@href='index.html']")
link_element1.click()
link_element2 = driver.find_element(By.XPATH, "//a[@href='./current_en_n2.html']")
link_element2.click()
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    for cell in cells:
        print(cell.text.strip(), end="\t")
    print()
2 Answers
0
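Sometimes the table has not finished loading at the moment you read the page. In that case, use an explicit wait so this happens less often.
For example, first wait until the table element is visible (meaning the table has loaded), then look inside that element for elements with the tr tag. That gives you a list of table-row elements, which you can loop over to extract whatever you need.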
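from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)  # driver and By come from the question's code
table = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "table-class-name")))  # placeholder class name
table_rows = table.find_elements(By.TAG_NAME, "tr")
for row in table_rows:
    # loop through the row's cells to extract data
    print([cell.text for cell in row.find_elements(By.TAG_NAME, "td")])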
This is only a rough outline to help you see how the process could be implemented.
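As a rough sketch of the extraction step: once the table has loaded, the rendered HTML can be handed to BeautifulSoup, as the question already does. The helper name `rows_from_html` and the sample markup are made up for illustration; with Selenium you would pass `driver.page_source` instead of `sample_html`.

```python
from bs4 import BeautifulSoup


def rows_from_html(html):
    """Return every table row as a list of stripped cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # recursive=False keeps the cells of nested tables out of the parent row
        cells = tr.find_all(["td", "th"], recursive=False)
        if cells:
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Hypothetical sample markup standing in for driver.page_source
sample_html = """
<table>
  <tr><th>Product</th><th>Stock</th></tr>
  <tr><td>Gasoline</td><td>1803404</td></tr>
  <tr><td>Naphtha</td><td>1241808</td></tr>
</table>
"""

for row in rows_from_html(sample_html):
    print("\t".join(row))
```

Returning a list of lists, rather than printing inside the loop, makes it easy to pass the rows on to something like `pd.DataFrame` later.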
1
Here is an example of how to get all the tables from the URL and turn them into pandas DataFrames:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://stats.paj.gr.jp/en/pub/current_en_n2.html"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0",
    "Referer": "https://stats.paj.gr.jp/en/pub/index.html",
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
for table in soup.select("table:not(:has(table))"):
    for td in table.tr.select("td"):
        td.name = "th"
    df = pd.read_html(StringIO(str(table)))[0]
    if not df.empty:
        u = table.find_previous("u").text
        print(u)
        print()
        print(df)
    print("-" * 80)
Prints:
--------------------------------------------------------------------------------
1.Refinery Operations
Unnamed: 0 Unnamed: 1 Current Week 18/Feb/2024-24/Feb Last Week 11/Feb/2024-17/Feb Change from Last Week Unnamed: 5 Unnamed: 6
0 Crude Input(KL) Crude Input(KL) 2765740 2655588 110152.0 NaN NaN
1 Topper Unit Capacity Topper Unit Capacity NaN NaN NaN NaN NaN
2 NaN Weekly Average Capacity(BPSD) 2839300 2731443 107857.0 NaN NaN
3 NaN Util. Rate against BPSD 87.5% 87.4% NaN NaN NaN
4 NaN Designed Capacity(BPCD) 3230400 3230400 0.0 NaN NaN
5 NaN Util. Rate against BPCD 76.9% 73.9% NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN
--------------------------------------------------------------------------------
2.Products Stocks(kl)
Unnamed: 0 Unnamed: 1 Current Week 18/Feb/2024-24/Feb Last Week 11/Feb/2024-17/Feb Change from Last Week
0 Gasoline NaN 1803404 1802295 1109
1 Naphtha NaN 1241808 1202555 39253
2 Jet NaN 750427 755151 -4724
3 Kerosene NaN 1672932 1602863 70069
4 Gas Oil(Diesel) NaN 1576961 1552293 24668
5 LSA Sul under 0.1% 335027 298286 36741
6 HSA Sul over 0.1% 421710 409817 11893
7 AFO NaN 756737 708103 48634
8 LSC Sul under 0.5% 676670 688519 -11849
9 HSC Sul over 0.5% 1167900 1169686 -1786
10 CFO NaN 1844570 1858205 -13635
11 Total NaN 9646839 9481465 165374
--------------------------------------------------------------------------------
...
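The selector `table:not(:has(table))` used above picks only tables that contain no other table. If the page wraps its data in an outer layout table, which that selector suggests but which is an assumption here, then `soup.find("table")` would return the wrapper, and that may be why the question's code printed so little. A small offline sketch with made-up markup (the `id` values are hypothetical) shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer layout table wrapping the real data table
nested_html = """
<table id="layout">
  <tr><td>
    <table id="data">
      <tr><td>Gasoline</td><td>1803404</td></tr>
      <tr><td>Jet</td><td>750427</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(nested_html, "html.parser")

# find() returns the first <table> in document order, i.e. the wrapper
first_table = soup.find("table")

# The CSS selector keeps only tables that have no table descendants
leaf_tables = soup.select("table:not(:has(table))")

leaf_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in leaf_tables[0].find_all("tr")
]

print(first_table["id"])
print([t["id"] for t in leaf_tables])
print(leaf_rows)
```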