Extracting text from a web page using BeautifulSoup

Posted 2024-05-29 03:18:36

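I'm trying to use Python to extract some data from https://markets.cboe.com/europe/equities/market_share/index/all/

Specifically, I want the total non-displayed market volume figure. I've tried several approaches with BeautifulSoup, but none of them gets me there.

Any ideas?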


2 Answers

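The problem is that the element ids change dynamically; otherwise I would simply select by id, but that doesn't work here. Assuming the value in the output is the one you're looking for, the following should work, but only for as long as the page content and layout stay the same, because it picks the cell by position.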

from bs4 import BeautifulSoup as bs
import requests

url = 'https://markets.cboe.com/europe/equities/market_share/index/all/'
page = requests.get(url)
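# Parse the response body (the 'lxml' parser requires the lxml package)
html = bs(page.text, 'lxml')
# Collect every <td class="idx_val"> cell on the page
total_volume = html.find_all('td', class_='idx_val')
# Position 645 held the wanted figure when this was written
print(total_volume[645].text)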

Output:
€4,378,517,621
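
Picking the cell by position (`total_volume[645]`) stops working as soon as a row is added or removed. Below is a minimal sketch of a label-based lookup instead, assuming the wanted figure sits in a row whose first cell holds a known label. The sample markup, the `idx_name` class, the "Total" label, and the helper names `value_for_label` and `parse_amount` are all hypothetical; only the `idx_val` class comes from the code above. It parses an inline string with the built-in `html.parser` so it runs without network access.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="idx_name">Venue A</td><td class="idx_val">€1,000</td></tr>
  <tr><td class="idx_name">Total</td><td class="idx_val">€4,378,517,621</td></tr>
</table>
"""

def value_for_label(html, label):
    """Return the text of the 'idx_val' cell in the row whose first cell equals label."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if cells and cells[0].get_text(strip=True) == label:
            value_cell = row.find("td", class_="idx_val")
            return value_cell.get_text(strip=True) if value_cell else None
    return None  # no row carries that label

def parse_amount(text):
    """Drop the currency symbol and thousands separators.

    Assumes a whole-number amount; a decimal point would be dropped too.
    """
    return int(re.sub(r"[^\d]", "", text))

raw = value_for_label(SAMPLE_HTML, "Total")
print(raw, parse_amount(raw))
```

On the live page the label text and row layout would need to be checked against the actual HTML before relying on this.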

I'd suggest giving the pandas HTML reader a try:

import pandas as pd

# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')

# Grab the second table found
df = results[1]
# Set the first column as the index
df = df.set_index(0)
# Switch columns and indexes
df = df.T
# Drop any columns that have no data in them
df = df.dropna(how='all', axis=1)
# Set the column under "Displayed Price Venues" as the index
df = df.set_index('Displayed Price Venues')
# Switch columns and indexes again
df = df.T

# Cosmetic: clear the index name (del df.index.name raises an error on recent pandas)
df.index.name = None

# Separate the three subtables; the row positions match the table as it stood when this was written
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
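
To see what the `set_index` / `.T` / `dropna` / `set_index` / `.T` chain does, here is a small sketch on made-up data. The toy frame `raw`, its venue names, the heading row, and the helper name `reshape` are all assumptions for illustration; the real table returned by `read_html` will be larger and may be laid out differently.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for results[1]: no header, integer column labels,
# a header-like first row, and a heading row with no data in it.
raw = pd.DataFrame([
    ["Displayed Price Venues", "Volume", "Share"],
    ["Venue A", "100", "40%"],
    ["Section heading", np.nan, np.nan],
    ["Venue B", "150", "60%"],
])

def reshape(table):
    step = table.set_index(0)              # first column becomes the row labels
    step = step.T                          # rows become columns
    step = step.dropna(how="all", axis=1)  # drop the data-free heading row (now a column)
    step = step.set_index("Displayed Price Venues")  # header values become the labels
    step = step.T                          # back to one row per venue
    step.index.name = None                 # cosmetic: clear the leftover index name
    return step

tidy = reshape(raw)
print(tidy)
```

The double transpose is only there so that `dropna` and the second `set_index` can operate on columns; the end result is one row per venue with the header row promoted to column names.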

The same pandas approach can also be written more compactly (same code, just without the steps broken out):

import pandas as pd

# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')

# Do all the stuff above in one go
df = results[1].set_index(0).T.dropna(how='all',axis=1).set_index('Displayed Price Venues').T

# Cosmetic: clear the index name (del df.index.name raises an error on recent pandas)
df.index.name = None

# Separate the three subtables; the row positions match the table as it stood when this was written
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
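
The hard-coded positions (`0:18`, `18:-1`) depend on the number of venues listed on a given day. As a hedged alternative, the sketch below splits on row labels instead. The toy frame, every label in it, and the helper name `split_at` are hypothetical; the real subtotal labels would have to be read off the actual table.

```python
import pandas as pd

# Hypothetical tidy table shaped like `df` above; all labels are made up.
df = pd.DataFrame(
    {"Volume": [100, 150, 250, 30, 20, 50, 300]},
    index=["Venue A", "Venue B", "Displayed Total",
           "Dark A", "Dark B", "Non-Displayed Total", "Market Total"],
)

def split_at(table, end_label):
    """Return (rows up to and including end_label, rows after it).

    Assumes end_label occurs exactly once in the index, so that
    get_loc returns a single integer position.
    """
    pos = table.index.get_loc(end_label)
    return table.iloc[: pos + 1], table.iloc[pos + 1:]

displayed, rest = split_at(df, "Displayed Total")
non_displayed, remainder = split_at(rest, "Non-Displayed Total")
total = remainder.iloc[-1]  # last remaining row, as in the code above

print(non_displayed)
```

If a label is missing, `get_loc` raises `KeyError`, which makes a layout change visible instead of silently returning the wrong rows.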
