Extracting text from a web page with BeautifulSoup

2 answers

User

Answer 1 · edited 2024-05-29 03:18:36

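The problem is that the id keeps changing dynamically; otherwise I would simply select by id, but that doesn't work. Assuming the output value below is what you are looking for, this should work, as long as the page content does not change.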

from bs4 import BeautifulSoup as bs
import requests

url = 'https://markets.cboe.com/europe/equities/market_share/index/all/'
page = requests.get(url)
html = bs(page.text, 'lxml')
# Collect every <td class="idx_val"> cell on the page
total_volume = html.find_all('td', class_='idx_val')
# The index is positional, so it breaks if cells are added or removed
print(total_volume[645].text)

Output:
€4,378,517,621
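
Since the fixed index 645 only holds while the page keeps the same number of cells, one possible alternative is to look the cell up by the label of its row. This is only a sketch: `SAMPLE_HTML`, the `idx_name` class, the "Total" label and the helper `value_for_label` are made up for illustration, and the real page's markup may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (assumption, not the actual HTML).
SAMPLE_HTML = """
<table>
  <tr><td class="idx_name">Venue A</td><td class="idx_val">€1,000</td></tr>
  <tr><td class="idx_name">Total</td><td class="idx_val">€4,378,517,621</td></tr>
</table>
"""

def value_for_label(html, label):
    """Return the text of the 'idx_val' cell in the row whose first cell equals label."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Compare the first cell of the row with the label we are looking for
        if cells and cells[0].get_text(strip=True) == label:
            value_cell = row.find("td", class_="idx_val")
            return value_cell.get_text(strip=True) if value_cell else None
    return None  # no row carries that label

total_volume = value_for_label(SAMPLE_HTML, "Total")
```

With the real page you would pass `page.text` instead of `SAMPLE_HTML` and adjust the label to whatever the row is actually called.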

User

Answer 2 · edited 2024-05-29 03:18:36

I'd suggest giving the pandas HTML reader a try:

import pandas as pd

# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')

# Grab the second table found
df = results[1]
# Set the first column as the index
df = df.set_index(0)
# Switch columns and indexes
df = df.T
# Drop any columns that have no data in them
df = df.dropna(how='all', axis=1)
# Set the column under "Displayed Price Venues" as the index
df = df.set_index('Displayed Price Venues')
# Switch columns and indexes again
df = df.T

# Aesthetic. Don't like having an index name myself!
# (`del df.index.name` raises AttributeError in recent pandas versions)
df.index.name = None

# Separate the three subtables from each other!  
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
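
To see what the reshaping steps above do without fetching the page, here is a minimal sketch on a small made-up frame. The data in `raw` (venue names, numbers, the "(section break)" row) is invented; it only mimics a headerless table with integer column labels like the ones `pd.read_html` returns here.

```python
import pandas as pd

# Made-up stand-in for results[1]: no header row, integer column labels,
# and one row that carries a label but no data.
raw = pd.DataFrame([
    ["Displayed Price Venues", "Value Traded", "Market Share"],
    ["Venue A", "100", "40%"],
    ["Venue B", "150", "60%"],
    ["(section break)", None, None],
    ["Total", "250", "100%"],
])

df = raw.set_index(0)                        # first column becomes the row labels
df = df.T                                    # original rows are now columns
df = df.dropna(how="all", axis=1)            # drops the empty "(section break)" column
df = df.set_index("Displayed Price Venues")  # the header texts become the index
df = df.T                                    # back to one row per venue
df.index.name = None                         # remove the leftover index name (0)

displayed = df.iloc[0:2]
total = df.iloc[-1]
```

The net effect is that rows without data are dropped and the "Displayed Price Venues" row is promoted to the column header.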

You can also do this more compactly (same code, just without breaking it into steps):

import pandas as pd

# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')

# Do all the stuff above in one go
df = results[1].set_index(0).T.dropna(how='all',axis=1).set_index('Displayed Price Venues').T

# Aesthetic. Don't like having an index name myself!
# (`del df.index.name` raises AttributeError in recent pandas versions)
df.index.name = None

# Separate the three subtables from each other!  
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
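
The hard-coded `iloc[0:18]` assumes there are exactly 18 displayed venues. If that count might change, one option is to split on row labels instead. The sketch below uses invented data, and the labels "Dark Pool X" and "Total" as well as the helper `split_at` are hypothetical; substitute whatever labels the real table uses.

```python
import pandas as pd

# Made-up data; the labels on the real page will differ.
df = pd.DataFrame(
    {"Value Traded": [100, 150, 30, 20, 300]},
    index=["Venue A", "Venue B", "Dark Pool X", "Dark Pool Y", "Total"],
)

def split_at(frame, first_non_displayed, total_label="Total"):
    """Split frame into (displayed, non_displayed, total) using row labels."""
    # get_loc returns an integer position when the label is unique in the index
    start = frame.index.get_loc(first_non_displayed)
    end = frame.index.get_loc(total_label)
    return frame.iloc[:start], frame.iloc[start:end], frame.loc[total_label]

displayed, non_displayed, total = split_at(df, "Dark Pool X")
```

This still depends on the labels staying the same, but it no longer depends on how many venues appear in each subtable.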
