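Scraping an artist's Billboard Hot 100 singles history with BeautifulSoup
I am trying to scrape everything related to an artist's singles and their chart performance from the artist's Billboard page. I am re-implementing a solution I saw elsewhere. It works up to a point, but once I reach the "Peak" column I cannot work out how to extract the peak date and the weeks on chart from the table. I basically want to scrape everything shown in the table on the site and eventually put it into a DataFrame, but I cannot get the last two columns. Any suggestions would be greatly appreciated. Thanks!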
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')
for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    #peak_date = ?
    #wks = ?
    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("___________________________________________________")
song: (Just Like) Starting Over
artist: John Lennon
debute: 11.01.80
peak: 1
peak date:
weeks:
3 Answers
0
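Generally, there are several ways to get elements out of an HTML document. You can keep chaining find/find_next as you have been doing. That approach works and can retrieve the weeks on chart and the peak date you are after.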
peak_date = res.find("a").find_next("a").text.strip()
wks = res.find("a").find_next("a").find_next("span").text.strip()
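A better approach, however, is to look the elements up directly by class name. That way your script keeps working even if the order of the elements changes, as long as the class names stay the same. It could look like this: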
peak_date = res.find("span", class_="artist-chart-row-peak-date").text.strip()
wks = res.find("span", class_="artist-chart-row-week-on-chart").text.strip()
The complete code would then be:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')
for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    # Sloppy solution by chaining find_next
    # peak_date = res.find("a").find_next("a").text.strip()
    # wks = res.find("a").find_next("a").find_next("span").text.strip()
    # Better solution by searching for elements with class name
    peak_date = res.find("span", class_="artist-chart-row-peak-date").text.strip()
    wks = res.find("span", class_="artist-chart-row-week-on-chart").text.strip()
    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("peak date: " + str(peak_date))
    print("weeks: " + str(wks))
    print("___________________________________________________")
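If a row ever lacks one of these cells, `res.find(...)` returns `None` and the chained `.text` raises an `AttributeError`. A minimal sketch of a tolerant lookup follows. The helper name `cell_text`, the HTML fragment, and all sample values are made up for illustration; only the class names come from the code above, and the real page markup is more deeply nested.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified fragment with placeholder values.
SAMPLE_HTML = """
<div class="o-chart-results-list-row">
  <h3>Example Song A</h3>
  <span class="artist-chart-row-peak-date">01.02.03</span>
  <span class="artist-chart-row-week-on-chart">22</span>
</div>
<div class="o-chart-results-list-row">
  <h3>Example Song B</h3>
  <span class="artist-chart-row-week-on-chart">3</span>
</div>
"""


def cell_text(row, class_name):
    """Return the stripped text of the first descendant of `row` that has
    `class_name`, or None when the row has no such element."""
    node = row.find(class_=class_name)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for row in soup.find_all("div", class_="o-chart-results-list-row"):
    rows.append({
        "song": row.h3.get_text(strip=True),
        "peak_date": cell_text(row, "artist-chart-row-peak-date"),
        "wks": cell_text(row, "artist-chart-row-week-on-chart"),
    })

for entry in rows:
    print(entry)
```

The second sample row has no peak-date cell, so its `peak_date` comes back as `None` instead of stopping the whole loop.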
0
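I would look at the page source to see where each column lives and use the class names to get the data (for example, the peak_date value is in the next <a> tag, and the weeks on chart are in the next <span> tag whose class is "artist-chart-row-week-on-chart").
The complete code to get the data you want is: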
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')
for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    peak_date = res.find('a').find_next('a').text.strip()
    wks = res.find_next('span','artist-chart-row-week-on-chart').text.strip()
    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("peak_date: "+ str(peak_date))
    print("wks: "+ str(wks))
    print("___________________________________________________")
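One caveat with `find_next`: it searches forward through the rest of the document, not only inside `res`, so when a row is missing an element it can silently return a value from a later row. The sketch below shows the difference from `find` on a made-up two-row fragment; the class names `row` and `wks` and the values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first row has no weeks cell at all.
SAMPLE_HTML = """
<div class="row"><a href="#">first link</a></div>
<div class="row"><a href="#">second link</a><span class="wks">7</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_row, second_row = soup.find_all("div", class_="row")

# find_next keeps walking past the end of first_row into the following row.
leaked = first_row.find_next("span", class_="wks")

# find only inspects descendants of first_row, so the gap shows up as None.
scoped = first_row.find("span", class_="wks")

print(leaked.get_text(), scoped)
```

On a page where every row is complete both calls return the same element, which is why the code above works; scoping the search with `find` just makes a missing cell visible.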
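Update!!
I will also show how to get what you want into a pandas DataFrame (the best practice, and probably the fastest approach, is to append dictionaries to a list and build the DataFrame from that list). I also parse the dates, in case you want to store them in a proper date format: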
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')
my_dictionary = []
for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    peak_date = res.find('a').find_next('a').text.strip()
    wks = res.find_next('span','artist-chart-row-week-on-chart').text.strip()
    my_dictionary.append({"song": song,
                          "artist": artist,
                          "debute": pd.to_datetime(debute, format='%m.%d.%y', errors='coerce'),
                          "peak": peak,
                          "peak_date": pd.to_datetime(peak_date, format='%m.%d.%y', errors='coerce'),
                          "wks": wks})
my_dataframe = pd.DataFrame.from_dict(my_dictionary)
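A note on the two-digit year in `'%m.%d.%y'`: Python's `strptime` maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, and pandas appears to follow the same convention. Chart dates from before 1969 would therefore land in the future. The following is a minimal standard-library sketch of one way to correct that, assuming a chart date can never be later than today; the function name `parse_chart_date` and its `today` parameter are hypothetical.

```python
from datetime import date, datetime


def parse_chart_date(text, today=None):
    """Parse an 'MM.DD.YY' string into a date.

    Returns None for empty or malformed input. `text` must be a str.
    """
    today = today or date.today()
    try:
        parsed = datetime.strptime(text.strip(), "%m.%d.%y").date()
    except ValueError:
        return None
    # strptime maps 00-68 to 2000-2068. A chart date cannot lie in the
    # future, so shift such results back by one century.
    if parsed > today:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


reference = date(2024, 1, 1)  # fixed reference day so the demo is repeatable
print(parse_chart_date("11.01.80", today=reference))
print(parse_chart_date("08.04.58", today=reference))
```

The same function can be applied to each scraped string before building the DataFrame, or afterwards with `Series.map`.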
1
Try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.billboard.com/artist/john-lennon/chart-history/hsi/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = []
for row in soup.select(".o-chart-results-list-row"):
    title = row.h3.get_text(strip=True)
    artist = row.span.get_text(strip=True)
    debut_date = row.select_one(".artist-chart-row-debut-date").get_text(strip=True)
    peak_pos = row.select_one(".artist-chart-row-peak-pos").get_text(strip=True)
    peak_week = row.select_one(".artist-chart-row-peak-week").get_text(strip=True)
    peak_date = row.select_one(".artist-chart-row-peak-date").get_text(strip=True)
    wks_on_chart = row.select_one(".artist-chart-row-week-on-chart").get_text(
        strip=True
    )
    data.append(
        {
            "Title": title,
            "Artist": artist,
            "Debut Date": debut_date,
            "Peak Pos": peak_pos,
            "Peak Week": peak_week,
            "Weeks on Chart": wks_on_chart,
        }
    )
df = pd.DataFrame(data)
print(df)
Output:
Title Artist Debut Date Peak Pos Peak Week Weeks on Chart
0 (Just Like) Starting Over John Lennon 11.01.80 1 5 WKS 22
1 Woman John Lennon 01.17.81 2 12 Wks 20
2 Watching The Wheels John Lennon 03.28.81 10 12 Wks 17
3 Whatever Gets You Thru The Night John Lennon With The Plastic Ono Nuclear Band 09.28.74 1 1 WKS 15
4 Nobody Told Me John Lennon 01.21.84 5 12 Wks 14
5 Instant Karma (We All Shine On) John Ono Lennon 02.28.70 3 12 Wks 13
6 MIND GAMES John Lennon 11.10.73 18 12 Wks 13
7 #9 Dream John Lennon 12.21.74 9 12 Wks 12
8 Cold Turkey Plastic Ono Band 11.15.69 30 12 Wks 12
9 Imagine John Lennon/Plastic Ono Band 10.23.71 3 12 Wks 9
10 Give Peace A Chance Plastic Ono Band 07.26.69 14 12 Wks 9
11 Power To The People John Lennon/Plastic Ono Band Yoko Ono/Plastic Ono Band 04.03.71 11 12 Wks 9
12 Stand By Me John Lennon 03.15.75 20 12 Wks 9
13 Mother John Lennon/Plastic Ono Band Yoko Ono/Plastic Ono Band 01.09.71 43 12 Wks 6
14 Happy Xmas (War Is Over) John & Yoko/The Plastic Ono Band With The Harlem Community Choir 12.29.18 38 12 Wks 6
15 I'm Steppin' Out John Lennon 03.31.84 55 12 Wks 6
16 Woman Is The Nigger Of The World John Lennon/Plastic Ono Band With Elephant's Memory 05.20.72 57 12 Wks 5
17 Jealous Guy John Lennon & The Plastic Ono Band 10.15.88 80 12 Wks 4
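All of the values in this output are strings, including the positions and week counts. If numeric columns are wanted, a small helper can pull the leading integer out of values such as "5 WKS" or "22". This is only a sketch; the name `leading_int` is hypothetical, and it assumes each value starts with the number of interest.

```python
import re


def leading_int(text):
    """Return the integer at the start of `text`, or None if there is none."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


print(leading_int("5 WKS"))
print(leading_int("12 Wks"))
print(leading_int("22"))
```

With pandas, the helper could be applied per column, for example `df["Weeks on Chart"].map(leading_int)`.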