Scraping an artist's Billboard Hot 100 singles history with BeautifulSoup

1 vote
3 answers
60 views
Asked 2025-04-13 14:30

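I'm trying to scrape all the information about an artist's singles and their chart performance from the artist's Billboard page. I'm reimplementing a solution I saw elsewhere. It works up to a point, but once I reach the peak position column I can't figure out how to extract the peak date and the weeks on chart from the table. Basically, I want to scrape everything shown in the table on the site and eventually put it into a DataFrame, but I can't get the last two columns. Any suggestions would be much appreciated. Thanks!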

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')

for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    #peak_date = ?
    #wks = ?

    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("___________________________________________________")

song: (Just Like) Starting Over
artist: John Lennon
debute: 11.01.80
peak: 1
peak date:
weeks:

3 Answers

0

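In general, there are several ways to get elements out of an HTML document. You can keep chaining find/find_next calls as you did before. That approach works and will get you the peak date and the weeks on chart: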

peak_date = res.find("a").find_next("a").text.strip()
wks = res.find("a").find_next("a").find_next("span").text.strip()

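A better approach, though, is to look the elements up directly by class name. That way your script keeps working even if the order of the elements changes, as long as the class names stay the same. It could look like this: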

peak_date = res.find("span", class_="artist-chart-row-peak-date").text.strip()
wks = res.find("span", class_="artist-chart-row-week-on-chart").text.strip()

The full code would then be:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')

for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()

    # Sloppy solution by chaining find_next
    # peak_date = res.find("a").find_next("a").text.strip()
    # wks = res.find("a").find_next("a").find_next("span").text.strip()

    # Better solution by searching for elements with class name
    peak_date = res.find("span", class_="artist-chart-row-peak-date").text.strip()
    wks = res.find("span", class_="artist-chart-row-week-on-chart").text.strip()

    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("peak date: " + str(peak_date))
    print("weeks: " + str(wks))
    print("___________________________________________________")
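If you want to try the class-name lookup without hitting the site, here is a minimal offline sketch. The HTML fragment is a made-up, simplified stand-in for one row of the page (the real markup may well differ), and `text_by_class` is a hypothetical helper, not part of BeautifulSoup. It returns None instead of raising when a class is missing from a row.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single chart row.
# Class names are taken from the answer above; values are placeholders.
SAMPLE_HTML = """
<div class="o-chart-results-list-row">
  <h3>(Just Like) Starting Over</h3>
  <span>John Lennon</span>
  <span class="artist-chart-row-debut-date"><a href="#">11.01.80</a></span>
  <span class="artist-chart-row-peak-pos">1</span>
  <span class="artist-chart-row-peak-date"><a href="#">12.27.80</a></span>
  <span class="artist-chart-row-week-on-chart">22</span>
</div>
"""


def text_by_class(row, class_name):
    """Return the stripped text of the first <span> with this class, or None."""
    node = row.find("span", class_=class_name)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = soup.find("div", class_="o-chart-results-list-row")

peak_date = text_by_class(row, "artist-chart-row-peak-date")
wks = text_by_class(row, "artist-chart-row-week-on-chart")
missing = text_by_class(row, "no-such-class")  # class absent from the row

print(peak_date, wks, missing)
```

Calling `.text` directly on the result of `find` raises an AttributeError when the element is absent, so a small guard like this can be handy if some rows are incomplete.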

0

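I would look at the page source to see where each column lives and use the class names to get the data. For example, the peak_date value is in the next <a> tag, and the weeks on chart are in the next <span> tag with the class "artist-chart-row-week-on-chart".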

The full code to get the data you want:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')

for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    peak_date = res.find('a').find_next('a').text.strip()
    wks = res.find_next('span','artist-chart-row-week-on-chart').text.strip()

    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("peak_date: "+ str(peak_date))
    print("wks: "+ str(wks))    
    print("___________________________________________________")

Update:

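Here is also how to get the result into a pandas DataFrame. The best practice, and probably the fastest approach, is to append dictionaries to a list and build the DataFrame from that list. I've also added date parsing in case you want to store the dates in a proper format: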

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')

my_dictionary = []

for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    peak_date = res.find('a').find_next('a').text.strip()
    wks = res.find_next('span','artist-chart-row-week-on-chart').text.strip()
    
    my_dictionary.append({"song": song, 
                          "artist": artist, 
                          "debute": pd.to_datetime(debute, format='%m.%d.%y', errors='coerce'), 
                          "peak": peak, 
                          "peak_date": pd.to_datetime(peak_date, format='%m.%d.%y', errors='coerce'), 
                          "wks": wks})
    
my_dataframe = pd.DataFrame.from_dict(my_dictionary)
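One thing to watch with the `'%m.%d.%y'` format is the two-digit year. In Python's `strptime`, `%y` maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, and I would expect pandas to apply the same pivot (worth verifying on your version). That is fine for this artist, but a debut date from the first years of the Hot 100 (1958 to 1968) would presumably come out a century late. The sketch below uses only the standard library to show the behaviour; `parse_chart_date` is a hypothetical helper that returns None on bad input, roughly mirroring `errors='coerce'`.

```python
from datetime import datetime


def parse_chart_date(text):
    """Parse an 'MM.DD.YY' string into a date; return None if it does not match."""
    try:
        return datetime.strptime(text, "%m.%d.%y").date()
    except (ValueError, TypeError):
        return None


print(parse_chart_date("11.01.80"))  # 1980-11-01
print(parse_chart_date("12.29.18"))  # 2018-12-29
print(parse_chart_date("07.26.68"))  # 2068-07-26 (two-digit year pivot)
print(parse_chart_date(""))          # None
```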

1

Try this:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.billboard.com/artist/john-lennon/chart-history/hsi/"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []
for row in soup.select(".o-chart-results-list-row"):
    title = row.h3.get_text(strip=True)
    artist = row.span.get_text(strip=True)
    debut_date = row.select_one(".artist-chart-row-debut-date").get_text(strip=True)
    peak_pos = row.select_one(".artist-chart-row-peak-pos").get_text(strip=True)
    peak_week = row.select_one(".artist-chart-row-peak-week").get_text(strip=True)
    peak_date = row.select_one(".artist-chart-row-peak-date").get_text(strip=True)
    wks_on_chart = row.select_one(".artist-chart-row-week-on-chart").get_text(
        strip=True
    )
    data.append(
        {
            "Title": title,
            "Artist": artist,
            "Debut Date": debut_date,
            "Peak Pos": peak_pos,
            "Peak Week": peak_week,
            "Weeks on Chart": wks_on_chart,
        }
    )


df = pd.DataFrame(data)
print(df)

This prints:

                               Title                                                            Artist Debut Date Peak Pos Peak Week Weeks on Chart
0          (Just Like) Starting Over                                                       John Lennon   11.01.80        1     5 WKS             22
1                              Woman                                                       John Lennon   01.17.81        2    12 Wks             20
2                Watching The Wheels                                                       John Lennon   03.28.81       10    12 Wks             17
3   Whatever Gets You Thru The Night                     John Lennon With The Plastic Ono Nuclear Band   09.28.74        1     1 WKS             15
4                     Nobody Told Me                                                       John Lennon   01.21.84        5    12 Wks             14
5    Instant Karma (We All Shine On)                                                   John Ono Lennon   02.28.70        3    12 Wks             13
6                         MIND GAMES                                                       John Lennon   11.10.73       18    12 Wks             13
7                           #9 Dream                                                       John Lennon   12.21.74        9    12 Wks             12
8                        Cold Turkey                                                  Plastic Ono Band   11.15.69       30    12 Wks             12
9                            Imagine                                      John Lennon/Plastic Ono Band   10.23.71        3    12 Wks              9
10               Give Peace A Chance                                                  Plastic Ono Band   07.26.69       14    12 Wks              9
11               Power To The People            John Lennon/Plastic Ono Band Yoko Ono/Plastic Ono Band   04.03.71       11    12 Wks              9
12                       Stand By Me                                                       John Lennon   03.15.75       20    12 Wks              9
13                            Mother            John Lennon/Plastic Ono Band Yoko Ono/Plastic Ono Band   01.09.71       43    12 Wks              6
14          Happy Xmas (War Is Over)  John & Yoko/The Plastic Ono Band With The Harlem Community Choir   12.29.18       38    12 Wks              6
15                  I'm Steppin' Out                                                       John Lennon   03.31.84       55    12 Wks              6
16  Woman Is The Nigger Of The World               John Lennon/Plastic Ono Band With Elephant's Memory   05.20.72       57    12 Wks              5
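
Note that the snippet above reads `peak_date` but never adds it to the dictionary, which is why the printed table has no Peak Date column. Here is a minimal sketch of one way to include it, driven by a label-to-selector mapping. The HTML fragment is a made-up, simplified stand-in for a single row, and the `FIELDS` mapping is my own naming, so treat both as assumptions rather than a description of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single chart row (placeholder values).
SAMPLE_HTML = """
<div class="o-chart-results-list-row">
  <h3>(Just Like) Starting Over</h3>
  <span>John Lennon</span>
  <span class="artist-chart-row-debut-date"><a href="#">11.01.80</a></span>
  <span class="artist-chart-row-peak-pos">1</span>
  <span class="artist-chart-row-peak-date"><a href="#">12.27.80</a></span>
  <span class="artist-chart-row-week-on-chart">22</span>
</div>
"""

# Column label -> CSS selector, using the class names from the answer above.
FIELDS = {
    "Debut Date": ".artist-chart-row-debut-date",
    "Peak Pos": ".artist-chart-row-peak-pos",
    "Peak Date": ".artist-chart-row-peak-date",
    "Weeks on Chart": ".artist-chart-row-week-on-chart",
}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

data = []
for row in soup.select(".o-chart-results-list-row"):
    record = {
        "Title": row.h3.get_text(strip=True),
        "Artist": row.span.get_text(strip=True),
    }
    for label, selector in FIELDS.items():
        node = row.select_one(selector)
        # Store None when a row lacks the element instead of raising.
        record[label] = node.get_text(strip=True) if node is not None else None
    data.append(record)

print(data[0])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(data)` exactly as in the answer.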
17                       Jealous Guy                                John Lennon & The Plastic Ono Band   10.15.88       80    12 Wks              4
