Scrape the link text of a tags in a Wikipedia div into a DataFrame list using BeautifulSoup

Asked 2025-04-14 15:39

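I'm currently learning to program... I want to scrape the text of some song links from Wikipedia; the links are "a" tags inside a div. However, I only get the first song link for each letter. Some of the links have no title attribute, so I extract the link text rather than the title. Any help would be greatly appreciated!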

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Category:Song_recordings_produced_by_John_Lennon'

data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
div = soup.find("div", {"class":"mw-category mw-category-columns"})


songs = []

for song in div:
    songs.append(song.find_next("a").text.strip())

print(songs)

Output:

['Air Talk', "Baby's Heartbeat", 'Cambridge 1969', 'Dear John (John Lennon song)', 
'Every Man Has a Woman Who Loves Him', 'F Is Not a Dirty Word', 'Gimme Some Truth', 
'Happy Xmas (War Is Over)', "I Don't Wanna Be a Soldier", 'Jamrag (song)', 
'Kiss Kiss Kiss (Yoko Ono song)', 'Listen, the Snow Is Falling', 'Many Rivers to Cross', 
'New York City (John Lennon and Yoko Ono song)', "O'Wind (Body Is the Scar of Your Mind)", 
'Paper Shoes', 'Radio Play (song)', 'Scared (John Lennon song)', 'Telephone Piece', 
'Waiting for the Sunrise (song)', 'Yang Yang (song)']

1 Answer


You can use this example to see how to get all 182 songs into a list:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Category:Song_recordings_produced_by_John_Lennon"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")


songs = [a.text for a in soup.select("#mw-pages li a")]

print(*songs, sep="\n")
print()
print(f"Songs total={len(songs)}")

Prints:


...

Yellow Girl (Stand by for Life)
Yes, I'm Your Angel
You (Yoko Ono song)
You Are Here (song)
You're the One (Yoko Ono song)

Songs total=182
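
As a rough illustration of why the loop in the question returned only one song per letter, and of how the resulting list could go into a DataFrame, here is a minimal sketch. It runs against a small inline HTML string rather than the live page; the markup, the `SAMPLE_HTML` name, the placeholder song entries, and the `song` column name are all assumptions made for the example, not the real page content. Iterating over a tag walks its direct children (here, one group per letter), and `find_next("a")` returns only the first link that follows each of them, whereas a CSS selector such as the one in the answer collects every link.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, trimmed-down stand-in for the category page markup
# (an assumption for illustration, not the real page).
SAMPLE_HTML = (
    '<div id="mw-pages">'
    '<div class="mw-category mw-category-columns">'
    '<div class="mw-category-group"><h3>A</h3><ul>'
    '<li><a href="#" title="Air Talk">Air Talk</a></li>'
    '<li><a href="#">Placeholder song A2</a></li>'
    '</ul></div>'
    '<div class="mw-category-group"><h3>B</h3><ul>'
    '<li><a href="#">Baby\'s Heartbeat</a></li>'
    '<li><a href="#">Placeholder song B2</a></li>'
    '</ul></div>'
    '</div></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", class_="mw-category")

# Same pattern as the question: each direct child of the div is one
# letter group, and find_next("a") yields only the first link after it.
first_only = [group.find_next("a").text.strip() for group in div]

# Selecting every <a> inside a list item collects all links instead.
all_links = [a.get_text(strip=True) for a in soup.select("#mw-pages li a")]

# One row per song; the column name "song" is an arbitrary choice.
df = pd.DataFrame({"song": all_links})

print(first_only)
print(df)
```

With the real page, the same DataFrame step could be applied to the `songs` list built in the answer above.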
