Getting a column from a Wikipedia table using BeautifulSoup

Posted 2024-05-13 17:50:30

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text, "html.parser")  # name the parser explicitly
tables = soup.find_all("table")

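I'm trying to get a list of song names from the "List of singles" table in Taylor Swift's discography.

The table has no unique class or id. The only unique thing I can think of is the caption tag around "List of singles...":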

List of singles as main artist, with selected chart positions, sales figures and certifications

I tried:

^{pr2}$

But it returns nothing. I assume caption is not a recognized tag in bs4?


2 Answers

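Here is a complete example that solves the "Taylor Swift problem". First, find the caption containing the text "List of singles" and move up to its parent, which is the table. Then iterate over the row-header cells, which hold the song titles: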

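# Find the caption that labels the singles table.
for caption in soup.find_all("caption"):
    if "List of singles" in caption.text:
        break

# The caption's parent is the <table> element.
table = caption.parent
# Song titles sit in the row headers: <th scope="row">.
for item in table.find_all("th", {"scope": "row"}):
    print(item.text)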

This gives:

^{pr2}$
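
The caption lookup above can also be wrapped in a small helper that returns None when no caption matches, instead of falling through to the last caption on the page. This is only a minimal sketch run against an inline HTML snippet; the helper name `find_table_by_caption` and the sample markup are assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia page.
SAMPLE_HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">"Fearless"</th><td>2008</td></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def find_table_by_caption(soup, needle):
    """Return the first <table> whose <caption> contains needle, else None."""
    for caption in soup.find_all("caption"):
        if needle in caption.get_text():
            return caption.find_parent("table")
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_by_caption(soup, "List of singles")
# Keep only the row headers; column headers use scope="col".
titles = [th.get_text(strip=True) for th in table.find_all("th", scope="row")]
print(titles)
```

Returning None makes a missing table an explicit case the caller can check, which matters if the page's captions are reworded.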

It actually has nothing to do with findAll() versus find_all(). findAll() was the BeautifulSoup 3 name and remains available for compatibility, as the bs4 source code shows:

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)

findAll = find_all       # BS3
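
One way to check this is to run both spellings against the same markup and compare the results. The following is a minimal sketch, assuming an inline snippet in place of the real page; it also shows that `caption` is found like any other tag. Newer bs4 releases may emit a deprecation warning for `findAll`, so the sketch silences warnings around that call.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical sample markup; <caption> is parsed like any other tag.
html = (
    "<table><caption>List of singles</caption>"
    "<tr><th scope='row'>Mine</th></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

new_style = soup.find_all("caption")
with warnings.catch_warnings():
    # Newer bs4 releases may warn that the BS3 spelling is deprecated.
    warnings.simplefilter("ignore")
    old_style = soup.findAll("caption")

# Compare the matched captions by their text.
same = [t.get_text() for t in new_style] == [t.get_text() for t in old_style]
print(len(new_style), same)
```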

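Moreover, there is a better way to get the list of singles: rely on the span element with id="Singles", which marks the start of the Singles section. Then use ^{} to get the first table that follows the span tag's parent. Finally, get all th elements with scope="row":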

^{pr2}$

Prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"
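
The span-based approach described in this answer can be sketched as follows. This is an illustration under stated assumptions, not the answer's original code: the inline markup is a hypothetical stand-in for the page, and `find_next_sibling()` is assumed to be the lookup the answer refers to. The live page's markup may have changed since the answer was written.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mimicking the layout described above:
# a heading that contains <span id="Singles">, followed by the table.
SAMPLE_HTML = """
<h2><span id="Albums">Albums</span></h2>
<table><tr><th scope="row">"Fearless"</th></tr></table>
<h2><span id="Singles">Singles</span></h2>
<table>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Love Story"</th></tr>
  <tr><th scope="row">"Safe &amp; Sound"<br>(featuring The Civil Wars)</th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The span sits inside the section heading, so step up to its parent.
heading = soup.find("span", id="Singles").parent
# Assumption: find_next_sibling() is the lookup the answer refers to.
table = heading.find_next_sibling("table")
# Join a title and its "featuring" note with a single space.
singles = [th.get_text(" ", strip=True) for th in table.find_all("th", scope="row")]
print(singles)
```

Anchoring on the section id avoids depending on the exact caption wording, at the cost of depending on the heading structure instead.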
