Web scraping the most common names

[(Anna Pavlovna, 7), (the prince, 7), (the Empress, 3), (The prince, 3), (Prince Vasili, 2)]

import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(html, 'html.parser')
nameList = soup.findAll("span", {"class": "green"})  # may use bsObj.find_all()
fdist1 = nltk.FreqDist(nameList)
fdist1.most_common(5)

2 Answers

User

Answer 1 · edited 2024-04-27 21:27:45

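The page returns a 502 Bad Gateway error for me, but I think I know what your problem is. When you use findAll, it gives you bs4 elements rather than strings. You therefore need to convert each element to a string by calling get_text() on it. See the documentation.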

items = soup.findAll("span", {"class": "green"})
texts = [item.get_text() for item in items]
# Now you have the texts of the span elements

By the way, your code sample is incorrect, because bsObj is never defined.

User

Answer 2 · edited 2024-04-27 21:27:45

Just change this line:

nameList = soup.findAll("span", {"class":"green"})

to this:

nameList = [tag.text for tag in soup.findAll("span", {"class":"green"})]

The findAll function returns a list of tags. To get the text inside each tag, use its text attribute.
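For anyone who wants to try this without hitting the live page, here is a minimal sketch of the same fix. It assumes a small inline HTML string (SAMPLE_HTML, made up for illustration) in place of warandpeace.html, and it swaps in collections.Counter for nltk.FreqDist, which offers the same most_common() method. The helper name most_common_names is hypothetical.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so the example runs offline.
SAMPLE_HTML = """
<p><span class="green">Anna Pavlovna</span> spoke to
<span class="green">the prince</span>. Later
<span class="green">Anna Pavlovna</span> left.
<span class="red">not a name</span></p>
"""


def most_common_names(html, n=5):
    """Count the text of every <span class="green"> and return the top n."""
    soup = BeautifulSoup(html, "html.parser")
    # Extract plain strings; counting the Tag objects themselves is what
    # caused the problem in the question.
    names = [tag.get_text() for tag in soup.find_all("span", {"class": "green"})]
    return Counter(names).most_common(n)


print(most_common_names(SAMPLE_HTML))
```

Passing the same list of strings to nltk.FreqDist instead of Counter should work the same way, since the counting is done on plain strings in both cases.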
