BeautifulSoup and if/else statements

from bs4 import BeautifulSoup
import requests
import re

page = 'https://news.google.com/news/headlines?gl=US&ned=us&hl=en'  # main page
# url = raw_input("Enter a website to extract the URL's from: ")

r = requests.get(page)  # request the HTML document
data = r.text  # HTML text of the response
soup = BeautifulSoup(data, "html.parser")  # parse data with BS

for link in soup.find_all('a'):
    href = link.get('href')
    # keep only links containing /news/; skip <a> tags that have no href
    if href and '/news/' in href:
        print(href)

https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin 0
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost 1
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost 2
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration 3
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration 4

1 Answer

User

#1 · Posted 2024-05-29 03:38:01

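Based on your code and the link you provided, the results of the BeautifulSoup find_all search appear to contain duplicates. You would need to inspect the HTML structure to understand why duplicates are returned (see the find_all search options in the documentation for ways to filter some of them out). If you just want a quick fix that removes the duplicates from the printed results, you can use a modified loop with a set that tracks the entries already seen (adapted from a linked source). First, confirm that duplicates exist: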

In [78]: l = [link.get('href') for link in soup.find_all('a') if '/news/' in (link.get('href') or '')]

In [79]: any(l.count(x) > 1 for x in l)
Out[79]: True

The output above shows that the list contains duplicates. To remove them, use something like:

^{pr2}$
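A minimal sketch of the set-based approach described in this answer, not the answerer's original code. The function name `unique_news_links`, its `marker` parameter, and the `example.com` URLs are hypothetical and used only for illustration. It keeps the first occurrence of each matching href and preserves the original order:

```python
def unique_news_links(hrefs, marker="/news/"):
    """Return hrefs that contain `marker`, first occurrence only, in order."""
    seen = set()   # hrefs that have already been kept
    unique = []
    for href in hrefs:
        # link.get('href') returns None for <a> tags without an href
        if not href or marker not in href:
            continue
        if href in seen:
            continue  # duplicate, skip it
        seen.add(href)
        unique.append(href)
    return unique


# Stand-in data (hypothetical URLs); in the question these values would
# come from link.get('href') for each link in soup.find_all('a').
sample = [
    "https://example.com/news/a",
    "https://example.com/news/b",
    "https://example.com/news/b",
    None,
    "https://example.com/about",
    "https://example.com/news/a",
]

for href in unique_news_links(sample):
    print(href)
```

With the soup object from the question, this could be called as `unique_news_links(link.get('href') for link in soup.find_all('a'))`. A set gives constant-time membership checks, while the separate list keeps the links in the order they appear on the page.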
