I am learning how to use BeautifulSoup, and I have run into a double-printing problem in a loop I wrote.
Any insight would be greatly appreciated!
from bs4 import BeautifulSoup
import requests
import re

page = 'https://news.google.com/news/headlines?gl=US&ned=us&hl=en'  # main page
# url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(page)  # request the HTML document
data = r.text  # HTML text of the response
soup = BeautifulSoup(data, "html.parser")  # parse data with BS
for link in soup.find_all('a'):
    href = link.get('href')  # None when the <a> tag has no href
    # print only links that contain /news/
    if href and '/news/' in href:
        print(href)
Example output:
https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin
0
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
1
https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost
2
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
3
https://cointelegraph.com/news/senior-vp-says-ebay-seriously-considering-bitcoin-integration
4
For some reason, my code always prints the same URL twice...
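Based on your code and the link you provided, the results of the BeautifulSoup find_all search appear to contain duplicates. You would need to inspect the HTML structure to understand why duplicates are returned (check the find_all search options in the documentation for ways to filter some of them out). But if you want a quick fix and only need to remove duplicates from the printed results, you can modify the loop to keep a set of the entries already seen (based on this). The output above shows that the list contains duplicates. To remove them, record each href in a set as you print it and skip any href that is already in the set.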
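A minimal sketch of the set-based approach described above. The helper name unique_news_links and the sample list are illustrative assumptions, not part of the original answer, and the example.com entry is a made-up non-matching link. The sample list stands in for the hrefs that find_all would return, so the snippet runs without a network request.

```python
def unique_news_links(hrefs, marker='/news/'):
    """Yield each href containing `marker` once, in first-seen order."""
    seen = set()  # hrefs that have already been yielded
    for href in hrefs:
        # Skip anchors without an href (Tag.get returns None) and
        # links that do not contain the marker.
        if not href or marker not in href:
            continue
        if href in seen:
            continue  # duplicate of an earlier entry
        seen.add(href)
        yield href


# Hypothetical stand-in data mirroring the duplicated output above.
sample = [
    'https://cointelegraph.com/news/woman-in-denmark-imprisoned-for-hiring-hitman-using-bitcoin',
    'https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost',
    'https://cointelegraph.com/news/ethereum-price-hits-all-time-high-of-750-following-speed-boost',
    None,
    'https://example.com/about',
]

for href in unique_news_links(sample):
    print(href)
```

With the real page, you would pass (a.get('href') for a in soup.find_all('a')) instead of sample. A set makes each membership check fast, and yielding in first-seen order keeps the links in the order they appear on the page.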