Python: scraping news articles daily from a site that has no feed

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re


def getLinks(url):
    USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
    request = Request(url)
    request.add_header('User-Agent', USER_AGENT)
    response = urlopen(request)
    content = response.read().decode('utf-8')
    response.close()
    soup = BeautifulSoup(content, "html.parser")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links


print(getLinks("https://www.jugantor.com/"))

1 Answer

User

#1 · Posted 2024-04-26 18:27:18

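I'm not sure I understood the question correctly, but the first thing I notice is {'href': re.compile("^http://")}.

With that pattern you will miss all https links and all relative links. Relative links can probably be skipped here without any problem (I guess…), but the https links obviously cannot. So, first thing: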

{'href': re.compile("^https?://")}
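
To see what the changed pattern does, here is a small sketch that needs no network access. The sample hrefs and the helper name is_absolute_link are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: the "s?" makes the "s" optional,
# so both http:// and https:// prefixes match.
ABSOLUTE_LINK = re.compile(r"^https?://")


def is_absolute_link(href):
    """Return True if href starts with http:// or https://."""
    return bool(ABSOLUTE_LINK.match(href))


# Hypothetical sample hrefs, for illustration only.
samples = [
    "http://example.com/a",
    "https://example.com/b",
    "/lifestyle/relative-path",
    "ftp://example.com/file",
]

# Keep only the absolute http/https links; relative links are skipped.
absolute = [href for href in samples if is_absolute_link(href)]
print(absolute)
```

If relative links turn out to matter after all, urllib.parse.urljoin from the standard library can resolve them against the page URL before filtering.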

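Then, to avoid downloading and parsing the same URLs every day, you can extract the article's id (in https://www.jugantor.com/lifestyle/19519/%E0%...A7%87 the id is 19519), save it in a database, and check whether that id already exists before scraping the page.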
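
A rough sketch of that id check, assuming article URLs always look like https://www.jugantor.com/<section>/<numeric id>/<slug> (inferred from the single example above, so it may not hold for every page). SQLite is used only because it ships with Python; the table name seen_articles and all function names are invented for this example.

```python
import re
import sqlite3

# Assumed URL shape: https://www.jugantor.com/<section>/<numeric id>/<slug>
ARTICLE_ID = re.compile(r"^https?://www\.jugantor\.com/[^/]+/(\d+)(?:/|$)")


def extract_article_id(url):
    """Return the numeric article id from the URL, or None if there is none."""
    match = ARTICLE_ID.match(url)
    return int(match.group(1)) if match else None


def open_store(path=":memory:"):
    """Open (or create) the database that remembers scraped article ids."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_articles (id INTEGER PRIMARY KEY)"
    )
    return conn


def is_seen(conn, article_id):
    """Return True if this article id was already stored."""
    row = conn.execute(
        "SELECT 1 FROM seen_articles WHERE id = ?", (article_id,)
    ).fetchone()
    return row is not None


def mark_seen(conn, article_id):
    """Remember the article id so it is skipped on later runs."""
    conn.execute(
        "INSERT OR IGNORE INTO seen_articles (id) VALUES (?)", (article_id,)
    )
    conn.commit()


# In-memory database for the demo; pass a file path to persist between runs.
conn = open_store()
url = "https://www.jugantor.com/lifestyle/19519/example-slug"  # hypothetical slug
article_id = extract_article_id(url)
if article_id is not None and not is_seen(conn, article_id):
    # Download and parse the article here, then record its id.
    mark_seen(conn, article_id)
```

For a daily job, pass a file path such as "articles.db" to open_store so the ids survive between runs.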

One last thing, and I'm not sure whether it's useful: the URL https://www.jugantor.com/todays-paper/ makes me think you should be able to find only today's news there.
