I built a security-news website scraper, but the links are duplicated

Published 2024-05-07 23:42:25


I am building a web scraper that scrapes several websites so that I don't have to visit them directly.

Right now I have a problem with duplicate URLs: the script does what I want, but the links get duplicated in the output, and I don't want that.

Here is my code:

def HackerNews():
    hackerNews = ['https://www.darkreading.com/attacks-breaches.asp',
                  'https://www.darkreading.com/application-security.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp',
                  'https://www.darkreading.com/endpoint-security.asp',
                  'https://www.darkreading.com/IoT.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp']
    keywords = ["bitcoin", "bit", "BTC", "Bit", "Security", "Attack", "Breach", "Cyber",
                "Ransomware", "Botnet", "Worm", "Hacked", "Hack", "Hackers", "Flaw", "Risk", "Danger"]

    for link in hackerNews:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Mozilla 5.0')
        websitecontent = urllib2.urlopen(request).read()
        soup = BeautifulSoup(websitecontent, 'html.parser')

        headers = soup.findAll('header', {'class': 'strong medium'})

        for h in headers:
            a = h.find("a")

            for keyword in keywords:
                if keyword in a["title"]:
                    print("Title: " + a["title"] + " \nLink: " + "https://darkreading.com" + a["href"])

HackerNews()

Here is a sample of the output:

Title: Android Ransomware Kits on the Rise in the Dark Web
Link: https://darkreading.com/mobile/android-ransomware-kits-on-the-rise-in-the-dark-web-/d/d-id/1330591

Title: Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin
Link: https://darkreading.com/cloud/bitcoin-miner-nicehash-hacked-possibly-losing-$62-million-in-bitcoin/d/d-id/1330585

Title: Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin
Link: https://darkreading.com/cloud/bitcoin-miner-nicehash-hacked-possibly-losing-$62-million-in-bitcoin/d/d-id/1330585

Title: Bitcoin Miner NiceHash Hacked, Possibly Losing $62 Million in Bitcoin
Link: https://darkreading.com/cloud/bitcoin-miner-nicehash-hacked-possibly-losing-$62-million-in-bitcoin/d/d-id/1330585

Title: Uber Used $100K Bug Bounty to Pay Off, Silence Florida Hacker: Report
Link: https://darkreading.com/attacks-breaches/uber-used-$100k-bug-bounty-to-pay-off-silence-florida-hacker-report/d/d-id/1330584


1 Answer
User
#1 · Posted 2024-05-07 23:42:25

Well, instead of printing directly, you could build a dictionary containing all the links, or, if you want them ordered, a list of tuples. Before appending, you can check whether the entry is already in the list.
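The check-before-append idea can be sketched on dummy data first (the titles and links here are made up for illustration, not real Dark Reading entries):

```python
# Deduplicate (title, href) pairs by checking membership before appending.
seen = []  # list of (title, href) tuples, kept in insertion order
candidates = [
    ("NiceHash Hacked", "/cloud/nicehash-hacked"),        # hypothetical entry
    ("NiceHash Hacked", "/cloud/nicehash-hacked"),        # duplicate, skipped
    ("Android Ransomware", "/mobile/android-ransomware"),
]
for pair in candidates:
    if pair not in seen:  # linear scan; fine for a few hundred links
        seen.append(pair)
print(seen)
```

The same membership test is what the code below applies to the scraped `(title, href)` tuples.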

def HackerNews():
    hackerNews = ['https://www.darkreading.com/attacks-breaches.asp',
                  'https://www.darkreading.com/application-security.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp',
                  'https://www.darkreading.com/endpoint-security.asp',
                  'https://www.darkreading.com/IoT.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp']
    keywords = ["bitcoin", "bit", "BTC", "Bit", "Security", "Attack", "Breach", "Cyber",
                "Ransomware", "Botnet", "Worm", "Hacked", "Hack", "Hackers", "Flaw", "Risk", "Danger"]

    output = []

    for link in hackerNews:
        request = urllib2.Request(link)
        request.add_header('User-Agent', 'Mozilla 5.0')
        websitecontent = urllib2.urlopen(request).read()
        soup = BeautifulSoup(websitecontent, 'html.parser')

        headers = soup.findAll('header', {'class': 'strong medium'})

        for h in headers:
            a = h.find("a")

            for keyword in keywords:
                if keyword in a["title"]:
                    if (a["title"], a["href"]) not in output:
                        output.append((a["title"], a["href"]))

    # Print once, after all sites have been scraped, so entries are not repeated
    for title, href in output:
        print("Title: " + title + " \nLink: " + "https://darkreading.com" + href)

HackerNews()

I haven't cleaned up the formatting or tested this, but it should get my point across :)


Edit: a working version for Python 3:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup


def HackerNews():
    hackerNews = ['https://www.darkreading.com/attacks-breaches.asp',
                  'https://www.darkreading.com/application-security.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp',
                  'https://www.darkreading.com/endpoint-security.asp',
                  'https://www.darkreading.com/IoT.asp',
                  'https://www.darkreading.com/vulnerabilities-threats.asp']
    keywords = ["bitcoin", "bit", "BTC", "Bit", "Security", "Attack", "Breach", "Cyber",
                "Ransomware", "Botnet", "Worm", "Hacked", "Hack", "Hackers", "Flaw", "Risk", "Danger"]

    output = []

    for link in hackerNews:
        request = Request(link)
        request.add_header('User-Agent', 'Mozilla 5.0')
        websitecontent = urlopen(request).read()
        soup = BeautifulSoup(websitecontent, 'html.parser')

        headers = soup.findAll('header', {'class': 'strong medium'})

        for h in headers:
            a = h.find("a")

            for keyword in keywords:
                if keyword in a["title"]:
                    if (a["title"], a["href"]) not in output:
                        output.append((a["title"], a["href"]))

    # Print once, after all sites have been scraped, so entries are not repeated
    for title, href in output:
        print("Title: " + title + " \nLink: " + "https://darkreading.com" + href)

HackerNews()
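If ordering still matters but the link list grows large, the `not in output` check above becomes a linear scan per candidate. A set makes the membership test O(1) on average while a list preserves order; a minimal sketch of that variant (the scraping part is elided, and the function name is my own):

```python
def dedupe_links(pairs):
    """Return (title, href) pairs with duplicates removed, preserving order."""
    seen = set()   # O(1) average-case membership test
    output = []    # preserves first-seen order for printing
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            output.append(pair)
    return output

print(dedupe_links([("a", "/1"), ("a", "/1"), ("b", "/2")]))
# → [('a', '/1'), ('b', '/2')]
```

Tuples of strings are hashable, so they can go straight into the set; with a plain dictionary keyed by title you would silently overwrite entries that share a title but differ in link.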
