I wrote a script that collects all the links on my website (internal and external) and reports the broken ones.
Here is my code, and it runs fine:
    import requests
    from urllib.parse import urlparse, urljoin  # Python 3 (the module was called `urlparse` in Python 2)
    from bs4 import BeautifulSoup

    # initialize the sets of links (a set keeps them unique)
    internal_urls = set()
    external_urls = set()
    # number of URLs visited so far
    total_urls_visited = 0
    total_broken_link = set()
    output = 'output.txt'


    def is_valid(url):
        """
        Checks whether `url` is a valid URL.

        Almost any value evaluates to True if it has some sort of content.
        Every URL follows the format: <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
        Example: http://www.example.com/index?search=src
        Here www.example.com is the netloc, index is the path,
        search is the query parameter, and src is the value passed to it.
        This makes sure a proper scheme (protocol, e.g. http or https)
        and a domain name are present in the URL.
        """
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)


    def get_all_website_links(url):
        """
        Returns all URLs found on `url` that belong to the same website.
        """
        # all URLs of `url`; a set avoids redundant links
        urls = set()
        # domain name of the URL without the protocol, to tell internal from external links
        domain_name = urlparse(url).netloc
        # BeautifulSoup pulls data out of HTML or XML files
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for a_tag in soup.find_all("a"):
            href = a_tag.get("href")
            if href == "" or href is None:
                # empty href attribute
                continue
            # resolve relative links against the current page
            href = urljoin(url, href)
            if not is_valid(href):
                # not a valid URL
                continue
            if href in internal_urls:
                # already in the set
                continue
            if domain_name not in href:
                # external link
                if href not in external_urls:
                    is_broken_link(href)
                    external_urls.add(href)
                continue
            # internal link
            is_broken_link(href)
            urls.add(href)
            internal_urls.add(href)
        return urls


    def is_broken_link(url):
        try:
            status = requests.get(url).status_code
        except requests.RequestException:
            # unreachable hosts, timeouts, etc. also count as broken
            status = None
        if status != 200:
            print(url)
            total_broken_link.add(url)
            return True
        return False


    def crawl(url, max_urls=80):
        """
        Crawls a web page and extracts all links.
        You'll find all links in the `external_urls` and `internal_urls` global sets.
        params:
            max_urls (int): maximum number of URLs to crawl.
        """
        global total_urls_visited
        total_urls_visited += 1
        links = get_all_website_links(url)
        for link in links:
            if total_urls_visited > max_urls:
                break
            crawl(link, max_urls=max_urls)


    if __name__ == "__main__":
        crawl('https://www.example.com/')
        print('Total external links: ' + str(len(external_urls)))
        print('Total internal links: ' + str(len(internal_urls)))
        print('Total: ' + str(len(external_urls) + len(internal_urls)))
        print('Be careful: ' + str(len(total_broken_link)) + ' broken links found!')
When I run the script, it returns all the broken links and the number of broken links.
But I would also like to display, for each broken link, the page it appears on.
For example, if I find the broken link https://www.example.com/brokenlink (an internal broken link) or https://www.otherwebsite.com/brokenlink (an external broken link),
I want to know where in my site those broken links are referenced, that is, on which page, so I can fix the problem. If I know where they appear, I can easily find and remove them so the problem does not come back.
So I would like the script to show each broken link together with the page it was found on, and then the number of broken links.
I hope that is clear enough.
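One way to sketch this (not a definitive answer, just a minimal approach): inside `get_all_website_links`, the `url` parameter is exactly the page currently being parsed, so you can pass it along as a second argument, e.g. `is_broken_link(href, url)`, and keep a dict that maps each broken link to the set of pages it was seen on. The snippet below demonstrates only that bookkeeping, with hypothetical names (`broken_link_sources`, `record_broken_link`) and hard-coded URLs standing in for a real crawl, so it runs without any network access:

```python
from collections import defaultdict

# maps each broken link to the set of pages it was found on
broken_link_sources = defaultdict(set)
total_broken_link = set()


def record_broken_link(url, source_page):
    """Remember that the broken link `url` was found on `source_page`."""
    total_broken_link.add(url)
    broken_link_sources[url].add(source_page)


# simulated crawl: pretend these broken links were discovered on these pages;
# in the real script, `source_page` would be the `url` argument of
# get_all_website_links at the moment is_broken_link detects the failure
record_broken_link('https://www.example.com/brokenlink',
                   'https://www.example.com/')
record_broken_link('https://www.otherwebsite.com/brokenlink',
                   'https://www.example.com/about')
record_broken_link('https://www.otherwebsite.com/brokenlink',
                   'https://www.example.com/')

# final report: each broken link with every page that references it
for broken, sources in sorted(broken_link_sources.items()):
    print(broken + ' found on: ' + ', '.join(sorted(sources)))
print('Be careful: ' + str(len(total_broken_link)) + ' broken links found!')
```

A broken link can of course appear on several pages, which is why the dict value is a set rather than a single page.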