A web crawler to download all the links on a page
I'm a Python beginner and wrote the code below to download all the links from a given URL. Is there a better way to do this, and is the code correct?
#!/usr/bin/python3
import re
import requests

def get_page(url):
    r = requests.get(url)
    print(r.status_code)
    content = r.text
    return content

if __name__ == "__main__":
    url = 'http://developer.android.com'
    content = get_page(url)
    content_pattern = re.compile('<a href=(.*?)>.*?</a>')
    result = re.findall(content_pattern, content)
    for link in result:
        with open('download.txt', 'wb') as fd:
            for chunk in r.iter_content(chunk_size):
                fd.write(chunk)
2 Answers
2
Try this:
from bs4 import BeautifulSoup
import sys
import requests

def get_links(url):
    r = requests.get(url)
    contents = r.content
    soup = BeautifulSoup(contents, 'html.parser')
    links = []
    for link in soup.findAll('a'):
        try:
            links.append(link['href'])
        except KeyError:
            # skip <a> tags that have no href attribute
            pass
    return links

if __name__ == "__main__":
    url = sys.argv[1]
    print(get_links(url))
    sys.exit()
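Assuming you save this as, say, get_links.py (the filename is just an example), you would run it as python3 get_links.py http://developer.android.com and it prints the list of href values extracted from that page.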
1
You could take a look at the wget command on Linux; it already does what you need. If you really want to solve this in Python, then mechanize and Beautiful Soup are the tools that can make the network requests and parse the HTML pages for you.
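If you do go the Python route, here is a minimal sketch of that idea. It uses requests and BeautifulSoup (as in the code above) rather than mechanize, and an assumed layout of one output file per link; it does roughly what the loop in the question appears to intend: resolve each href against the base URL and stream the linked page to disk.

#!/usr/bin/python3
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def download_links(url, out_dir='downloads'):
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for i, a in enumerate(soup.find_all('a', href=True)):
        link = urljoin(url, a['href'])  # resolve relative links against the page URL
        if urlparse(link).scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, fragment-only links, etc.
        r = requests.get(link, stream=True)
        with open(os.path.join(out_dir, 'page_%d.html' % i), 'wb') as fd:
            for chunk in r.iter_content(chunk_size=8192):  # stream instead of loading each page into memory
                fd.write(chunk)

if __name__ == "__main__":
    download_links('http://developer.android.com')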