Detecting HTTP response encoding with aiohttp
I'm learning asyncio and using it to build an asynchronous web crawler. Below is a simple crawler I wrote to test the framework:
import asyncio, aiohttp
from bs4 import BeautifulSoup

@asyncio.coroutine
def fetch(url):
    with (yield from sem):
        print(url)
        response = yield from aiohttp.request('GET', url)
        response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def get_links(url):
    page = yield from fetch(url)
    soup = BeautifulSoup(page)
    links = soup.find_all('a', href=True)
    return [link['href'] for link in links if link['href'].find('www') != -1]

@asyncio.coroutine
def crawler(seed, depth, max_depth=3):
    while True:
        if depth > max_depth:
            break
        links = yield from get_links(seed)
        depth += 1
        coros = [asyncio.Task(crawler(link, depth)) for link in links]
        yield from asyncio.gather(*coros)

sem = asyncio.Semaphore(5)
loop = asyncio.get_event_loop()
loop.run_until_complete(crawler("http://www.bloomberg.com", 0))
While the asyncio documentation looks fairly complete, aiohttp's documentation seems sparse, so I'm having trouble working some things out on my own. First, is there a way to detect the encoding of a page's response? Second, can I ask for connections to be kept alive within a session, or is that already the default, as it is in requests?
1 Answer
You can check response.headers['Content-Type'], or use the chardet library to handle malformed HTTP responses. The response body is a bytes string.
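As a minimal sketch of the Content-Type approach: the charset, when the server sends one, is a parameter of that header, and the standard library's email parser can pull it out. `charset_from_content_type` is a hypothetical helper name, not part of aiohttp:

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value,
    falling back to a default when the server doesn't declare one."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # utf-8
```

When the header is missing or lies about the encoding, that's where a statistical detector like chardet comes in, run over the raw bytes instead.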
For keep-alive connections, you should use a connector like this:
connector = aiohttp.TCPConnector(share_cookies=True)

response1 = yield from aiohttp.request('get', url1, connector=connector)
body1 = yield from response1.read_and_close()
response2 = yield from aiohttp.request('get', url2, connector=connector)
body2 = yield from response2.read_and_close()
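Since read_and_close() hands you raw bytes, you still need to decode them yourself. A sketch of a decoding helper with fallbacks (`decode_body` is a hypothetical name; the fallback order is an assumption, not anything aiohttp prescribes):

```python
def decode_body(body, declared=None):
    """Decode response bytes: try the charset declared in the headers first,
    then common fallbacks, and finally replace undecodable bytes."""
    for enc in filter(None, (declared, "utf-8", "latin-1")):
        try:
            return body.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset; try the next candidate
    return body.decode("utf-8", errors="replace")

print(decode_body("héllo".encode("latin-1")))  # héllo (utf-8 fails, latin-1 succeeds)
```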