Importing data (BibTeX) from Google Scholar using cookies
Here is the code:
import cookielib
import urllib2
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0'}
url='http://scholar.google.co.in/scholar_setprefs?sciifh=1&scisig=AAGBfm0AAAAAU9jcmEN2h2yuBuZqQK8Es5dQG3ksjutw&inststart=0&num=10&scis=yes&scisf=4&hl=en&lang=all&instq=&save='
filename = "cookies.txt"
request = urllib2.Request(url, None, headers)
cookies = cookielib.MozillaCookieJar(filename, None, None)
cookies.load()
cookie_handler= urllib2.HTTPCookieProcessor(cookies)
redirect_handler= urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)
response = opener.open(request)
print response.read()
Error output:
C:\Python27\lib\_MozillaCookieJar.py:109: UserWarning: cookielib bug!
Traceback (most recent call last):
File "C:\Python27\lib\_MozillaCookieJar.py", line 71, in _really_load
line.split("\t")
ValueError: need more than 1 value to unpack
_warn_unhandled_exception()
Traceback (most recent call last):
File "C:\Users\new user\Desktop\pythonprac\working\googlescholar.py", line 10, in <module>
cookies.load()
File "C:\Python27\lib\cookielib.py", line 1763, in load
self._really_load(f, filename, ignore_discard, ignore_expires)
File "C:\Python27\lib\_MozillaCookieJar.py", line 111, in _really_load
(filename, line))
cookielib.LoadError: invalid Netscape format cookies file 'cookies.txt': '.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2'
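I found this code online. I want to download BibTeX data from Google Scholar into a txt file. To do that, I need to save the user's settings in a cookie, and I am keeping that data in cookies.txt. However, I get the error shown above. Please advise how to handle this cookie error, and how to use cookies to save the user-defined preferences on scholar.google.com.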
1 Answer
I suggest you use a different set of libraries.
from bs4 import BeautifulSoup
import requests
url= 'http://scholar.google.co.in/scholar_setprefs?sciifh=1&' +\
'scisig=AAGBfm0AAAAAU9jcmEN2h2yuBuZqQK8Es5dQG3ksjutw' +\
'&inststart=0&num=10&scis=yes&scisf=4&hl=en&lang=all&instq=&save='
page = requests.get(url)
cookies = page.cookies
page = requests.get(url, cookies=cookies)
print page.content
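With the line cookies = page.cookies I grab the cookies returned by the page and store them in the variable cookies. Then I request the same page again and pass that variable along. If you have a cookies.txt file, you can load it as a dictionary.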
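Loading cookies.txt as a dictionary could be sketched roughly as follows. This is only a minimal illustration, assuming the file is in Netscape format (seven tab-separated fields per cookie line); the helper name load_cookie_dict and the throwaway demo file are hypothetical, and the code is written for Python 3.

```python
import os
import tempfile


def load_cookie_dict(path):
    """Parse a Netscape-format cookies.txt into a {name: value} dict.

    Hypothetical helper: assumes seven tab-separated fields per cookie line.
    """
    cookies = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and comment lines such as the header.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) != 7:
                continue  # not a well-formed cookie line; ignore it
            # Field order: domain, flag, path, secure, expires, name, value
            cookies[fields[5]] = fields[6]
    return cookies


# Demo with a throwaway file that holds the cookie from the question.
demo_fields = [".scholar.google.com", "TRUE", "/", "FALSE",
               "2147483647", "GSP", "ID=353e8f974d766dcd:CF=2"]
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("# Netscape HTTP Cookie File\n")
    handle.write("\t".join(demo_fields) + "\n")

cookie_dict = load_cookie_dict(demo_path)
os.remove(demo_path)
print(cookie_dict)
```

The resulting dictionary can then be passed through the same cookies= keyword used above, for example requests.get(url, cookies=cookie_dict).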
If you want to do this with urllib2 and cookielib from the standard library, make sure the first line of cookies.txt is
# Netscape HTTP Cookie File
otherwise cookielib will not be able to load it: https://stackoverflow.com/a/11536599/1688590
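Note also that the traceback in the question fails at line.split("\t"), which suggests the fields of the cookie line may not be separated by tabs. A possible repair step is sketched below, under the assumption that the fields are separated by spaces instead; the helper name normalize_cookie_file is hypothetical. The sketch uses Python 3, where cookielib is named http.cookiejar.

```python
import http.cookiejar  # named cookielib in Python 2
import os
import tempfile

HEADER = "# Netscape HTTP Cookie File"


def normalize_cookie_file(text):
    """Return text with the magic header and tab-separated fields.

    Hypothetical helper: assumes each cookie line has seven fields
    separated by arbitrary whitespace.
    """
    lines = [HEADER]
    for line in text.splitlines():
        stripped = line.strip()
        # Drop blank lines and comments, including any old header.
        # (This also drops "#HttpOnly_" lines that some tools write.)
        if not stripped or stripped.startswith("#"):
            continue
        # Split into at most 7 fields so the cookie value stays in one piece.
        fields = stripped.split(None, 6)
        if len(fields) == 7:
            lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# The cookie line from the error message, with spaces instead of tabs.
broken = (".scholar.google.com TRUE / FALSE 2147483647 "
          "GSP ID=353e8f974d766dcd:CF=2\n")

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write(normalize_cookie_file(broken))

jar = http.cookiejar.MozillaCookieJar(path)
jar.load(ignore_expires=True)  # do not skip the cookie if it has expired
os.remove(path)

loaded = {cookie.name: cookie.value for cookie in jar}
print(loaded)
```

Once the file loads without a LoadError, the jar can be handed to the cookie processor exactly as in the question's code.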