Importing data (BibTeX) from Google Scholar using cookies

Here is the code:

import cookielib
import urllib2 
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0'}
url='http://scholar.google.co.in/scholar_setprefs?sciifh=1&scisig=AAGBfm0AAAAAU9jcmEN2h2yuBuZqQK8Es5dQG3ksjutw&inststart=0&num=10&scis=yes&scisf=4&hl=en&lang=all&instq=&save='

filename = "cookies.txt"
request = urllib2.Request(url, None, headers)
cookies = cookielib.MozillaCookieJar(filename, None, None)
cookies.load()
cookie_handler= urllib2.HTTPCookieProcessor(cookies)
redirect_handler= urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)
response = opener.open(request)
print response.read()

The error output:

C:\Python27\lib\_MozillaCookieJar.py:109: UserWarning: cookielib bug!
Traceback (most recent call last):
  File "C:\Python27\lib\_MozillaCookieJar.py", line 71, in _really_load
    line.split("\t")
ValueError: need more than 1 value to unpack

  _warn_unhandled_exception()
Traceback (most recent call last):
  File "C:\Users\new user\Desktop\pythonprac\working\googlescholar.py", line 10, in <module>
    cookies.load()
  File "C:\Python27\lib\cookielib.py", line 1763, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python27\lib\_MozillaCookieJar.py", line 111, in _really_load
    (filename, line))
cookielib.LoadError: invalid Netscape format cookies file 'cookies.txt': '.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2'

I found this code online. I want to download the BibTeX data from Google Scholar into a txt file, and for that I need to save the user's settings in a cookie. I am writing the data to cookies.txt, but I am getting the error above. Please guide me on how to handle this cookie error, and on how to use cookies to save the preferences the user has defined on scholar.google.com.

1 Answer


I would suggest using a different set of libraries.

from bs4 import BeautifulSoup  # imported for parsing the page later; not used in this snippet
import requests

# Google Scholar preferences URL taken from the question
url = 'http://scholar.google.co.in/scholar_setprefs?sciifh=1&' +\
      'scisig=AAGBfm0AAAAAU9jcmEN2h2yuBuZqQK8Es5dQG3ksjutw' +\
      '&inststart=0&num=10&scis=yes&scisf=4&hl=en&lang=all&instq=&save='

# First request: Google Scholar sets its cookies on the response
page = requests.get(url)
cookies = page.cookies

# Second request: send the cookies back so the saved preferences apply
page = requests.get(url, cookies=cookies)

print page.content

With the line cookies = page.cookies I grab the cookies returned with the page and store them in the cookies variable. Then I request the same page again and pass that variable along. If you already have a cookies.txt file, you can load it as a dictionary, as sketched below.
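
A minimal sketch of that idea, assuming cookies.txt is a valid Netscape-format file (with the header line discussed below): load it with cookielib, convert it to a plain dict, and hand it to requests. The URL is the same one used above.

import cookielib
import requests

# Load the Netscape-format cookie file; ignore_discard/ignore_expires keep
# session cookies and already-expired entries instead of dropping them
jar = cookielib.MozillaCookieJar('cookies.txt')
jar.load(ignore_discard=True, ignore_expires=True)

# Convert the jar into a plain {name: value} dict for requests
cookie_dict = requests.utils.dict_from_cookiejar(jar)

url = 'http://scholar.google.co.in/scholar_setprefs?sciifh=1&' +\
      'scisig=AAGBfm0AAAAAU9jcmEN2h2yuBuZqQK8Es5dQG3ksjutw' +\
      '&inststart=0&num=10&scis=yes&scisf=4&hl=en&lang=all&instq=&save='
page = requests.get(url, cookies=cookie_dict)
print page.content

Note that requests will also accept the MozillaCookieJar object itself via cookies=jar, so the dict conversion is optional.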


If you want to do this with urllib2 and cookielib from the standard library, make sure the first line of cookies.txt is

# Netscape HTTP Cookie File

otherwise cookielib will refuse to load it: https://stackoverflow.com/a/11536599/1688590
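
For reference, using the exact cookie line quoted in the error message above, a cookies.txt that cookielib accepts looks like this (fields are tab-separated: domain, include-subdomains flag, path, secure flag, expiry, name, value):

# Netscape HTTP Cookie File
.scholar.google.com	TRUE	/	FALSE	2147483647	GSP	ID=353e8f974d766dcd:CF=2

With that header line in place, the cookies.load() call in the original code should succeed.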
