Scraping the New York Times with BeautifulSoup

0 votes
1 answer
1875 views
Asked 2025-04-17 21:34

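I'm trying to scrape articles from the New York Times, but I keep running into a pile of errors. I'm wondering whether anyone can point me in the right direction. Below are the link to the article I want to scrape, my code, and the console output. Any help would be greatly appreciated.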

Article link: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0

import urllib2
from bs4 import BeautifulSoup
import re

# URL of the article to scrape
url = "http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0"

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

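# Write the article title to the file
# get_text() also works when the <h1> contains nested tags (where .string is None)
title = soup.find("h1")
txt.write('\n' + "Title: " + title.get_text().encode('utf-8') + '\n' + '\n')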

# Write the article date to the file
try:
    date = soup.find("span", {'class':'dateline'}).text
    txt.write("Date: " + date.encode('utf-8') + '\n' + '\n')
except AttributeError:
    # soup.find() returned None, so there is no .text to read
    print "Could not find the date!"

# Write the article author to the file
try:
    byline = soup.find("p", {'class':'byline-author'}).text
    txt.write("Author: " + byline.encode('utf-8') + '\n' + '\n')
except AttributeError:
    # soup.find() returned None, so there is no .text to read
    print "Could not find the author!"

# Write the article location to the file
# (separate variable names so the author byline above is not overwritten)
location_pattern = re.compile('<span class="location">(.+?)</span>')
locations = location_pattern.findall(str(soup))
txt.write("Location: " + ', '.join(locations) + '\n' + '\n')

# Write the text of every paragraph tag to the file.
# txt is already open; reopening the file in 'w' mode here would truncate it.
for tag in soup.find_all('p'):
    txt.write(tag.text.encode('utf-8') + '\n' + '\n')

# Close txt file with new content added
txt.close()

Sample output from console: 
andrews-mbp-3:CTP Andrew$ python idle_test.py
Please enter a valid URL: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0
Traceback (most recent call last):
  File "idle_test.py", line 20, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url).read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 442, in error
    result = self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)

1 Answer

2

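The list of errors (also called a traceback) shows that the first error occurs on line 20, where you call urllib2.urlopen. So check what you are passing to that function. Your url variable should be a string, but it is not wrapped in quotes, which makes me wonder why the code didn't raise an error sooner.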
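To make the traceback advice concrete, here is a minimal sketch in Python 3 (the question's code is Python 2, so treat it as an illustration, not a drop-in fix). The names `summarize_exception`, `fetch`, `scrape`, and `run_example` are hypothetical and exist only for this example; `fetch` merely stands in for a call such as `urlopen`. It shows that the first frame of a traceback points at the calling code, while the last frame and the final message say what actually went wrong.

```python
import traceback


def summarize_exception(exc):
    """Condense an exception into the parts of its traceback worth reading first."""
    frames = traceback.extract_tb(exc.__traceback__)
    outermost = frames[0]   # where the failing call started (usually your own script)
    innermost = frames[-1]  # where the exception was actually raised
    return {
        "error": "{}: {}".format(type(exc).__name__, exc),
        "started_in": outermost.name,
        "started_at_line": outermost.lineno,
        "raised_in": innermost.name,
    }


def fetch(url):
    # Hypothetical stand-in for urlopen(url): rejects anything that is not a string.
    if not isinstance(url, str):
        raise TypeError("url must be a string, got " + type(url).__name__)
    return url


def scrape(url):
    # One extra layer, so the traceback has more than one frame below the caller.
    return fetch(url)


def run_example():
    try:
        scrape(None)  # deliberately pass a non-string value
    except TypeError as exc:
        return summarize_exception(exc)
```

Calling `run_example()` returns a dictionary whose `error` entry matches the last line of the printed traceback. That last line is missing from the output in the question, and it is usually the most useful one to read first.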

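I said "first error" because when you start writing code (this is true for most programmers, and especially new ones) it will contain many errors. Learning to program is, in many ways, learning to understand the error messages (tracebacks) the computer gives you.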

Update

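You have just changed the definition of url to a call to raw_input. Please don't do that, because it makes the code harder to read and debug. urllib2 is having trouble with the url variable, and obscuring its value only makes the problem harder to track down. From experience, I suspect that something like including (or omitting) the http prefix is tripping you up, but without seeing the contents of url I can only guess.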
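One way to take the guesswork out of this is to inspect the value before handing it to the HTTP library. The following is a rough Python 3 sketch, not part of the original code: `url_problems` is a hypothetical helper name, and the checks (string type, stray whitespace, `http`/`https` prefix, host name) are only the ones that seemed relevant here.

```python
from urllib.parse import urlparse


def url_problems(url):
    """Return a list of human-readable problems found in url (empty if none)."""
    if not isinstance(url, str):
        return ["not a string: {!r}".format(url)]
    problems = []
    # Whitespace often sneaks in when a URL is pasted at an input prompt.
    if any(ch.isspace() for ch in url):
        problems.append("contains whitespace")
    parts = urlparse(url.strip())
    if parts.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// prefix")
    elif not parts.netloc:
        problems.append("missing host name")
    return problems
```

Printing `repr(url)` together with `url_problems(url)` just before the `urlopen` call makes the exact value visible, which is the point of hard-coding the URL while debugging.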
