Web crawler that extracts data from list elements

Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
    soup =BeautifulSoup(li[count])
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]

1 answer

User

#1 · Posted on 2024-04-26 06:09:53

The problem is that some unrelated li tags do not contain the data you need.

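Be more specific. For example, if you want the list of events in the "20th century", first find the heading, then take the event list from the following ul sibling of the heading's parent. Also, not every item in the list has a date in %B %d, %Y format, so you need to handle that with a try/except block: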

import urllib2
from datetime import datetime
from bs4 import BeautifulSoup


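# Download and parse the page (Python 2: urllib2 and the print statement)
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)

# Locate the section heading by its span id, then take the first ul that follows it
events = soup.find('span', id='20th_century').parent.find_next_sibling('ul')
for event in events.find_all('li'):
    try:
        # The text before the first ':' is expected to be the date
        date_string, rest = event.text.split(':', 1)
        print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    except ValueError:
        # No ':' in the item, or the date is not in '%B %d, %Y' format
        print event.text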

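For each item, this prints the date in dd/mm/YYYY format, or the item's full text when the date cannot be parsed.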
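
The date handling in the loop above can be pulled into a small helper, which makes the fallback path easier to see. The following is only a sketch, written for Python 3 with the standard library alone. The name `format_event` and the sample strings are assumptions made for illustration and are not part of the original answer. Unlike the original loop, it also strips whitespace around the date prefix.

```python
from datetime import datetime


def format_event(text):
    """Return the leading date as dd/mm/YYYY, or the text unchanged.

    Hypothetical helper that follows the try/except logic of the loop above.
    """
    try:
        # Unpacking raises ValueError when the text contains no ':'.
        date_string, _rest = text.split(':', 1)
        # strptime raises ValueError when the prefix is not '%B %d, %Y'.
        parsed = datetime.strptime(date_string.strip(), '%B %d, %Y')
        return parsed.strftime('%d/%m/%Y')
    except ValueError:
        # Fall back to the raw text, as the original loop does.
        return text


# Made-up sample items, for illustration only.
samples = [
    "January 11, 1908: an example entry with a full date",
    "1913: an example entry with only a year",
    "an example entry with no colon at all",
]
for item in samples:
    print(format_event(item))
```

Both failure modes (a missing colon and an unparseable date) raise ValueError, which is why a single except clause is enough here.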

Updated version (collects every ul group under a given century):

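# Walk every sibling tag that follows the century heading
events = soup.find('span', id='20th_century').parent.find_next_siblings()
for tag in events:
    # The next h2 marks the start of the following section, so stop there
    if tag.name == 'h2':
        break
    for event in tag.find_all('li'):
        try:
            date_string, rest = event.text.split(':', 1)
            print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
        except ValueError:
            # No ':' in the item, or the date is not in '%B %d, %Y' format
            print event.text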
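
To see how the sibling walk stops at the next section, here is a self-contained Python 3 sketch that runs against a small inline HTML fragment instead of the live page. The markup, the `collect_items` name, and the item texts are all assumptions made for illustration. The real Wikipedia page may be structured differently, so the selector may need adjusting.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the assumed layout: an h2 wrapping a span,
# followed by sub-headings and ul lists, then the next h2 section.
HTML = """
<div>
<h2><span id="20th_century">20th century</span></h2>
<h3>1900s</h3>
<ul><li>first item</li><li>second item</li></ul>
<h3>1910s</h3>
<ul><li>third item</li></ul>
<h2><span id="21st_century">21st century</span></h2>
<ul><li>item from the next section</li></ul>
</div>
"""


def collect_items(soup, span_id):
    """Collect li texts that follow the heading, up to the next h2."""
    heading = soup.find('span', id=span_id).parent
    items = []
    # With no arguments, find_next_siblings() yields only tags, not text nodes.
    for tag in heading.find_next_siblings():
        if tag.name == 'h2':
            break  # the next top-level section starts here
        items.extend(li.get_text() for li in tag.find_all('li'))
    return items


soup = BeautifulSoup(HTML, 'html.parser')
print(collect_items(soup, '20th_century'))
```

Tags without li descendants, such as the h3 sub-headings, contribute nothing, so the loop needs no special case for them.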
