使用Python从IMDB网站抓取热门票房列表

-2 投票
3 回答
597 浏览
提问于 2025-04-18 08:01

网址:http://www.imdb.com/chart/?ref_=nv_ch_cht_2

我想让你从上面的网站上打印出最新的票房排行榜(包括所有电影的排名、标题、周末票房、总票房和上映周数,按顺序排列)

示例输出:
排名:1
标题:哥斯拉
周末票房:$93.2M
总票房:$93.2M
上映周数:1
排名:2
标题:邻居

3 个回答

0

下面这个Python脚本可以帮你获取:
1) 来自IMDb的票房最高电影列表
2) 以及每部电影的演员名单

from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])

    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total

    movies = {}

    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()

        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')

        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)

        movies[i] = mo

    return movies


if __name__ == '__main__':

    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))

    for k,v in bo_movies.iteritems():
        print '#'+str(k+1)+'  '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'


输出(在终端运行时):

parag@parag-innovate:~/python$ python imdb_bo_scraper.py 
Enter no. of Box office movies to display:3
#1  Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden


#2  Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski


#3  Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
0

其实不需要去抓取什么东西。你可以看看我在这里给的答案。

如何从IMDB商业页面抓取数据?

1

这只是一个简单的方法,用来通过 BeautifulSoup 提取那些实体。

from bs4 import BeautifulSoup                                          
import urllib2                                                         

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"                    

data = urllib2.urlopen(url).read()                                     
page = BeautifulSoup(data, 'html.parser')                              

rows = page.findAll("tr", {'class': ['odd', 'even']}) 

for tr in rows:             
    for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn','ratingColumn']}):
        print data.get_text()

附注:可以根据自己的需要进行调整。

撰写回答