How can I scrape the movie "description" from the IMDB website using BeautifulSoup?

Posted 2024-05-08 16:16:36


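I am using BeautifulSoup to scrape movies from the IMDB website. I have successfully scraped the name, genre, duration, and rating of each movie, but I am unable to scrape the description. When I look at the classes, the description sits in an element with the class "text-muted", and that same class is used several times to hold other data such as the rating, genre, and duration. Those other items also have inner classes, which makes them easy to scrape, but the description has no inner class. As a result, selecting by "text-muted" alone also returns the other data. How can I get the description of the movie?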

Code and a screenshot are attached for reference. The red-marked area shows the class name of the description and of the strip below the movie name.

The sample code I use to scrape the genre is as follows:

genre_tags = data.select(".text-muted .genre")
genre = [g.get_text() for g in genre_tags]
# Strip whitespace and skip entries that are empty after stripping
Genre = [item.strip() for item in genre if item.strip()]
print(Genre)

2 answers

You can use this :) If it helps, please mark the question as solved. Thanks.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

URL = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm'  # Most Popular Movies chart on IMDB

session = HTMLSession()  # reuse one session for every request
page = BeautifulSoup(session.get(URL).html.html, 'html.parser')

table = page.find('tbody', 'lister-list')  # table body of Most Popular Movies
for cell in table.find_all('td', 'titleColumn'):
    href = 'https://www.imdb.com' + cell.find('a').get('href')  # link to each movie page
    moviepage = BeautifulSoup(session.get(href).html.html, 'html.parser')  # request and parse the movie page
    heading = list(moviepage.find('h1').stripped_strings)
    title = heading[0]  # parse title
    year = heading[2]  # parse year
    try:
        score = list(moviepage.find_all('div', 'ratingValue')[0].stripped_strings)[0]  # parse score if available
    except IndexError:
        score = '-'  # '-' is filled in when no score is available
    description = list(moviepage.find_all('div', 'summary_text')[0].stripped_strings)[0]  # parse description
    print(f'TITLE: {title}      YEAR: {year}       SCORE: {score}\nDESCRIPTION:{description}\n')
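
The code above guards only the score against a missing element. As a minimal sketch, the same fallback could be applied to every field with one small helper. The helper name `first_text` and the inline `SAMPLE_HTML` are hypothetical and stand in for a real movie page; the class names are the ones used in the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a movie page that has no rating block.
SAMPLE_HTML = """
<h1>Example Movie <span>(2021)</span></h1>
<div class="summary_text">  A short plot summary.  </div>
"""


def first_text(soup, tag, css_class, default='-'):
    """Return the first stripped string of the first matching element, or default."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        return default
    # stripped_strings is a generator; next() falls back to default if it is empty
    return next(element.stripped_strings, default)


moviepage = BeautifulSoup(SAMPLE_HTML, 'html.parser')
score = first_text(moviepage, 'div', 'ratingValue')         # element missing, so default
description = first_text(moviepage, 'div', 'summary_text')  # element present
print(f'SCORE: {score}\nDESCRIPTION: {description}')
```

With a helper like this, a page that lacks a description would yield '-' rather than raising IndexError.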
    


In general, lxml works much better than BeautifulSoup for this.

import requests
from lxml import html

url = "xxxx"

r = requests.get(url)
tree = html.fromstring(r.text)

# Each movie entry on the list page
rows = tree.xpath('//div[@class="lister-item mode-detail"]')

for row in rows:
    # The description is the text-muted paragraph that follows the ratings bar
    description = row.xpath('.//div[@class="ratings-bar"]/following-sibling::p[@class="text-muted"]/text()')[0].strip()
    print(description)
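
For anyone who prefers to stay with BeautifulSoup, the same idea (take the "text-muted" paragraph that follows the ratings bar) can probably be expressed with a CSS sibling selector. This is only a sketch: `SAMPLE_HTML` is a hypothetical fragment modelled on the structure the XPath above expects, not actual IMDB markup. Note that `p.text-muted` matches any paragraph carrying that class, whereas the XPath compares the whole class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure the XPath above expects.
SAMPLE_HTML = """
<div class="lister-item mode-detail">
  <p class="text-muted">
    <span class="runtime">120 min</span>
    <span class="genre"> Drama </span>
  </p>
  <div class="ratings-bar"><strong>7.5</strong></div>
  <p class="text-muted">
    A short plot summary.
  </p>
  <p class="text-muted">Votes: <span name="nv">1,000</span></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
descriptions = []
for row in soup.select('div.lister-item.mode-detail'):
    # First p.text-muted that comes after div.ratings-bar under the same parent
    tag = row.select_one('div.ratings-bar ~ p.text-muted')
    descriptions.append(tag.get_text(strip=True) if tag else '')
print(descriptions)
```

The `~` combinator only matches siblings that come after the ratings bar, so the earlier "text-muted" paragraph holding the runtime and genre is skipped, and `select_one` returns the first remaining match.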
