用BeautifulSoup提取<Script的内容 - 问答 - Python中文网

用BeautifulSoup提取<Script的内容

2024-05-14 16:11:05 发布

您现在位置：Python中文网/ 问答频道 /正文

男 | 程序猿一只，喜欢编程写python代码。

我试着用靓汤提取剧本的一部分，但它什么也没印出来。怎么了？

URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

for script in soup("script"):
        script.extract()

list_of_scripts = soup.findAll("script")
print list_of_scripts

2/目标是提取属性“transcript”的值：

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
        "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",  
        "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"Immediately my whole mouth was on fire.\"               The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee.\"               Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.               The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.               SOUNDBITE: JIM HARDING, HUSBAND, SAYING:               \"Bottom line, so no one ever has to go through this again.\"               The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",

Tags： of the in for type script restaurant soup

1条回答

网友

1楼 · 发布于 2024-05-14 16:11:05

extract从dom中删除标记。这就是为什么你会得到空名单。

使用type="application/ld+json"属性查找script，并使用json.loads对其进行解码。然后，您可以访问类似Python数据结构的数据。（给定数据的dict）

import json
import urllib2

from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']

相关问题更多 >

编程相关推荐

热门问题

热门文章