Python BS4 not retrieving results

Posted 2021-05-16 07:43:52

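Using the code below, I can get the "soup" without any problem. My goal is ultimately to get the title out of the soup object, but I'm having trouble figuring out how to reach it. Besides what is shown below, I have also tried various iterations of soup['results'], soup.results, soup.get_text().results, and so on, and I'm not sure how to get there. I could of course call soup.get_text() and then run some kind of search for the string "title", but it feels like there has to be a built-in method for this.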

55)get_title()
     54     ipdb.set_trace()
---> 55     title = soup.html.head.title.string
     56     title = re.sub(r'[^\x00-\x7F]+',' ', title)

ipdb> type(soup)
<class 'bs4.BeautifulSoup'>
ipdb> soup.title
ipdb> print soup.title
None
ipdb> soup
{"status":"OK","copyright":"Copyright (c) 2018 The New York Times Company. All Rights Reserved.","section":"home","last_updated":"2018-01-07T06:19:00-05:00","num_results":42,"results":[{"section":"Briefing","subsection":"",**"title":"Trump, Palestinians, Golden Globes: Your Weekend Briefing"**, ....

Code

from __future__ import division

import regex as re
import string
import urllib2

from bs4 import BeautifulSoup
from cookielib import CookieJar
import ipdb

PARSER_TYPE = 'html.parser'

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE) # This loads fine
    ipdb.set_trace()
    title = soup.html.head.title.string # This is sad
    title = re.sub(r'[^\x00-\x7F]+',' ', title)
    return title
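
The debugger output above suggests that the response body is JSON rather than HTML, which would explain why soup.title prints None: there is no <title> tag for the parser to find, and the "title" values live inside the "results" list. A minimal sketch of reading them with the standard json module follows, assuming the body has the shape printed in the ipdb session. SAMPLE_BODY and get_titles are hypothetical names introduced here for illustration, and the sample is abbreviated to the fields visible in that output.

```python
import json
import re

# Abbreviated sample modeled on the response printed in the debugger session.
# Assumption: a real response has more fields and more entries in "results".
SAMPLE_BODY = (
    '{"status":"OK","section":"home","num_results":1,'
    '"results":[{"section":"Briefing","subsection":"",'
    '"title":"Trump, Palestinians, Golden Globes: Your Weekend Briefing"}]}'
)


def get_titles(body):
    """Return the cleaned title of every entry in the "results" list."""
    data = json.loads(body)  # parse the body as JSON instead of HTML
    titles = []
    for item in data.get("results", []):
        title = item.get("title", "")
        # Same cleanup as in get_title(): replace non-ASCII runs with a space.
        titles.append(re.sub(r'[^\x00-\x7F]+', ' ', title))
    return titles


titles = get_titles(SAMPLE_BODY)
print(titles[0])
```

In get_title(), the same idea would mean passing the result of p.read() to json.loads rather than to BeautifulSoup, provided the URL really does return JSON.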