How do I select the second div from this code when it has no identifying attributes?

Published 2024-04-25 05:32:05


I don't understand what I need to do to get the second div inside the outer div with bs4. I want to extract the date from that div. Thanks for your help.

The code is as follows:

<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>

3 answers

With bs4 4.7.1+ this is easy. You can use :has and :contains to match the div whose child strong contains the string Published:, then use the adjacent sibling combinator (+) to get the next div.

from bs4 import BeautifulSoup

html = '''
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')  # the import above binds BeautifulSoup, not bs
print(soup.select_one('div:has(strong:contains("Published:")) + div').text)
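Newer Soup Sieve releases (2.1 and later) deprecate :contains in favor of :-soup-contains. The following is a minimal sketch of the same idea with the newer name, assuming such a version is installed. The HTML is trimmed to the relevant part, html.parser is used so that lxml is not required, and the child combinator (>) is an addition that restricts the match to a direct strong child.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the rest is irrelevant here).
html = """
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the div whose direct child <strong> contains "Published:",
# then step to the div that immediately follows it.
date_div = soup.select_one('div:has(> strong:-soup-contains("Published:")) + div')

# select_one returns None when nothing matches, so guard before reading the text.
published = date_div.get_text(strip=True) if date_div else None
print(published)
```

The None check avoids an AttributeError on pages where the Published: label is missing.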

Here is a workaround:

from bs4 import BeautifulSoup

text = '<div class="featured-item-meta">\
<div><strong>Published:</strong></div>\
<div>October 14, 2015</div>\
<ul class="creatorList">\
    <li>\
        <div><strong>Writer:</strong></div>\
        <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>\
    </li>\
    <li>\
        <div><strong>Cover Artist:</strong></div>\
        <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>\
    </li>\
</ul>\
</div>'

soup = BeautifulSoup(text, 'html.parser')

# find_all('div') returns descendants in document order, so index 1 is the date div
print(soup.find('div', attrs={'class': 'featured-item-meta'})
          .find_all('div')[1].text)

Output:

October 14, 2015

See the bs4 documentation for details.
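If the date is always the second div directly inside the container, a positional CSS selector is another option. This is only a sketch under that assumption: it breaks if another div is ever inserted before the date. The HTML is again trimmed to the relevant part.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList"></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "> div" limits the match to direct children of the container;
# :nth-of-type(2) picks the second div among those siblings.
second_div = soup.select_one("div.featured-item-meta > div:nth-of-type(2)")
second_text = second_div.get_text(strip=True)
print(second_text)
```

Unlike find_all('div')[1], this counts only direct children, so divs nested inside the first child would not shift the result.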

Find the div containing the text Published:, then use find_next('div') to get the date.

from bs4 import BeautifulSoup
html='''<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Writer:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/10329/g_willow_wilson">G. Willow Wilson</a>, <a href="https://www.marvel.com/comics/creators/12441/marguerite_bennett">Marguerite  Bennett</a></div>
        </li>
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument (bs4 4.4.0+)
datetext = soup.find('div', string='Published:').find_next('div').text
print(datetext)

Output:

October 14, 2015
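The same label-then-next-div idea can be wrapped in a small helper so that it also works for the other labels in the markup. The function name value_after_label is hypothetical and not part of bs4; this is a sketch that assumes each label sits in a strong inside its own div, with the value in the next sibling div.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="featured-item-meta">
    <div><strong>Published:</strong></div>
    <div>October 14, 2015</div>
    <ul class="creatorList">
        <li>
            <div><strong>Cover Artist:</strong></div>
            <div><a href="https://www.marvel.com/comics/creators/8825/jorge_molina">Jorge  Molina</a></div>
        </li>
    </ul>
</div>
"""


def value_after_label(soup, label):
    """Hypothetical helper: text of the div that follows the div holding <strong>label</strong>."""
    strong = soup.find("strong", string=label)
    if strong is None:
        return None  # label not present on this page
    value_div = strong.find_parent("div").find_next_sibling("div")
    if value_div is None:
        return None  # label present, but no value div after it
    # Collapse runs of whitespace such as the double space in "Jorge  Molina".
    return " ".join(value_div.get_text().split())


soup = BeautifulSoup(html, "html.parser")
print(value_after_label(soup, "Published:"))
print(value_after_label(soup, "Cover Artist:"))
```

find_next_sibling is used here instead of find_next so the lookup stays on the same nesting level as the label.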
