Extracting specific text from repeated tags with BeautifulSoup

3 answers

User

#1 · Edited 2024-04-26 22:10:31

The following code does what you are after:

import urllib.request
from bs4 import BeautifulSoup

url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)

soup = BeautifulSoup(f, 'html.parser')
# Locate the <b>Description:</b> label; string= replaces the deprecated text= argument
label = soup.find("b", string="Description:")
parent = label.parent
# Remove the label so that only the description text is left in the paragraph
label.decompose()
print(parent.text.strip())

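I parse the page with BeautifulSoup, find the paragraph that holds the "Description:" label, and remove the label itself before printing.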
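
The same find-then-decompose idea can be tried without touching the network. The following is only a minimal sketch: the helper name extract_labeled_text and the shortened sample markup are assumptions made for illustration, not part of the answer above. It also returns None when the label is missing, instead of failing on .parent.

```python
from bs4 import BeautifulSoup


def extract_labeled_text(html, label):
    """Return the text that follows <b>label</b> inside its parent, or None.

    Hypothetical helper for illustration; not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("b", string=label)
    if tag is None:
        # The label does not occur on this page
        return None
    parent = tag.parent
    # Remove the label so only the value is left in the parent element
    tag.decompose()
    return parent.get_text().strip()


# Shortened sample modelled on the page structure quoted in this thread
sample = """
<p><b>Author:</b> Doré, Gustave, 1832-1883</p>
<p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.</p>
"""

print(extract_labeled_text(sample, "Description:"))
```

Because the label is passed in as an argument, the same call works for "Author:" or any other label that is wrapped in a <b> tag.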

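User

#2 · Edited 2024-04-26 22:10:31

In this case we can use CSS selectors. Assuming you are using Beautiful Soup 4.7+, CSS selector support is provided by the soupsieve library. We first use the CSS Level 4 :has() selector to find <p> tags that have a direct child <b> tag, then use soupsieve's non-standard :contains selector to make sure the <b> tag contains "Description:". After that, we simply print the content of every matching element, stripping leading and trailing whitespace and removing "Description:". Keep in mind that there are several ways to do this; this is just the one I chose for illustration: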

import bs4

markup = """
</div>
<div class="col-sm-6">
<P>
    <b>Book Title:</b>
    <A HREF="book_detail.cfm?ID=2449">The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré</a>
</p>


    <P>
        <b>Author:</b> Doré, Gustave, 1832-1883
    </p>

    <P>
        <b>Image Title:</b> Baptism of Jesus
    </p>

    <P>
        <b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul>
    </p>

        <P>
            <b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
        </P>


    <P>
        <A HREF="book_list.cfm?ID=2449">Click here
        </a> for additional images available from this book.
    </P>

    <p>For information on licensing this image, please send an email, including a link to the image, to 
        <a href="mailto:dia@emory.edu?subject=Licensing%20Image%20From%20DIA - 17250">dia@emory.edu</a>
    </p>


</div>
"""

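soup = bs4.BeautifulSoup(markup, "html.parser")

# Newer soupsieve releases (2.1+) prefer the name :-soup-contains over :contains
for el in soup.select('p:has(> b:contains("Description:"))'):
    # strip() without arguments trims surrounding whitespace; strip('') trims nothing
    print(el.get_text().strip().replace('Description: ', ''))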

Output:

John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
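
Since every field on the page follows the same <p><b>Label:</b> value</p> pattern, the :has() selector can also collect all of them in one pass. This is only a sketch under that assumption: the function name labeled_fields and the shortened sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag


def labeled_fields(html):
    """Map each <b> label to the text that follows it in the same <p>.

    Hypothetical helper; assumes the <p><b>Label:</b> value</p> layout.
    """
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    # Only paragraphs with a direct <b> child are considered
    for p in soup.select("p:has(> b)"):
        b = p.find("b", recursive=False)
        label = b.get_text(strip=True).rstrip(":")
        # Join everything after the label: plain strings and nested tags alike
        value = "".join(
            sib.get_text() if isinstance(sib, Tag) else str(sib)
            for sib in b.next_siblings
        )
        # Collapse runs of whitespace left over from the page's indentation
        fields[label] = " ".join(value.split())
    return fields


# Shortened sample modelled on the markup above
sample = """
<p><b>Book Title:</b> <a href="book_detail.cfm?ID=2449">The Holy Bible</a></p>
<p><b>Author:</b> Doré, Gustave, 1832-1883</p>
<p><b>Image Title:</b> Baptism of Jesus</p>
<p>No label here.</p>
"""

for name, text in labeled_fields(sample).items():
    print(f"{name}: {text}")
```

Reading the siblings after the label avoids the replace('Description: ', '') step, so the result does not depend on the exact spacing after the colon.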

User

#3 · Edited 2024-04-26 22:10:31

I collected all the <p> tags and picked the one at index [4]. I'm only a beginner, but it worked.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pitts.emory.edu/dia/image_details.cfm?ID=17250")

soup = BeautifulSoup(html, 'html.parser')
page = soup.find_all('p')[4].getText()

print(page)
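
The index-based lookup only works while the page keeps its paragraphs in exactly this order. As a hedged alternative, the paragraphs can be scanned for the one whose text starts with the label. The helper name paragraph_starting_with and the sample markup below are assumptions for illustration, not part of this answer.

```python
from bs4 import BeautifulSoup


def paragraph_starting_with(html, prefix):
    """Return the text of the first <p> that starts with prefix, minus the prefix.

    Hypothetical helper; returns None when no paragraph matches.
    """
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text.startswith(prefix):
            # Cut off the label and any whitespace that follows it
            return text[len(prefix):].strip()
    return None


# Shortened sample modelled on the page structure quoted in this thread
sample = """
<p><b>Image Title:</b> Baptism of Jesus</p>
<p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.</p>
"""

print(paragraph_starting_with(sample, "Description:"))
```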
