Beautiful Soup in Python: getting the n-th tag of a type

User

#1 · edited 2024-05-12 12:59:25

Here is my version:

# Import bs4
from bs4 import BeautifulSoup

# Read your HTML
# html_doc = your HTML string

# Get the BS4 object ("lxml" requires the lxml package to be installed)
soup = BeautifulSoup(html_doc, "lxml")

# Find the next sibling table of the h3 header with text "THE GOOD STUFF"
# (string= is the current name of the older text= argument)
the_good_table = soup.find(name='h3', string='THE GOOD STUFF').find_next_sibling(name='table')

# Find the second tr in your table
your_tr = the_good_table.find_all(name='tr')[1]

# Find the text value of the first td in your tr
your_string = your_tr.td.text

print(your_string)

Output:

I WANT THIS STRING
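
The snippet above leaves html_doc undefined, so here is a minimal self-contained sketch of the same lookup. The sample page is hypothetical: only the heading text and the target string come from the answer, the rest is invented for illustration. It uses the built-in "html.parser" so that lxml does not need to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; only the heading text and the target string
# come from the answer above, everything else is made up.
html_doc = """
<h3>OTHER STUFF</h3>
<table>
  <tr><td>header a</td></tr>
  <tr><td>not this one</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table>
  <tr><td>header b</td></tr>
  <tr><td>I WANT THIS STRING</td><td>second cell</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Locate the heading by its exact text, then the first <table> sibling after it.
heading = soup.find("h3", string="THE GOOD STUFF")
the_good_table = heading.find_next_sibling("table")

# Index 1 is the second <tr>; .td is the first <td> inside that row.
your_string = the_good_table.find_all("tr")[1].td.get_text(strip=True)
print(your_string)
```

Anchoring on the heading first means the row index is counted inside the right table, no matter how many other tables come before it on the page.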

User

#2 · edited 2024-05-12 12:59:25

To get the second table from a call to soup.find_all('table') (findAll in older code), treat the result as a list and index it:

secondtable = soup.find_all('table')[1]
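
Plain indexing raises IndexError when the page has fewer tables than expected. Below is a small sketch of a guarded lookup. Note that nth_tag is a hypothetical helper name, not part of BeautifulSoup, and the sample markup is made up.

```python
from bs4 import BeautifulSoup


def nth_tag(soup, name, n):
    """Return the n-th (1-based) tag called `name` in document order, or None."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # limit=n stops the search once n matches have been collected.
    matches = soup.find_all(name, limit=n)
    return matches[n - 1] if len(matches) == n else None


# Hypothetical sample markup for illustration.
html_doc = "<table id='a'></table><table id='b'></table>"
soup = BeautifulSoup(html_doc, "html.parser")

second = nth_tag(soup, "table", 2)
third = nth_tag(soup, "table", 3)
print(second["id"])
print(third)
```

Here second is the table with id 'b', and third is None because the document only contains two tables.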

User

#3 · edited 2024-05-12 12:59:25

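Martijn Pieters' answer does work. That said, I have run into nested table tags before: I simply took the second table in the list without paying attention, and it broke my code.

Taking the n-th element of a find_all result can go wrong, because find_all searches all descendants. It is safer to locate the first element you want and make sure the n-th element is actually a sibling of that element rather than a child of it.

You can use find_next_sibling() to guard your code.
You can find the parent first and then use find_all(recursive=False) to restrict the search to its direct children.

Just in case, here is my code (using recursive=False).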

from bs4 import BeautifulSoup

text = """
<html>
    <head>
    </head>
    <body>
        <table>
            <p>Table1</p>
            <table>
                <p>Extra Table</p>
            </table>
        </table>
        <table>
            <p>Table2</p>
        </table>
    </body>
</html>
"""

# Name the parser explicitly: "html.parser" keeps the nested markup as written,
# while other parsers may restructure this invalid HTML.
soup = BeautifulSoup(text, "html.parser")

tables = soup.find('body').find_all('table')
print(len(tables))
print(tables[1].text.strip())
# 3
# Extra Table   <- not the table you want, and there is no warning

tables = soup.find('body').find_all('table', recursive=False)
print(len(tables))
print(tables[1].text.strip())
# 2
# Table2   <- your desired output
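
The other option mentioned above, find_next_sibling(), can be sketched the same way. This is a minimal example that assumes the same nested layout (trimmed down) and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Same nested layout as above, trimmed down.
text = """
<body>
    <table><p>Table1</p><table><p>Extra Table</p></table></table>
    <table><p>Table2</p></table>
</body>
"""

soup = BeautifulSoup(text, "html.parser")

first = soup.find("body").find("table")

# find_next_sibling() only walks siblings, so the nested table is skipped.
second = first.find_next_sibling("table")
print(second.get_text(strip=True))

# find_next_siblings() returns every later sibling table;
# index 0 is therefore the second top-level table.
later = first.find_next_siblings("table")
print(len(later))
```

With this input, second is the table containing "Table2", and later holds exactly one table, since the nested "Extra Table" is a child of the first table and not its sibling.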
