Getting links from HTML after a specific point

Posted 2024-04-26 19:04:22


Consider the following HTML:

<html>
    <body>
        <p class="fixedfonts">
            <a href="A.pdf">LINK1</a>
        </p>

        <h2>Results</h2>

        <p class="fixedfonts">
            <a href="B.pdf">LINK2</a>
        </p>

        <p class="fixedfonts">
            <a href="C.pdf">LINK3</a>
        </p>
    </body>
</html>

It contains three links. However, I only need to retrieve the links that come after the Results heading.

I am using Python with BeautifulSoup:

from bs4 import BeautifulSoup, SoupStrainer

# at this point html contains the code as string

# parse the HTML file
soup = BeautifulSoup(html.replace('\n', ''), parse_only=SoupStrainer('a'))

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))

print(links)

With the code above I get every link in the document, but as I said, I only need the ones that come after the Results heading.

Thanks for any guidance.


3 answers

Split the HTML string into two parts, before and after "Results", then process only the second part:

data = html.split("Results")
need = data[1]

So the implementation is simply:

from bs4 import BeautifulSoup, SoupStrainer
data = html.split("Results")
need = data[1]
soup = BeautifulSoup(need.replace('\n', ''), parse_only=SoupStrainer('a'))
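
The snippet above stops once the soup is built. A minimal sketch of the remaining step follows, assuming the word "Results" first appears in the heading itself and nowhere earlier in the markup. The helper name links_after_marker and the sample string are hypothetical, not part of the original answer.

```python
from bs4 import BeautifulSoup, SoupStrainer


def links_after_marker(html, marker="Results"):
    """Return the href of every <a> tag found after the first occurrence of marker."""
    # partition splits on the first occurrence only; `found` is "" if marker is absent
    _, found, tail = html.partition(marker)
    if not found:
        return []
    # Parse only the <a> tags of the text that follows the marker
    soup = BeautifulSoup(tail, "html.parser", parse_only=SoupStrainer("a"))
    return [a["href"] for a in soup.find_all("a", href=True)]


# Hypothetical sample, condensed from the HTML in the question
sample = (
    '<p><a href="A.pdf">LINK1</a></p>'
    "<h2>Results</h2>"
    '<p><a href="B.pdf">LINK2</a></p>'
    '<p><a href="C.pdf">LINK3</a></p>'
)
print(links_after_marker(sample))
```

Splitting on plain text is fragile: if "Results" also appears earlier in the page (in a title or an attribute, for example), the cut lands in the wrong place.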

You can use the find_all_next() method to solve this:

results = soup.find("h2", text="Results")
for link in results.find_all_next("a"):
    print(link.get("href"))

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <html>
...     <body>
...         <p class="fixedfonts">
...             <a href="A.pdf">LINK1</a>
...         </p>
... 
...         <h2>Results</h2>
... 
...         <p class="fixedfonts">
...             <a href="B.pdf">LINK2</a>
...         </p>
... 
...         <p class="fixedfonts">
...             <a href="C.pdf">LINK3</a>
...         </p>
...     </body>
... </html>"""
>>> 
>>> soup = BeautifulSoup(data, "html.parser")
>>> results = soup.find("h2", text="Results")
>>> for link in results.find_all_next("a"):
...     print(link.get("href"))
... 
B.pdf
C.pdf
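
The same idea can be wrapped in a small function that also handles a missing heading, since soup.find() returns None when nothing matches. This is only a sketch: the function name links_after_heading and the sample string are hypothetical, and it uses string=, the newer name for the text= argument shown above.

```python
from bs4 import BeautifulSoup


def links_after_heading(html, heading="Results"):
    """Return the href of every <a> tag that follows the <h2> whose text is heading."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches an <h2> whose only content is exactly this text
    h2 = soup.find("h2", string=heading)
    if h2 is None:
        # No such heading: avoid calling a method on None
        return []
    # find_all_next walks everything after the heading in document order;
    # href=True skips anchors that have no href attribute
    return [a["href"] for a in h2.find_all_next("a", href=True)]


# Hypothetical sample, condensed from the HTML in the question
sample = (
    '<p><a href="A.pdf">LINK1</a></p>'
    "<h2>Results</h2>"
    '<p><a href="B.pdf">LINK2</a></p>'
    '<p><a href="C.pdf">LINK3</a></p>'
)
print(links_after_heading(sample))
```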

Tested, and it seems to work.

from bs4 import BeautifulSoup, SoupStrainer

html = '''<html>
<body>
    <p class="fixedfonts">
        <a href="A.pdf">LINK1</a>
    </p>

    <h2>Results</h2>

    <p class="fixedfonts">
        <a href="B.pdf">LINK2</a>
    </p>

    <p class="fixedfonts">
        <a href="B.pdf">LINK2</a>
    </p>

    <p class="fixedfonts">
        <a href="C.pdf">LINK3</a>
    </p>
</body>
</html>'''

# at this point html contains the code as string

# parse the HTML file
dat = html.split("Result")
need = dat[1]
soup = BeautifulSoup(html.replace('\n', ''), parse_only=SoupStrainer('a'))

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
         links.append(link['href'].replace('%20', ' '))

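# keep each link once per occurrence in the text after "Result";
# dict.fromkeys de-duplicates while preserving document order (set() does not)
n_links = list()
for i in dict.fromkeys(links):
    n_links.extend([i] * need.count(i))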

print(n_links)
