Getting links from HTML after a specific point

<html>
<body>
    <p class="fixedfonts">
        <a href="A.pdf">LINK1</a>
    </p>

    <h2>Results</h2>

    <p class="fixedfonts">
        <a href="B.pdf">LINK2</a>
    </p>

    <p class="fixedfonts">
        <a href="C.pdf">LINK3</a>
    </p>
</body>
</html>

from bs4 import BeautifulSoup, SoupStrainer

# at this point html contains the code as string

# parse the HTML file
soup = BeautifulSoup(html.replace('\n', ''), parse_only=SoupStrainer('a'))

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))

print(links)

3 Answers

User

Answer 1 · edited 2024-05-19 02:12:15

Split the HTML string into two parts, before and after "Results", then process only the second part:

data = html.split("Results")
need = data[1]

So the whole thing comes down to this:

from bs4 import BeautifulSoup, SoupStrainer
data = html.split("Results")
need = data[1]
soup = BeautifulSoup(need.replace('\n', ''), parse_only=SoupStrainer('a'))
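
As a fuller sketch of the split approach, the snippet below uses str.partition so that a missing marker can be detected, then collects the href values from the part after it. The function name links_after_marker and the sample string are made up for illustration, and "html.parser" is passed explicitly on the assumption that the built-in parser is acceptable here.

```python
from bs4 import BeautifulSoup, SoupStrainer


def links_after_marker(html, marker="Results"):
    """Return the href values of <a> tags found after the first occurrence of marker."""
    # partition() returns (before, marker, after); the middle item is empty if marker is absent
    _, found, after = html.partition(marker)
    if not found:
        return []
    # parse only the <a> tags from the text that follows the marker
    soup = BeautifulSoup(after, "html.parser", parse_only=SoupStrainer("a"))
    return [a["href"] for a in soup.find_all("a", href=True)]


# hypothetical sample input, modelled on the HTML in the question
sample = (
    '<p><a href="A.pdf">LINK1</a></p>'
    '<h2>Results</h2>'
    '<p><a href="B.pdf">LINK2</a></p>'
    '<p><a href="C.pdf">LINK3</a></p>'
)
print(links_after_marker(sample))
```

Keep in mind that splitting on plain text is fragile: if the word "Results" also appears earlier in the page (in a link title, for example), the split happens there instead of at the heading.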

User

Answer 2 · edited 2024-05-19 02:12:15

You can use the find_all_next() method to solve this:

results = soup.find("h2", text="Results")
for link in results.find_all_next("a"):
    print(link.get("href"))

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <html>
...     <body>
...         <p class="fixedfonts">
...             <a href="A.pdf">LINK1</a>
...         </p>
... 
...         <h2>Results</h2>
... 
...         <p class="fixedfonts">
...             <a href="B.pdf">LINK2</a>
...         </p>
... 
...         <p class="fixedfonts">
...             <a href="C.pdf">LINK3</a>
...         </p>
...     </body>
... </html>"""
>>> 
>>> soup = BeautifulSoup(data, "html.parser")
>>> results = soup.find("h2", text="Results")
>>> for link in results.find_all_next("a"):
...     print(link.get("href"))
... 
B.pdf
C.pdf
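
The same idea can be wrapped in a small helper that also copes with a page where the heading is missing. This is only a sketch: the function name links_after_heading, its parameters, and the sample string are assumptions for illustration. It uses string= rather than text=, since newer Beautiful Soup releases prefer that spelling for the same filter.

```python
from bs4 import BeautifulSoup


def links_after_heading(html, heading_text="Results", tag="h2"):
    """Return the href values of all <a> tags that come after the given heading."""
    soup = BeautifulSoup(html, "html.parser")
    # find the heading element whose text is exactly heading_text
    heading = soup.find(tag, string=heading_text)
    if heading is None:
        # no such heading in the document, so there is nothing after it
        return []
    # find_all_next() walks everything that follows the heading in document order
    return [a["href"] for a in heading.find_all_next("a", href=True)]


# hypothetical sample input, modelled on the HTML in the question
sample = (
    '<html><body>'
    '<p class="fixedfonts"><a href="A.pdf">LINK1</a></p>'
    '<h2>Results</h2>'
    '<p class="fixedfonts"><a href="B.pdf">LINK2</a></p>'
    '<p class="fixedfonts"><a href="C.pdf">LINK3</a></p>'
    '</body></html>'
)
print(links_after_heading(sample))
```

Unlike splitting the raw string, this matches the heading element itself, so the word "Results" appearing elsewhere in the page does not affect the outcome.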

User

Answer 3 · edited 2024-05-19 02:12:15

Tested, and it seems to work.

from bs4 import BeautifulSoup, SoupStrainer

html = '''<html>
<body>
    <p class="fixedfonts">
        <a href="A.pdf">LINK1</a>
    </p>

    <h2>Results</h2>

    <p class="fixedfonts">
        <a href="B.pdf">LINK2</a>
    </p>

    <p class="fixedfonts">
        <a href="B.pdf">LINK2</a>
    </p>

    <p class="fixedfonts">
        <a href="C.pdf">LINK3</a>
    </p>
</body>
</html>'''

# at this point html contains the code as string

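# split the document at the heading; dat[1] is everything after it
dat = html.split("Result")
need = dat[1]
# collect the <a> tags of the whole document; they are filtered against `need` further down
soup = BeautifulSoup(html.replace('\n', ''), "html.parser", parse_only=SoupStrainer('a'))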

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

links = list()
for link in soup:
    if link.has_attr('href'):
        links.append(link['href'].replace('%20', ' '))

# keep only the links that also occur in the part after the heading,
# in document order, repeated as often as they appear there
n_links = list()
for i in dict.fromkeys(links):
    n_links.extend([i] * need.count(i))

print(n_links)
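
A side note on the .replace('%20', ' ') step: it only handles encoded spaces. If the links might contain other percent-escapes, urllib.parse.unquote from the standard library decodes all of them. The href values below are invented purely to illustrate this.

```python
from urllib.parse import unquote

# hypothetical href values, not taken from the page in the question
hrefs = ["My%20File.pdf", "caf%C3%A9.pdf", "C.pdf"]

# unquote() decodes every %XX escape (as UTF-8 by default), not just %20
decoded = [unquote(h) for h in hrefs]
print(decoded)
```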
