Scraping a page with Python

Published 2024-03-29 11:10:33


I'm trying to extract links (href) that start with a specific word, but it returns an empty list even though the page source contains many links that meet the condition. I must be missing something. Here is my code:

import requests
from bs4 import BeautifulSoup
import string
import os
import re

def extract_href_page(page):
    soup = BeautifulSoup(page)

    all_links = []
    links = soup.find_all('a', pattern = re.compile(r'\w*first_word'))
    # pattern = re.compile(r'\w*recette')
    print(links)
    for link in links:
        all_links.append(link['href'])  # Save href only, for example.
    return all_links

for page_number in range(1, 63):
    requete = requests.get("https://www.website.com/pages/" + "page".capitalize() + "-" + str(page_number) + ".html")
    page = requete.content
    list_links = extract_href_page(page)
    print(list_links)
    for link in list_links:
         print(link)

1 answer

Forum user
#1 · Posted 2024-03-29 11:10:33

Try this:

import re
from bs4 import BeautifulSoup

def extract_href_page(page):
    soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
    all_links = []
    # find_all() has no `pattern` keyword argument; collect every anchor
    # that actually has an href, then filter the hrefs with the regex yourself
    links = soup.find_all('a', href=True)
    for link in links:
        if re.match(r"\w*first_word", link["href"], re.I):
            all_links.append(link["href"])
    return all_links
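The filtering step can be checked on its own, without fetching any page. This is a minimal sketch of the answer's regex logic; the hrefs and the prefix "recette" (taken from the commented-out pattern in the question) are made-up examples:

```python
import re

# Hypothetical hrefs standing in for what find_all('a', href=True) would return.
hrefs = ["recette-tarte.html", "/contact.html", "recettes/page-2.html"]

# re.match anchors at the start of the string, so only hrefs whose leading
# word characters are followed by "recette" pass the filter.
matched = [h for h in hrefs if re.match(r"\w*recette", h, re.I)]
print(matched)  # ['recette-tarte.html', 'recettes/page-2.html']
```

Note that `re.match` only looks at the start of the string; if the target word can appear anywhere in the href, use `re.search` instead, and if it must be a literal prefix, plain `h.startswith("recette")` is simpler still.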
