BeautifulSoup produces inconsistent results

sidebar_urls = []
for i in range(0, len(reddit_urls)):
    req = urllib.request.Request(reddit_urls[i], headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, 'html.parser')
    links = soup.find_all(href=True)
    for link in links:
        if "XYZ.com" in str(link['href']):
            sidebar_urls.append(link['href'])

1 Answer

User

#1 · Posted 2024-05-15 15:22:04

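It looks like you sometimes receive a page that has no sidebar. This may be because Reddit identifies you as a bot and returns a default page instead of the one you expect. Consider identifying yourself with the User-Agent header when you request the page: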

import requests
from bs4 import BeautifulSoup

reddit_urls = [
    "https://www.reddit.com/r/leagueoflegends/",
    "https://www.reddit.com/r/pokemon/"
]

# Update this to identify yourself
user_agent = "me@example.com"

sidebar_urls = []
for reddit_url in reddit_urls:
    response = requests.get(reddit_url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the sidebar tag
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        print("Could not find a sidebar in page: {}".format(reddit_url))
        continue

    # Find all links in the sidebar tag, skipping <a> tags that have no href
    link_tags = side_tag.find_all("a", href=True)
    for link in link_tags:
        sidebar_urls.append(link["href"])

print(sidebar_urls)
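
To see the two cases without hitting Reddit at all, the sidebar lookup can be pulled into a small function and run against fixed HTML. This is only a sketch: the function name extract_sidebar_links and both HTML snippets are made up for illustration, and the real Reddit markup may differ from the "side" div assumed here.

```python
from bs4 import BeautifulSoup


def extract_sidebar_links(html):
    """Return the hrefs inside <div class="side">, or None if that div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    side_tag = soup.find("div", class_="side")
    if side_tag is None:
        # Same situation as the "Could not find a sidebar" branch above
        return None
    # href=True skips <a> tags that carry no href attribute
    return [a["href"] for a in side_tag.find_all("a", href=True)]


# Hypothetical page bodies, invented for this example
page_with_sidebar = (
    '<div class="side">'
    '<a href="https://example.com/wiki">Wiki</a>'
    '<a name="top">anchor without href</a>'
    "</div>"
    '<a href="https://example.com/other">Outside the sidebar</a>'
)
page_without_sidebar = (
    '<div class="content"><a href="https://example.com/other">Other</a></div>'
)

print(extract_sidebar_links(page_with_sidebar))
print(extract_sidebar_links(page_without_sidebar))
```

The first call returns only the link inside the sidebar div, and the second returns None. Returning None rather than an empty list lets the caller tell "no sidebar in this response" apart from "sidebar present but empty", which is the distinction that matters when the server sometimes sends a different page.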
