How to scrape specific images from a forum with BeautifulSoup (excluding thumbnails, icons, etc.)

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext)
        for link in soup.findAll('img src'):
            print(link)
        page += 1

spider(1)

1 answer

User

#1 · Posted 2024-05-20 00:05:45

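Try using tag.get('src') instead of soup.findAll('img src'). The search method expects a tag name, and 'img src' is not one, so nothing matches. Find the img tags first, then read the src attribute from each: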

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        # Name the parser explicitly so the result does not depend on what is installed
        soup = BeautifulSoup(plaintext, 'html.parser')

        for tag in soup.find_all('img'):  # match the <img> tags themselves
            print(tag.get('src'))   # use `tag.get('src')` to read each tag's src attribute

        page += 1

spider(1)

See the Beautiful Soup documentation for more details.
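To see why the original call found nothing, here is a minimal offline sketch. The HTML snippet and its URLs are invented for illustration and are not taken from the forum. It assumes only that `find_all` matches on tag names and that `tag.get` returns `None` for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a forum page; the URLs are invented.
html = """
<div>
  <img src="http://example.com/images/photo1.jpg" alt="" border="0">
  <img src="/images/icons/smile.gif" class="inlineimg">
  <img alt="no source here">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# 'img src' is treated as a tag name, and no tag has that name.
no_matches = soup.find_all("img src")

# Find the <img> tags, then read the src attribute from each one.
sources = [tag.get("src") for tag in soup.find_all("img")]

print(no_matches)
print(sources)
```

The first print shows an empty list. The second shows the two `src` values followed by `None` for the tag that has no `src`, which is why a download loop should skip empty links.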

If you need to download the images, you can also use requests to fetch each image's content and write it to a file. Here is a demo:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext)

        for tag in soup.find_all('img'):
            link = tag.get('src')  # get the link
            if not link:
                continue  # skip <img> tags that have no src

            # Check whether the tag's remaining attributes match the expected format.
            # The exact attrs seen here depend on the page markup and on the parser in use.
            del tag['src']
            if tag.attrs != {';': '', 'alt': '', 'border': '0'}:
                continue

            filename = link.strip('/').rsplit('/', 1)[-1]  # last path segment as the file name

            image = requests.get(link).content  # use requests to get the content of the image
            with open(filename, 'wb') as f:
                f.write(image)  # write the image to a file

        page += 1

spider(1)
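The selection step can also be tried without touching the network. The sketch below is one possible variation, not the answer's exact method. The function name `wanted_images`, the sample markup, the `forum.example.com` URLs, and the expected attribute set `{'alt': '', 'border': '0'}` are all assumptions made for illustration. It compares a copy of the attributes instead of deleting `src` from the tag, resolves relative links with `urljoin`, and takes the file name from the URL path so a query string does not end up in it.

```python
from urllib.parse import urljoin, urlsplit

from bs4 import BeautifulSoup


def wanted_images(html, page_url, expected_attrs):
    """Return (absolute_url, filename) pairs for <img> tags whose
    attributes, apart from src, equal expected_attrs exactly."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("img"):
        link = tag.get("src")
        if not link:
            continue  # nothing to download
        # Work on a copy so the tag itself is left unmodified.
        other_attrs = {k: v for k, v in tag.attrs.items() if k != "src"}
        if other_attrs != expected_attrs:
            continue  # e.g. icons that carry an extra class attribute
        absolute = urljoin(page_url, link)  # handles relative src values
        filename = urlsplit(absolute).path.rsplit("/", 1)[-1]
        results.append((absolute, filename))
    return results


# Hypothetical sample markup and page URL, invented for this sketch.
sample_html = """
<img src="http://example.com/pics/photo1.jpg" alt="" border="0">
<img src="/attachments/photo2.png?size=full" border="0" alt="">
<img src="images/smilies/smile.gif" class="inlineimg" alt="" border="0">
<img alt="" border="0">
"""
page_url = "http://forum.example.com/showthread.php?t=1&page=1"

found = wanted_images(sample_html, page_url, {"alt": "", "border": "0"})
for absolute, filename in found:
    print(absolute, "->", filename)
```

Each returned URL can then be passed to `requests.get(...).content` and written to `filename` as in the demo above. The attribute set to match has to be read off the real page, since it is specific to how that forum marks up post images.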
