Unable to parse product names from some links with different depths

3 Answers

User

Answer 1 · edited 2024-04-23 14:01:52

I suggest you start your scrape from the site's sitemap page.

Found here: https://www.courts.com.sg/sitemap

If they add products, they will most likely show up there too.

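User

Answer 2 · edited 2024-04-23 14:01:52

Since your main problem is finding the links, here is a generator that uses the sitemap krflol pointed out in their solution to find all category and subcategory links: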

from bs4 import BeautifulSoup
import requests


def category_urls():
    response = requests.get('https://www.courts.com.sg/sitemap')
    html_soup = BeautifulSoup(response.text, features='html.parser')
    categories_sitemap = html_soup.find(attrs={'class': 'xsitemap-categories'})

    for category_a_tag in categories_sitemap.find_all('a'):
        yield category_a_tag.attrs['href']

To find the product names, simply scrape each of the URLs yielded by category_urls.
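As a minimal sketch of that scraping step, the function below pulls product names out of one category page's HTML. The function name `product_names` is made up for illustration, and the CSS selector is an assumption borrowed from answer 3 in this thread, so it should be checked against the live markup. The sample HTML is hand-written so the snippet runs without a network call.

```python
from bs4 import BeautifulSoup


def product_names(html):
    """Return the product names found in one category page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumed selector (taken from answer 3); verify it on the real pages.
    tags = soup.select('.product-item-info .product-item-link')
    return [tag.get_text(strip=True) for tag in tags]


# Tiny hand-written sample, not real site markup.
sample = '''
<div class="product-item-info">
  <a class="product-item-link" href="#"> Example Sofa </a>
</div>
<div class="product-item-info">
  <a class="product-item-link" href="#">Example Table</a>
</div>
'''
print(product_names(sample))
```

In practice you would call `product_names(requests.get(url).text)` for each `url` from `category_urls()`. Since subcategory pages may repeat products already listed under their parent category, collecting the names into a `set` is one way to drop duplicates.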

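User

Answer 3 · edited 2024-04-23 14:01:52

The site has six main product categories. Products that belong to a subcategory can also be found in the main category (for example, the products in /furniture/furniture/tables can also be found in /furniture), so you only need to collect products from the main categories. You could get the category links from the home page, but using the sitemap is easier.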

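import requests
from bs4 import BeautifulSoup

url = 'https://www.courts.com.sg/sitemap/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# The first six top-level entries are the main categories.
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]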

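As you mentioned, some links have a different structure, like this one: /televisions. However, if you click the View All Products link on that page, you are redirected to /tv-entertainment/vision/television. So you can get all the /televisions products from /tv-entertainment. Likewise, the products in the brand links can also be found in the main categories. For example, the /asus products can be found in /computing-mobile and other categories.

The code below collects products from all the main categories, so it should collect every product on the site.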

(The code listing that belonged here is missing from this copy.)
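As a rough sketch of the pagination step that a sequential collector needs (not the answerer's original code), the helpers below read the page count from a "last page" link and build one listing URL per page. They reuse the `product_list_limit` and `p` query parameters that appear in the threaded version in this answer. The helper names `last_page_number` and `page_urls` and the sample `href` are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def last_page_number(last_href):
    """Read the page count from the href of the 'last page' pagination link."""
    query = parse_qs(urlparse(last_href).query)
    # Fall back to a single page when the link carries no `p` parameter.
    return int(query.get('p', ['1'])[0])


def page_urls(category_url, last_page, per_page=24):
    """Yield one listing URL per page of a category, in order."""
    for page in range(1, last_page + 1):
        yield '{}?product_list_limit={}&p={}'.format(category_url, per_page, page)


# Hypothetical href, shaped like the one the `a.page.last` link would carry.
href = 'https://www.courts.com.sg/furniture?p=3&product_list_limit=24'
for url in page_urls('https://www.courts.com.sg/furniture', last_page_number(href)):
    print(url)
```

A sequential collector would fetch and parse these URLs one after another, which is why it is slow on a catalogue with hundreds of pages.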

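I have increased the number of products per page to 24, but this code still takes a long time because it collects the products of every main category along with their pagination links. However, we can use threads to speed it up.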

from bs4 import BeautifulSoup
import requests
from threading import Thread, Lock
from urllib.parse import urlparse, parse_qs

lock = Lock()    # guards the shared products list and keeps the print output tidy
threads = 10     # maximum number of pages fetched at the same time
products = []

def get_products(link, products):
    # Download one listing page and append its product names to the shared list.
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    tags = soup.select(".product-item-info .product-item-link")
    with lock:
        products += [tag.get_text(strip=True) for tag in tags]
        print('page:', link, 'items:', len(tags))

# Take the six main category links from the sitemap.
url = 'https://www.courts.com.sg/sitemap/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
cats = soup.select('li.level-0.category > a')[:6]
links = [i['href'] for i in cats]

for link in links:
    link += '?product_list_limit=24'
    soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    # The "last page" link carries the page count in its p query parameter.
    last_page = soup.select_one('a.page.last')['href']
    last_page = int(parse_qs(urlparse(last_page).query)['p'][0])
    threads_list = []

    for i in range(1, last_page + 1):
        page = '{}&p={}'.format(link, i)
        thread = Thread(target=get_products, args=(page, products))
        thread.start()
        threads_list += [thread]
        # Wait for the current batch to finish before starting the next one.
        if i % threads == 0 or i == last_page:
            for t in threads_list:
                t.join()
            threads_list = []  # do not re-join finished threads in later batches

print(len(products))
print('\n'.join(products))

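This code collected 18,466 products from 773 pages in about 5 minutes. I used 10 threads because I did not want to put too much load on the server, but you can use more (most servers can easily handle 20 threads).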
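As an alternative to starting and joining batches of `Thread` objects by hand, the standard library's `concurrent.futures.ThreadPoolExecutor` caps the number of concurrent workers for you. This is a minimal sketch, not what the code above does: `collect_products` is a made-up name, and `fake_fetch_names` is a hypothetical stand-in for the real request-and-parse step so the snippet runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_products(page_links, fetch_names, max_workers=10):
    """Fetch every page with at most `max_workers` threads and flatten the results."""
    products = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as page_links.
        for names in pool.map(fetch_names, page_links):
            products.extend(names)
    return products


def fake_fetch_names(url):
    # Hypothetical stand-in: a real version would download `url` and parse it.
    return ['item from ' + url]


pages = ['page-1', 'page-2', 'page-3']
print(collect_products(pages, fake_fetch_names))
```

Because only the main thread appends to `products`, no lock is needed here, and the pool keeps all workers busy instead of waiting for the slowest page in each batch.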
