Python BeautifulSoup and multithreading performance



I'm parsing some web pages using beautifulsoup4 and multithreading. Each thread takes a job (a URL) from a queue and calls parse_results_page, which makes an HTTP request (via get_html_page_source()) and then parses each item on the page in a for loop.
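The surrounding worker loop isn't included in the post; a minimal sketch of what it presumably looks like is below, where `worker`, `job_queue`, `out_queue`, and `scraper` are assumed names:

    import queue
    import threading

    def worker(scraper, job_queue, out_queue):
        # Each thread repeatedly takes a URL off the job queue and parses
        # that page, pushing the parsed items onto the output queue.
        while True:
            url = job_queue.get()
            if url is None:              # sentinel: no more jobs, shut down
                job_queue.task_done()
                break
            scraper.parse_results_page(url, out_queue)
            job_queue.task_done()

    # Assumed wiring for, e.g., 5 threads:
    # job_queue, out_queue = queue.Queue(), queue.Queue()
    # for _ in range(5):
    #     threading.Thread(target=worker, args=(scraper, job_queue, out_queue)).start()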

What I've noticed: with 5 threads it takes 0.7-8 seconds to parse a page; with 50 threads it takes 25-60 seconds. Even with 10 threads the runtime grows considerably.

*The time is measured after the HTTP request has completed, so the difference is not caused by my bandwidth or by the server's response time.
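Concretely, the measurement wraps only the parsing step, along these lines (a sketch under that assumption; `page_source` stands in for the already-downloaded HTML):

    import time
    from bs4 import BeautifulSoup

    def time_parse_only(page_source):
        """Time just the parse, so network latency is excluded."""
        start = time.perf_counter()
        soup = BeautifulSoup(page_source, "html5lib")
        soup.find_all('article')
        return time.perf_counter() - start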

On top of that, running 50 threads drives memory usage up to 6 GB.

Can anyone explain why this happens, and how I can optimize my code for a higher thread count?

My code is below. Thanks!

    import re
    import datetime

    from bs4 import BeautifulSoup

    def parse_item(self, item, *args):
        """Parse a single listing.
        @item - bs4.Tag for one <article> element
        @args - (category, subcategory)
        Returns a dict, or None for redirect/external listings.
        """
        item_url = item.find('h2', class_="heading")
        item_url = re.sub(r'\.html?.+$', '.html', item_url.a['href'])
        if 'redirect' in item_url or '/external/url' in item_url:
            return None

        item_title = item.find('span', class_="mp-listing-title")
        item_title = item_title.get_text().strip() if item_title else None

        description = item.find('div', class_="listing-title-description")
        if description:
            description.h2.extract()
            # Collapse all whitespace runs into single spaces.
            description = ' '.join(description.get_text().split())

        # Prices look like "EUR 1.234,56"; strip the thousands separators
        # and turn the decimal comma into a dot.
        price = item.find('span', class_="price")
        if price:
            raw_price = price.get_text().strip().replace('.', '').replace(',', '.')
            try:
                currency, price = raw_price.split()
                try:
                    currency = self.currency_code[currency.strip()]
                except KeyError:
                    currency, price = None, raw_price
            except ValueError:
                currency, price = None, raw_price
        else:
            price, currency = None, None

        seller_id = item.find('div', class_="seller-name")
        if seller_id:
            seller_id = seller_id.find('a')
        if seller_id:
            seller_id = seller_id['href'].split('/')[-1].replace('.html', '')

        location = item.find('div', class_='location-name')
        if location:
            location = location.get_text().split(',')
            city = location[0]
            region = location[1].strip() if len(location) > 1 else None
        else:
            city, region = None, None

        # Dates are Dutch: "Vandaag" = today, "Gisteren" = yesterday,
        # "Eergisteren" = the day before yesterday; anything else is an
        # explicit date such as "12 apr '16".
        date_string = item.find('div', class_="date").get_text().strip()
        today = datetime.date.today()
        if 'Vandaag' in date_string:
            date_posted = today.strftime('%Y-%m-%d')
        elif 'Gisteren' in date_string:
            date_posted = (today - datetime.timedelta(1)).strftime('%Y-%m-%d')
        elif 'Eergisteren' in date_string:
            date_posted = (today - datetime.timedelta(2)).strftime('%Y-%m-%d')
        else:
            date_posted = datetime.datetime.strptime(
                date_string.replace('.', ''), "%d %b '%y").strftime('%Y-%m-%d')

        date_scraped = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        return {"Url": item_url,
                "Category": args[0],
                "AdTitle": item_title,
                "Subcategory": args[1],
                "Description": description,
                "Currency": currency,
                "Price": price,
                "SellerId": seller_id,
                "SellerPhone": None,
                "SellerWebsite": None,
                "SellerEmail": None,
                "SellerCity": city,
                "SellerCountry": None,
                "SellerRegion": region,
                "DatePosted": date_posted,
                "DateSaved": date_scraped}

    def parse_results_page(self, url, queue):
        """Get all items from a given results page.
        @url - string
        @queue - output queue of parsed item dicts
        """
        page = self.get_html_page_source(url)
        if not page:
            return
        page = BeautifulSoup(page, "html5lib")
        items = page.find_all('article')
        breadcrumb = page.find('ul', class_="breadcrumbs").find_all('span')
        category = breadcrumb[0].get_text().strip()
        subcategory = breadcrumb[1].get_text().strip()
        for item in items:
            queue.put(self.parse_item(item, category, subcategory))