I'm scraping some web pages with beautifulsoup4 and multithreading. Each thread takes a job (a URL) from a queue and calls parse_results_page, which makes an HTTP request (via get_html_page_source()) and then parses every item on the page in a for loop.
What I've noticed: with 5 threads, parsing a page takes 0.7-8 seconds. With 50 threads, parsing a page takes 25-60 seconds. Even with 10 threads the runtime grows noticeably.
*The parsing time is measured after the HTTP request has returned, so the difference is not due to my bandwidth or the server's response time.
Also, with 50 threads memory usage climbs to 6 GB.
Can someone explain why this happens, and how I can optimize my code for more threads?
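For context, the threads are launched roughly like the following (a simplified, self-contained sketch; in my real code the parse function is a method on a class, and the names `worker` and `run` are just for illustration here):

```python
import queue
import threading

def worker(jobs, results, parse_results_page):
    # Each thread pulls URLs from the job queue until it is drained.
    while True:
        try:
            url = jobs.get_nowait()
        except queue.Empty:
            return
        parse_results_page(url, results)

def run(urls, parse_results_page, num_threads=5):
    jobs, results = queue.Queue(), queue.Queue()
    for url in urls:
        jobs.put(url)
    threads = [threading.Thread(target=worker,
                                args=(jobs, results, parse_results_page))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers are done, so the result queue is stable here.
    return [results.get() for _ in range(results.qsize())]
```

The timings above come from varying `num_threads` between 5 and 50 with this kind of driver.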
My code is below. Thanks!
def parse_item(self, item, *args):
    """Parse a single listing.

    @item - bs4.Tag for one <article> element
    @args - (category, subcategory) strings
    """
    item_url = item.find('h2', class_="heading")
    item_url = re.sub(r'\.html?.+$', '.html', item_url.a['href'])
    if 'redirect' in item_url or '/external/url' in item_url:
        return
    item_title = item.find('span', class_="mp-listing-title")
    item_title = item_title.get_text().strip() if item_title else None
    description = item.find('div', class_="listing-title-description")
    if description:
        description.h2.extract()
        description = ' '.join(description.get_text().strip().split())
    price = item.find('span', class_="price")
    if price:
        try:
            # "1.234,56" -> "1234.56"
            raw_price = price.get_text().strip().replace(
                '.', '').replace(',', '.')
            currency, price = raw_price.split()
            try:
                currency = self.currency_code[currency.strip()]
            except KeyError:
                currency = None
                price = raw_price
        except ValueError:
            price = raw_price
            currency = None
    else:
        price, currency = None, None
    seller_id = item.find('div', class_="seller-name")
    if seller_id:
        seller_id = seller_id.find('a')
        if seller_id:
            seller_id = seller_id['href'].split('/')[-1].replace('.html', '')
    location = item.find('div', class_='location-name')
    if location:
        location = location.get_text().split(',')
        city = location[0]
        region = location[1].strip() if len(location) > 1 else None
    else:
        city, region = None, None
    # Dutch relative dates: Vandaag = today, Gisteren = yesterday,
    # Eergisteren = the day before yesterday.
    date_string = item.find('div', class_="date").get_text().strip()
    today = datetime.datetime.now().date()
    if 'Vandaag' in date_string:
        date_posted = today
    elif 'Gisteren' in date_string:
        date_posted = today - datetime.timedelta(1)
    elif 'Eergisteren' in date_string:
        date_posted = today - datetime.timedelta(2)
    else:
        date_posted = datetime.datetime.strptime(
            date_string.replace('.', ''), "%d %b '%y")
    date_posted = date_posted.strftime('%Y-%m-%d')
    date_scraped = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    return {"Url": item_url,
            "Category": args[0],
            "AdTitle": item_title,
            "Subcategory": args[1],
            "Description": description,
            "Currency": currency,
            "Price": price,
            "SellerId": seller_id,
            "SellerPhone": None,
            "SellerWebsite": None,
            "SellerEmail": None,
            "SellerCity": city,
            "SellerCountry": None,
            "SellerRegion": region,
            "DatePosted": date_posted,
            "DateSaved": date_scraped}

def parse_results_page(self, url, queue):
    """Get all items from a given page.

    @url - string
    @queue - output queue
    """
    page = self.get_html_page_source(url)
    if not page:
        return
    page = BeautifulSoup(page, "html5lib")
    items = page.find_all('article')
    breadcrumb = page.find('ul', class_="breadcrumbs").find_all('span')
    category = breadcrumb[0].get_text().strip()
    subcategory = breadcrumb[1].get_text().strip()
    for item in items:
        queue.put(self.parse_item(item, category, subcategory))