Multithreaded webpage scraper storing values into a Pandas DataFrame
I wrote a web scraper in Python 3.4 to extract data from web pages and store it in a Pandas DataFrame (before saving it to an Excel file). This works correctly. However, the number of pages I want to scrape has now grown past 100,000, which means my current single-threaded approach takes far too long.
The URLs of the pages I need to scrape are known in advance; here is part of my current code:
for a in UrlRefs:  # the master list of URLs to scrape (around 5,000)
    for i in SecondRefs.index:  # each main page has 29 subpages; their references and descriptions are held in the DataFrame 'SecondRefs'
        # the OpenPages function performs the actual scraping (using BeautifulSoup to parse the pages)
        TempDF = OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])
        MainDF = pd.concat([MainDF, TempDF], ignore_index=True)
I haven't used multithreading in Python before, but I assume the 29 subpages could each be handled by their own thread. This is what I tried:
# code modified from http://www.quantstart.com/articles/Parallelising-Python-with-Threading-and-Multiprocessing
jobs = []
for i in range(1, 29):
    thread = threading.Thread(target=OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1]))
    jobs.append(thread)
    print("thread ", i)

# Start the threads
for j in jobs:
    j.start()

# Ensure all of the threads have finished
for j in jobs:
    j.join()

MainDF = pd.concat([MainDF, jobs], ignore_index=True)
However, the code above raised an error when run, and it didn't seem to speed anything up either (creating the threads alone took about 20 seconds before the error occurred; even ignoring the error, this was no faster than the single-threaded version).
My questions are:
1) How can I best implement multithreading in my code to speed up the processing?
2) How do I then merge the values returned by each thread back into the main Pandas DataFrame?
Thanks for any help.
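For context, one possible answer to both questions is a minimal sketch using the standard-library `concurrent.futures` module: `scrape_page` below is a hypothetical stand-in for the asker's `OpenPages`, and the URL and subpage data are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def scrape_page(url, ref, desc):
    # placeholder for the real BeautifulSoup scraping logic;
    # the real function would fetch and parse the page here
    return pd.DataFrame({"url": [url], "ref": [ref], "desc": [desc]})

# stand-in for the 29 subpage references held in SecondRefs
subpages = [("ref%d" % i, "page %d" % i) for i in range(29)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map runs scrape_page in worker threads and yields results in input order
    frames = list(pool.map(lambda p: scrape_page("http://example.com", *p), subpages))

# concatenate once, after all threads finish, rather than inside the loop
MainDF = pd.concat(frames, ignore_index=True)
```

Collecting the per-thread DataFrames into a list and calling `pd.concat` once at the end also avoids the quadratic cost of re-concatenating `MainDF` on every iteration.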
Edit:
Thanks everyone for the replies.
For what it's worth, I've found a solution that works for my problem (though feel free to comment or make suggestions):
from multiprocessing.pool import ThreadPool
from random import randrange
import time

pool = ThreadPool(processes=29)
thread = []
for i in range(1, 29):
    thread.append(pool.apply_async(OpenPages, (a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])))
    time.sleep(randrange(25, 35) / 100)  # small random delay between requests

# Collect the result from each thread
for result in thread:
    MainDF = pd.concat([MainDF, result.get()], ignore_index=True)

pool.close()
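The `apply_async`/`get` pattern above can also be written with `ThreadPool.map`, which submits the work and collects the results in order in a single call. A sketch under the same assumptions (`open_pages` here is a hypothetical single-argument wrapper, not the asker's actual `OpenPages`):

```python
from multiprocessing.pool import ThreadPool
import pandas as pd

def open_pages(args):
    # hypothetical wrapper: ThreadPool.map passes one argument per task,
    # so the (url, ref, desc) tuple is unpacked here
    url, ref, desc = args
    return pd.DataFrame({"url": [url], "ref": [ref], "desc": [desc]})

tasks = [("http://example.com", i, "page %d" % i) for i in range(29)]

with ThreadPool(processes=8) as pool:
    # map blocks until all tasks finish and returns results in input order
    frames = pool.map(open_pages, tasks)

MainDF = pd.concat(frames, ignore_index=True)
```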
1 Answer

    thread = threading.Thread(target=OpenPages(a,SecondRefs.iloc[i,0],SecondRefs.iloc[i,1]))

should be

    thread = threading.Thread(target=OpenPages, args=(a,SecondRefs.iloc[i,0],SecondRefs.iloc[i,1]))
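The difference matters because `target=OpenPages(...)` calls the function immediately in the main thread and passes its return value to the `Thread` constructor, so nothing actually runs in parallel. A toy illustration with a made-up `work` function:

```python
import threading

results = []

def work(n):
    results.append(n * 2)

# Wrong: work(5) runs eagerly in the main thread; the Thread receives
# target=None (work's return value) and does nothing when started
t_wrong = threading.Thread(target=work(5))
t_wrong.start()
t_wrong.join()

# Right: the function object and its arguments are handed to the thread,
# which calls work(10) itself
t_right = threading.Thread(target=work, args=(10,))
t_right.start()
t_right.join()

print(results)  # [10, 20]: the first call ran eagerly, the second in the worker thread
```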