Multithreaded web scraper storing values to a Pandas dataframe

1 vote
1 answer
754 views
Asked 2025-04-18 06:13

I've written a web scraper in Python 3.4 to extract data from webpages and store it in a Pandas dataframe (before saving it to an Excel file). This works fine, but the number of pages I want to scrape now exceeds 100,000, which means my current single-threaded approach takes far too long.

The links to the pages I need to scrape are known in advance; here is part of my current code:

for a in UrlRefs:  # This is the master list of URLs to scrape (there are around 5000)
    for i in SecondRefs.index:  # For each main page there are 29 subpages. The subpage references are held in a Pandas dataframe 'SecondRefs' along with a description of each page.

        # The OpenPages function does the actual scraping (using BeautifulSoup to parse the pages)
        TempDF = OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])
        MainDF = pd.concat([MainDF, TempDF], ignore_index=True)

I haven't used multithreading in Python before, but my guess is that the 29 subpages could be fetched in parallel with threads. I tried this:

# code modified from http://www.quantstart.com/articles/Parallelising-Python-with-Threading-and-Multiprocessing
jobs = []
for i in range(1, 29):
    thread = threading.Thread(target=OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1]))
    jobs.append(thread)
    print("thread ", i)

# Start the threads
for j in jobs:
    j.start()

# Ensure all of the threads have finished
for j in jobs:
    j.join()

MainDF = pd.concat([MainDF, jobs], ignore_index=True)

However, this errored when run, and it didn't seem to speed anything up (creating the threads took around 20 seconds, at which point the error occurred; even setting the error aside, it was no faster than the single-threaded version).

My questions are:

1) How can I best implement multithreading in this code to speed it up?

2) How do I then merge the values returned by each thread back into the main Pandas dataframe?

Thanks for your help.

Edit:

Thanks everyone for the replies.

For what it's worth, I've found a solution to my problem (feel free to comment or make suggestions):

from multiprocessing.pool import ThreadPool
from random import randrange
import time

pool = ThreadPool(processes=29)
threads = []
for i in range(1, 29):
    threads.append(pool.apply_async(OpenPages, (a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])))
    time.sleep(randrange(25, 35) / 100)  # stagger the requests slightly

# Collect each thread's dataframe as it completes
for t in threads:
    MainDF = pd.concat([MainDF, t.get()], ignore_index=True)

pool.close()
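
For comparison, here is a sketch of the same pattern using concurrent.futures (in the standard library since Python 3.2), assuming the same OpenPages, SecondRefs, a, and MainDF names from the question:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def fetch(i):
    # i indexes one row of SecondRefs, exactly as in the loop above
    return OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])

with ThreadPoolExecutor(max_workers=29) as executor:
    frames = list(executor.map(fetch, range(1, 29)))

MainDF = pd.concat([MainDF] + frames, ignore_index=True)

executor.map preserves the input order, so the dataframes are concatenated in the same order a serial loop would produce them.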

1 Answer

0
thread = threading.Thread(target=OpenPages(a, SecondRefs.iloc[i,0], SecondRefs.iloc[i,1]))

should be

thread = threading.Thread(target=OpenPages, args=(a, SecondRefs.iloc[i,0], SecondRefs.iloc[i,1]))

Writing target=OpenPages(...) calls OpenPages immediately in the main thread and passes its return value as the thread target; that is why creating the "threads" took 20 seconds and ran no faster than the serial version. Pass the callable and its arguments separately.
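
To also answer question 2, here is a minimal sketch of the corrected pattern: each thread runs a hypothetical worker wrapper around the OpenPages function from the question and appends its dataframe to a shared list, and the main thread concatenates the results after join():

import threading
import pandas as pd

results = []             # each worker appends its dataframe here
lock = threading.Lock()  # guards concurrent appends to the list

def worker(url, ref, desc):
    df = OpenPages(url, ref, desc)  # OpenPages as defined in the question
    with lock:
        results.append(df)

jobs = []
for i in range(1, 29):
    jobs.append(threading.Thread(target=worker, args=(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])))

for j in jobs:
    j.start()
for j in jobs:
    j.join()

MainDF = pd.concat([MainDF] + results, ignore_index=True)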
