Multithreaded web scraper storing values to a Pandas dataframe

1 vote
1 answer
754 views
Asked 2025-04-18 06:13

I've written a web scraper in Python 3.4 to extract data from webpages and store it in a Pandas dataframe (before saving it to an Excel file). This works fine, but the number of pages I want to scrape now exceeds 100,000, which means my current single-threaded approach takes far too long.

The links to the pages I need to scrape are known in advance; here is part of my current code:

for a in UrlRefs:  # This is the master list of URLs to scrape (there are around 5000)
    for i in SecondRefs.index:  # For each main page there are 29 subpages. The subpage references are held in a Pandas dataframe 'SecondRefs' along with a description of each page.

        # The OpenPages function does the actual scraping (using BeautifulSoup to parse the pages)
        TempDF = OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])
        MainDF = pd.concat([MainDF, TempDF], ignore_index=True)

I haven't used multithreading in Python before, but my guess is that the 29 subpages could be fetched in parallel with threads. I tried this:

# code modified from http://www.quantstart.com/articles/Parallelising-Python-with-Threading-and-Multiprocessing
jobs = []
for i in range(1, 29):
    thread = threading.Thread(target=OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1]))
    jobs.append(thread)
    print("thread ", i)

# Start the threads
for j in jobs:
    j.start()

# Ensure all of the threads have finished
for j in jobs:
    j.join()

MainDF = pd.concat([MainDF, jobs], ignore_index=True)

However, this errored when run, and it didn't seem to speed anything up (creating the threads took around 20 seconds, at which point the error occurred; even setting the error aside, it was no faster than the single-threaded version).

My questions are:

1) How can I best implement multithreading in this code to speed it up?

2) How do I then merge the values returned by each thread back into the main Pandas dataframe?

Thanks for your help.

Edit:

Thanks everyone for the replies.

For what it's worth, I've found a solution to my problem (feel free to comment or make suggestions):

from multiprocessing.pool import ThreadPool
from random import randrange
import time

pool = ThreadPool(processes=29)
threads = []
for i in range(1, 29):
    threads.append(pool.apply_async(OpenPages, (a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])))
    time.sleep(randrange(25, 35) / 100)  # stagger the requests slightly

# Collect each thread's dataframe as it completes
for t in threads:
    MainDF = pd.concat([MainDF, t.get()], ignore_index=True)

pool.close()
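
For comparison, here is a sketch of the same pattern using concurrent.futures (in the standard library since Python 3.2), assuming the same OpenPages, SecondRefs, a, and MainDF names from the question:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def fetch(i):
    # i indexes one row of SecondRefs, exactly as in the loop above
    return OpenPages(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])

with ThreadPoolExecutor(max_workers=29) as executor:
    frames = list(executor.map(fetch, range(1, 29)))

MainDF = pd.concat([MainDF] + frames, ignore_index=True)

executor.map preserves the input order, so the dataframes are concatenated in the same order a serial loop would produce them.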

1 Answer

0
thread = threading.Thread(target=OpenPages(a, SecondRefs.iloc[i,0], SecondRefs.iloc[i,1]))

should be

thread = threading.Thread(target=OpenPages, args=(a, SecondRefs.iloc[i,0], SecondRefs.iloc[i,1]))

Writing target=OpenPages(...) calls OpenPages immediately in the main thread and passes its return value as the thread target; that is why creating the "threads" took 20 seconds and ran no faster than the serial version. Pass the callable and its arguments separately.
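
To also answer question 2, here is a minimal sketch of the corrected pattern: each thread runs a hypothetical worker wrapper around the OpenPages function from the question and appends its dataframe to a shared list, and the main thread concatenates the results after join():

import threading
import pandas as pd

results = []             # each worker appends its dataframe here
lock = threading.Lock()  # guards concurrent appends to the list

def worker(url, ref, desc):
    df = OpenPages(url, ref, desc)  # OpenPages as defined in the question
    with lock:
        results.append(df)

jobs = []
for i in range(1, 29):
    jobs.append(threading.Thread(target=worker, args=(a, SecondRefs.iloc[i, 0], SecondRefs.iloc[i, 1])))

for j in jobs:
    j.start()
for j in jobs:
    j.join()

MainDF = pd.concat([MainDF] + results, ignore_index=True)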
