为什么mongoDB导入一百万个html页面这么慢？

2024-05-19 01:06:27 发布

您现在位置：Python中文网/ 问答频道 /正文

7800

网友

男 | 程序猿一只，喜欢编程写python代码。

我是MongoDB的新手。你知道吗

最近，我爬了250万个网页用于个人培训。每页大约0.1 MB，并保存为html文件在我的硬盘上。你知道吗

我计划将所有html文件导入mongodb进行进一步处理，为了加快导入速度，我在python中使用了多处理模型。这是我的密码：



    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import re, sys, os, time, subprocess
    from pymongo import MongoClient
    from multiprocessing.dummy import Pool


    ##connect to default MongoClient
    client = MongoClient()
    ## Link to a database, if the database not exist, create it.
    db = client['mongoF']
    ## create a collection (like a table)
    collection = db['HTMLs']

    shellMSG = subprocess.check_output("find /home/pages_8 -type f", shell=True)
    html_ls =  shellMSG.split('\n')

    def html2mongo(html):
        item = {
            "_id"  : html[-13:-5],
            "category"  : re.search("pages_.*?/(.*)/", html).group(1).split('/'),
            "bs"        : open(html).read()
        }
        collection.insert(item)
        msg = "\rImported " + str(collection.count()) + " files!"
        sys.stdout.write(msg); sys.stdout.flush()  

    t1 = time.time()

    pool = Pool(16)
    pool.map(html2mongo, html_ls) 
    pool.close()  
    pool.join()

    print "\nImporting used " + str(int(time.time() - t1)) + " seconds in total!\n"

我的电脑是现代游戏电脑，而且速度很快。CPU有8个核心。根据这个blog，我将池数设置为16。你知道吗

当执行上述脚本时，我发现只导入了130万个文件，因为存储mongoDB数据库的磁盘上没有剩余的空间。但是上面的处理花了7个多小时，我觉得很慢。你知道吗

我还发现虚拟内存非常巨大：372GB！你知道吗

Huge VIRT value showed by htop

我的问题是： 1导入一百万个这样的文件的正常速度是多少？ 2我应该用什么方法来改善进口？你知道吗

敬礼！你知道吗

Tags：文件 to from import re client time html

0条回答

目前没有回答

为什么mongoDB导入一百万个html页面这么慢？

相关问题更多 >

编程相关推荐

热门问题

热门文章

为什么mongoDB导入一百万个html页面这么慢？

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >