Parsing a large database file (over 4 GB) with Python

Published 2024-06-07 08:27:11


I am trying to parse a database file larger than 4 GB with Python.

A sample from the database file:

% Tags relating to '217.89.104.48 - 217.89.104.63'
% RIPE-USER-RESOURCE

inetnum:        194.243.227.240 - 194.243.227.255
netname:        PRINCESINDUSTRIEALIMENTARI
remarks:        INFRA-AW
descr:          PRINCES INDUSTRIE ALIMENTARI
descr:          Provider Local Registry
descr:          BB IBS
country:        IT
admin-c:        DUMY-RIPE
tech-c:         DUMY-RIPE
status:         ASSIGNED PA
notify:         order.manager2@telecomitalia.it
mnt-by:         INTERB-MNT
changed:        unread@ripe.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

% Tags relating to '194.243.227.240 - 194.243.227.255'
% RIPE-USER-RESOURCE

inetnum:        194.16.216.176 - 194.16.216.183
netname:        SE-CARLSTEINS
descr:          CARLSTEINS TRAFIK AB
org:            ORG-CTA17-RIPE
country:        SE
admin-c:        DUMY-RIPE
tech-c:         DUMY-RIPE
status:         ASSIGNED PA
notify:         mntripe@telia.net
mnt-by:         TELIANET-LIR
changed:        unread@ripe.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

I want to parse every block that starts with % Tags relating to.

From each block I want to extract the inetnum and the first descr.

This is what I have so far (updated):

[code block not preserved in the source]

2 Answers

If you only want the first descr:

import re

# Capture the first descr value, then consume any descr lines that directly follow it.
r = re.compile(
        r'descr:\s+(.*?)\n(?:descr:.*\n)*',
        re.IGNORECASE)

If you need the inetnum and the first descr:

[code block not preserved in the source]

I have to admit that I do not use % Tags relating to; I assume that all descr lines within an object are consecutive.
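One way to combine the regex idea with a file this large is to apply the patterns to one object at a time instead of to the whole file. The sketch below is not the answerer's original snippet; it assumes that objects are separated by blank lines, as in the sample above, and the names iter_paragraphs, extract, INETNUM_RE and DESCR_RE are made up for illustration.

```python
import io
import re

# Both patterns are anchored to the start of a line; search() returns the
# first match, so DESCR_RE picks up the first descr of a block.
INETNUM_RE = re.compile(r"^inetnum:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)
DESCR_RE = re.compile(r"^descr:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def iter_paragraphs(lines):
    """Group an iterable of lines into blank-line-separated blocks of text."""
    buf = []
    for line in lines:
        if line.strip():
            buf.append(line)
        elif buf:
            yield "".join(buf)
            buf = []
    if buf:
        yield "".join(buf)


def extract(lines):
    """Yield (inetnum, first descr or None) for every block with an inetnum."""
    for block in iter_paragraphs(lines):
        inetnum = INETNUM_RE.search(block)
        if inetnum is None:
            continue  # comment blocks such as "% Tags relating to ..." land here
        descr = DESCR_RE.search(block)
        yield (
            inetnum.group(1).strip(),
            descr.group(1).strip() if descr else None,
        )


if __name__ == "__main__":
    sample = io.StringIO(
        "% Tags relating to '217.89.104.48 - 217.89.104.63'\n"
        "% RIPE-USER-RESOURCE\n"
        "\n"
        "inetnum:        194.243.227.240 - 194.243.227.255\n"
        "netname:        PRINCESINDUSTRIEALIMENTARI\n"
        "descr:          PRINCES INDUSTRIE ALIMENTARI\n"
        "descr:          Provider Local Registry\n"
    )
    for record in extract(sample):
        print(record)
```

Because extract accepts any iterable of lines, an open file object can be passed in directly, and only one block is held in memory at a time.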

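Because the file is larger than 4 GB, you do not want to read it all at once with f.read().

Instead, use the file object as an iterator: iterating over a file yields one line at a time.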
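To make the difference concrete, here is a minimal sketch of line-by-line iteration. The helper name count_prefixed_lines is hypothetical, and the utf-8 / errors="replace" settings are an assumption about the file rather than something stated in the question.

```python
import os
import tempfile


def count_prefixed_lines(path, prefix):
    """Count lines that start with `prefix`, holding one line in memory at a time."""
    count = 0
    # errors="replace" is an assumption: it keeps a stray undecodable byte
    # from aborting a long run; adjust the encoding to the real file.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:  # the file object yields the next line on each iteration
            if line.startswith(prefix):
                count += 1
    return count


if __name__ == "__main__":
    # Small stand-in file; the real input would be the 4 GB dump.
    with tempfile.NamedTemporaryFile("w", suffix=".db", delete=False) as tmp:
        tmp.write("inetnum: 10.0.0.0 - 10.0.0.7\ndescr: A\n\n")
        tmp.write("inetnum: 10.0.0.8 - 10.0.0.15\ndescr: B\n")
    try:
        print(count_prefixed_lines(tmp.name, "inetnum:"))
    finally:
        os.remove(tmp.name)
```

Memory use stays roughly constant regardless of file size, because no more than the current line (plus the read buffer) is kept around.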

The following generator should do the job:

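def parse(filename):
    """Yield one dict per '% Tags relating to' block, reading line by line."""
    current = None
    with open(filename) as f:
        for line in f:
            if line.startswith("% Tags relating to"):
                # A new block starts: emit the previous one, if any.
                if current is not None:
                    yield current
                current = {}
            elif current is None:
                continue  # ignore anything before the first block header
            elif line.startswith("inetnum:"):
                current["inetnum"] = line.split(":", 1)[1].strip()
            elif line.startswith("descr:") and "descr" not in current:
                # Keep only the first descr of the block.
                current["descr"] = line.split(":", 1)[1].strip()
    # Emit the last block, which has no following header to trigger it.
    if current is not None:
        yield current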

You can use it as follows:

[code block not preserved in the source]

Result for the test file:

{'inetnum': '194.243.227.240 - 194.243.227.255', 'descr': 'PRINCES INDUSTRIE ALIMENTARI'}
{'inetnum': '194.16.216.176 - 194.16.216.183', 'descr': 'CARLSTEINS TRAFIK AB'}
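A self-contained sketch of how such a generator might be consumed is shown below. It is an illustration, not the answerer's code: it repeats a condensed variant of parse so that it runs on its own, writes a shortened copy of the sample to a temporary file, and uses the hypothetical helper preview to look at only the first few records.

```python
import itertools
import os
import tempfile

# Shortened copy of the sample from the question.
SAMPLE = """\
% Tags relating to '217.89.104.48 - 217.89.104.63'
% RIPE-USER-RESOURCE

inetnum:        194.243.227.240 - 194.243.227.255
netname:        PRINCESINDUSTRIEALIMENTARI
descr:          PRINCES INDUSTRIE ALIMENTARI
descr:          Provider Local Registry

% Tags relating to '194.243.227.240 - 194.243.227.255'
% RIPE-USER-RESOURCE

inetnum:        194.16.216.176 - 194.16.216.183
netname:        SE-CARLSTEINS
descr:          CARLSTEINS TRAFIK AB
"""


def parse(filename):
    """Condensed variant of the generator above; keeps the first inetnum and descr."""
    current = None
    with open(filename) as f:
        for line in f:
            if line.startswith("% Tags relating to"):
                if current is not None:
                    yield current
                current = {}
            elif current is not None:
                key, _, value = line.partition(":")
                if key in ("inetnum", "descr") and key not in current:
                    current[key] = value.strip()
    if current is not None:
        yield current


def preview(filename, limit):
    """Return at most `limit` records without reading the rest of the file."""
    return list(itertools.islice(parse(filename), limit))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".db", delete=False) as tmp:
        tmp.write(SAMPLE)
    try:
        for record in parse(tmp.name):
            print(record)
    finally:
        os.remove(tmp.name)
```

Since parse is a generator, the loop handles each record as soon as its block has been read, and preview stops pulling lines from the file once it has enough records.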
