Parsing a large database (over 4 GB) with Python

% Tags relating to '217.89.104.48 - 217.89.104.63'
% RIPE-USER-RESOURCE
inetnum: 194.243.227.240 - 194.243.227.255
netname: PRINCESINDUSTRIEALIMENTARI
remarks: INFRA-AW
descr: PRINCES INDUSTRIE ALIMENTARI
descr: Provider Local Registry
descr: BB IBS
country: IT
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: order.manager2@telecomitalia.it
mnt-by: INTERB-MNT
changed: unread@ripe.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
% Tags relating to '194.243.227.240 - 194.243.227.255'
% RIPE-USER-RESOURCE
inetnum: 194.16.216.176 - 194.16.216.183
netname: SE-CARLSTEINS
descr: CARLSTEINS TRAFIK AB
org: ORG-CTA17-RIPE
country: SE
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: mntripe@telia.net
mnt-by: TELIANET-LIR
changed: unread@ripe.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************

2 answers

User

#1 · edited 2024-06-07 08:27:11

If you only want the first descr:

import re

# First descr value; any descr lines that directly follow are consumed, not captured.
r = re.compile(r'descr:\s+(.*?)\n(?:descr:.*\n)*',
        re.IGNORECASE)

If you need the inetnum and the first descr:

[code block missing from the source]
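One possible pattern for this case, as a minimal sketch and not the answerer's original code. `RECORD_RE` and the `sample` string are made-up names for illustration, and the pattern assumes that each attribute sits on its own line and that a blank line ends an object. Like any regex over a string, it needs the text in memory, so on a multi-gigabyte file it would have to be applied to chunks, not to the whole file at once.

```python
import re

# Hypothetical pattern: capture the inetnum range, skip the lines that follow
# within the same object, then capture only the first descr value.
RECORD_RE = re.compile(
    r'^inetnum:[ \t]*(?P<inetnum>.*?)[ \t]*\n'   # the address range
    r'(?:(?!inetnum:|descr:).+\n)*?'             # non-blank lines before descr
    r'descr:[ \t]*(?P<descr>.*?)[ \t]*\n',       # first descr only
    re.IGNORECASE | re.MULTILINE,
)

# Abbreviated stand-in for the data in the question.
sample = (
    "inetnum: 194.243.227.240 - 194.243.227.255\n"
    "netname: PRINCESINDUSTRIEALIMENTARI\n"
    "descr: PRINCES INDUSTRIE ALIMENTARI\n"
    "descr: Provider Local Registry\n"
    "country: IT\n"
    "\n"
    "inetnum: 194.16.216.176 - 194.16.216.183\n"
    "netname: SE-CARLSTEINS\n"
    "descr: CARLSTEINS TRAFIK AB\n"
)

matches = [(m.group("inetnum"), m.group("descr"))
           for m in RECORD_RE.finditer(sample)]
for inetnum, descr in matches:
    print(inetnum, "->", descr)
```

An inetnum object without any descr line is simply skipped, because the scan stops at the blank line that ends the object.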

I must admit I did not use the % Tags relating to lines; I am assuming that all {} are consecutive.

User

#2 · edited 2024-06-07 08:27:11

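Since the file is larger than 4 GB, you do not want to read it all at once with f.read().

Instead, use the file object as an iterator: when you iterate over a file, you get it one line at a time.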
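To make the point about iteration concrete, here is a small sketch that scans a file while holding only one line in memory at a time. The function name `count_inetnum_lines` is hypothetical, the temporary file only stands in for the real dump, and the `latin-1` encoding is an assumption (it accepts any byte value, so decoding cannot fail on unexpected characters).

```python
import os
import tempfile


def count_inetnum_lines(path):
    """Count lines starting with 'inetnum:' without loading the whole file."""
    count = 0
    # Assumed encoding: latin-1 maps every byte to a character.
    with open(path, encoding="latin-1") as f:
        for line in f:  # the file object yields one line per iteration
            if line.startswith("inetnum:"):
                count += 1
    return count


# Tiny stand-in for the multi-gigabyte file from the question.
with tempfile.NamedTemporaryFile("w", suffix=".db", delete=False,
                                 encoding="latin-1") as tmp:
    tmp.write(
        "inetnum: 194.243.227.240 - 194.243.227.255\n"
        "descr: PRINCES INDUSTRIE ALIMENTARI\n"
        "inetnum: 194.16.216.176 - 194.16.216.183\n"
        "descr: CARLSTEINS TRAFIK AB\n"
    )
    path = tmp.name

n = count_inetnum_lines(path)
os.remove(path)
print(n)
```

Memory use stays roughly constant regardless of file size, which is what f.read() cannot offer.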

The following generator should do the job:

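def parse(filename):
    """Yield one dict per record, with its inetnum and first descr."""
    current = {}
    # Iterating over the file object reads one line at a time.
    with open(filename) as f:
        for line in f:
            if line.startswith("% Tags relating to"):
                # A tags line separates records: emit the finished one.
                if current:
                    yield current
                current = {}
            elif line.startswith("inetnum:"):
                current["inetnum"] = line.split(":", 1)[1].strip()
            elif line.startswith("descr:") and "descr" not in current:
                # Keep only the first descr of each record.
                current["descr"] = line.split(":", 1)[1].strip()
    # Emit the last record if the file does not end with a tags line.
    if current:
        yield current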

You can use it like this:

[code block missing from the source]
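A usage sketch, not the answerer's original snippet: it loops over the generator and prints each record. The generator is repeated here in tidied form only so the example runs on its own, and the temporary file with abbreviated records is an assumption standing in for the real database file.

```python
import os
import tempfile


def parse(filename):
    # Tidied copy of the generator from this answer.
    current = {}
    with open(filename) as f:
        for line in f:
            if line.startswith("% Tags relating to"):
                if current:
                    yield current
                current = {}
            elif line.startswith("inetnum:"):
                current["inetnum"] = line.split(":", 1)[1].strip()
            elif line.startswith("descr:") and "descr" not in current:
                current["descr"] = line.split(":", 1)[1].strip()
    if current:
        yield current


# Abbreviated stand-in for the test file from the question.
with tempfile.NamedTemporaryFile("w", suffix=".db", delete=False) as tmp:
    tmp.write(
        "% Tags relating to '217.89.104.48 - 217.89.104.63'\n"
        "% RIPE-USER-RESOURCE\n"
        "inetnum: 194.243.227.240 - 194.243.227.255\n"
        "descr: PRINCES INDUSTRIE ALIMENTARI\n"
        "descr: Provider Local Registry\n"
        "% Tags relating to '194.243.227.240 - 194.243.227.255'\n"
        "% RIPE-USER-RESOURCE\n"
        "inetnum: 194.16.216.176 - 194.16.216.183\n"
        "descr: CARLSTEINS TRAFIK AB\n"
    )
    path = tmp.name

records = []
for record in parse(path):  # records arrive one at a time
    print(record)
    records.append(record)  # collected here only because the sample is tiny

os.remove(path)
```

On the real file you would process each record inside the loop instead of collecting them in a list, so memory use stays flat.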

Result on the test file:

{'inetnum': '194.243.227.240 - 194.243.227.255', 'descr': 'PRINCES INDUSTRIE ALIMENTARI'}
{'inetnum': '194.16.216.176 - 194.16.216.183', 'descr': 'CARLSTEINS TRAFIK AB'}
