在字典范围内查找值 - python

0 投票

5 回答

3644 浏览

提问于 2025-04-17 18:58

我正在比较两个文件，第一个文件有一个标识符列、起始值和结束值。第二个文件包含对应的标识符和另一个值列。

举个例子：

文件1：

A     200     900
A     1000    1200
B     100     700
B     900     1000

文件2：

我想找出第二个文件中所有在第一个文件范围内的值，这样我的输出结果会是：

目前我已经从第一个文件创建了一个字典，里面有一个范围的列表：

举个例子：

if Identifier in Dictionary:
    Dictionary[Identifier].extend(range(Start, (End+1)))
else:
    Dictionary[Identifier] = range(Start, (End+1))

然后我会遍历第二个文件，查找这个字典中的值：

举个例子：

if Identifier in Dictionary:
    if Value in Dictionary[Identifier]:
    OutFile.write(Line + "\n")

虽然这种方法对于相对较小的文件有效，但我有几个大文件，这个程序运行起来非常慢。我需要优化我的程序，让它运行得更快。

性能优化数据处理字典查找文件比较大数据数据范围迭代算法识别符

5 个回答

因为你的数据范围很大，而你的问题基本上就是一堆比较，所以存储一个开始和结束的元组（也就是两个数字）几乎肯定比存储整个范围要快（尤其是如果有两个范围重叠的话，你现在的方式会重复存储很多数字）。

# Building the dict
if not ident in d:
    d[ident] = (lo, hi)
else:
    old_lo, old_hi = d[ident]
    d[ident] = (min(lo, old_lo), max(hi, old_hi))

那么你的比较就可以变得简单：

# comparing...
if ident in d:
    if d[ident][0] <= val <= d[ident][1]:
        outfile.write(line+'\n')

如果你不单独检查 if ident in d，这两部分的速度都会更快。Python 的字典非常快速，所以一开始就直接调用它就行。你可以给字典设置默认值，所以要好好利用这个功能。我没有具体测试过这个方法的速度提升，但肯定会有一些效果，而且这个方法确实有效：

# These both make use of the following somewhat silly hack:
# In Python, None is treated as less than everything (even -float('inf))
# and empty containers (e.g. (), [], {}) are treated as greater than everything.
# So we use the tuple ((), None) as if it was (float('inf'), float('-inf))

for line in file1:
    ident, lo, hi = line.split()
    lo = int(lo)
    hi = int(hi)
    old_lo, old_hi = d.get(ident, ((), None))
    d[ident] = (min(lo, old_lo), max(hi, old_hi))

# comparing:
for line in file2:
    ident, val = line.split()
    val = int(val)
    lo, hi = d.get(ident, ((), None))
    if lo <= val <= hi:
        outfile.write(line) # unless you stripped it off, this still has a \n

上面的代码是我用来测试的；它在一个包含一百万行的 file2 文件上运行只需要几秒钟。

回答于 2025-04-17 由 Python大师

分享举报

你可以试试这样做：

In [27]: ranges=defaultdict(list)

In [28]: with open("file1") as f:
    for line in f:
        name,st,end=line.split()
        st,end=int(st),int(end)
        ranges[name].append([st,end])
   ....:         

In [30]: ranges
Out[30]: defaultdict(<type 'list'>, {'A': [[200, 900], [1000, 1200]], 'B': [[100, 700], [900, 1000]]})

In [29]: with open("file2") as f:
    for line in f:
        name,val=line.split()
        val=int(val)
        if any(y[0]<=val<=y[1] for y in ranges[name]):
            print name,val
   ....:             
A 200
A 250
B 100
B 150

回答于 2025-04-17 由 Python大师

分享举报

from collections import defaultdict
ident_ranges = defaultdict(list)
with open('file1.txt', 'r') as f1
    for row in f1:
        ident, start, end = row.split()
        start, end = int(start), int(end)
        ident_ranges[ident].append((start, end))
with open('file2.txt', 'r') as f2, open('out.txt', 'w') as output:  
    for line in f2:
        ident, value = line.split()
        value = int(value)
        if any(start <= value <= end for start, end in ident_ranges[ident]):
            output.write(line)

注意事项: 使用 defaultdict 可以让你在往字典里添加范围时，不用先检查这个键是否存在。使用 any 可以让范围检查更高效，避免不必要的计算。链式比较是Python中的一个很方便的语法快捷方式，比如 start <= value <= end 这样的写法。

回答于 2025-04-17 由 Python大师

分享举报

在字典范围内查找值 - python

5 个回答

撰写回答