Matching values in nested dictionaries

Asked 2025-04-18 11:01

I have two dictionaries, each containing nested sub-dictionaries, structured like this:

search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654}
}

database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659}
}

I need to compare the two dictionaries and find the entries in database_variants whose position falls within the range of an entry in search_regions.

I am writing a function to do this (linked to a previous question). The code I have so far is:

def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    # Match on Chr value
    #     where the Start value from database_Variants is between the
    #     Start and End values in search_Variants.
    # Return as nested dictionary

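The problem I am stuck on is how to access the values in the nested dictionaries (Chr, Start, End and so on) in order to compare them. I would like to do this with a comprehension, because I have a lot of data to process and a plain for loop may be slow.

Any help is much appreciated!

Update

I tried to implement the solution suggested by bioinfoboy below. My first step was to convert the search_regions and database_variants dictionaries into defaultdict(list) objects, using the following functions: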

def search_region_converter(searchDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching with the database in a corresponding
    format'''
    search_regions = defaultdict(list)
    for i in search_regions.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        end = int(i.split(":")[1].split("-")[1])
        search_regions[chromosome].append((start, end))
    return search_regions #a list with chromosomes as keys 

def database_snps_converter(databaseDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching with the search_snps in a
    corresponding format'''
    database_variants = defaultdict(list)
    for i in database_variants.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        database_variants[chromosome].append(start)
    return database_variants #list of database variants 

Then I wrote a matching function based on bioinfoboy's code:

def region_to_variant_location_match(search_Regions, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.'''
    for key, values in database_Variants.items():
        for value in values:
            for search_area in search_Regions[key]:
                print(search_area)
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield(key, search_area)

However, the defaultdict converter functions return empty dictionaries, and I can't work out what needs to change.

Any ideas?

2 answers

1

You probably need to do something like the following:

def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    return {
        record[0]: record[1]
        for record, lookup in zip(
            database_Variants.items(),
            search_Variants.items()
        )
        if (
            record[1]['Chr'] == lookup[1]['Chr'] and 
            lookup[1]['Start'] <= record[1]['Start'] <= lookup[1]['End']
        )
    }

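Note that if you are using Python 2.7 or earlier (rather than Python 3), you should use iteritems() instead of items() and itertools.izip() instead of zip. If you are on Python 2.6 or earlier, you will also need to pass a generator expression to dict() rather than using a dict comprehension, since dict comprehensions were only added in Python 2.7.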
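One caveat with the comprehension above: zip pairs the first variant with the first region, the second with the second, and so on, so it only finds a match when the two dictionaries happen to line up entry for entry. Below is a minimal sketch that instead checks each variant against every region. It assumes the 'Chr', 'Start' and 'End' keys shown in the question; the function name match_variants_to_regions is made up for illustration.

```python
def match_variants_to_regions(search_regions, database_variants):
    """Return the variants whose Start lies inside any region on the same chromosome."""
    return {
        name: variant
        for name, variant in database_variants.items()
        if any(
            variant['Chr'] == region['Chr']
            and region['Start'] <= variant['Start'] <= region['End']
            for region in search_regions.values()
        )
    }


# Sample data copied from the question.
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
}

matches = match_variants_to_regions(search_regions, database_variants)
for name in sorted(matches):
    print(name, matches[name])
```

This does one pass over the regions for every variant, which is simple but can be slow for large inputs; grouping the regions by chromosome first, as the other answer does, reduces the number of comparisons.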

1

I think this may help you.

I am converting your search_regions and database_variants in the way I described in my comment:

from collections import defaultdict

_database_variants = defaultdict(list)
_search_regions = defaultdict(list)

# Group variant start positions by chromosome.
for i in database_variants.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _database_variants[_chromosome].append(_start)

# Group (start, end) region tuples by chromosome.
for i in search_regions.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _end = int(i.split(":")[1].split("-")[1])
    _search_regions[_chromosome].append((_start, _end))

def _search(_database_variants, _search_regions):
    for key, values in _database_variants.items():
        for value in values:
            for search_area in _search_regions[key]:
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield(key, search_area)

I used yield, so the function returns a generator object that you can iterate over. With the data originally given in your question, I get the output below.

for i in _search(_database_variants, _search_regions):
    print(i)

This prints:

('chr11', (56694718, 71838208))
('chr13', (27185654, 39682032))

Isn't this what you are trying to achieve?
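On the empty dictionaries mentioned in the question's update: the likely cause is that both converter functions loop over the defaultdict they have just created inside the function (which is still empty) rather than over the argument passed in (searchDict or databaseDict). Here is a minimal sketch of the two converters with that changed. The function names are kept from the question; the helper parse_region_key is a hypothetical name introduced here, and it assumes every key has the form 'chr:start-end'.

```python
from collections import defaultdict


def parse_region_key(key):
    """Split a key such as 'chr11:56694718-71838208' into (chromosome, start, end)."""
    chromosome, span = key.split(":")
    start, end = span.split("-")
    return chromosome, int(start), int(end)


def search_region_converter(searchDict):
    """Group (start, end) tuples by chromosome."""
    regions = defaultdict(list)
    for key in searchDict:  # iterate over the argument, not the new empty defaultdict
        chromosome, start, end = parse_region_key(key)
        regions[chromosome].append((start, end))
    return regions


def database_snps_converter(databaseDict):
    """Group variant start positions by chromosome."""
    variants = defaultdict(list)
    for key in databaseDict:  # same fix: loop over the input dictionary
        chromosome, start, _end = parse_region_key(key)
        variants[chromosome].append(start)
    return variants


# Sample data copied from the question.
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
}

converted_regions = search_region_converter(search_regions)
converted_variants = database_snps_converter(database_variants)
print(dict(converted_regions))
print(dict(converted_variants))
```

Using a different name for the local defaultdict than for the module-level dictionary also makes this kind of shadowing mistake easier to spot.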
