Matching values in nested dictionaries
I have two dictionaries, each containing nested sub-dictionaries, structured like this:
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654}
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659}
}
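I need to compare the two dictionaries and find the entries in database_variants whose range falls within the range of an entry in search_regions.
I am writing a function to do this (linked to a previous question). What I have so far is: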
def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    # Match on Chr value
    # where the Start value from database_Variants is between the Start and End
    # values in search_Variants.
    # Return as nested dictionary
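The problem I have is how to access the values inside the nested dictionaries (Chr, Start, End and so on) in order to compare them. I would like to do this with a list comprehension, because I have a lot of data to process and a plain for loop might be slow.
Any help is much appreciated!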
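For reference, a nested value is reached by chaining keys: the outer key selects the inner dictionary, and the inner key selects the field. The snippet below is only a minimal sketch using one entry from each of the sample dictionaries above; the variable names `region`, `variant`, `same_chromosome` and `inside_region` are introduced here purely for illustration.

```python
# One entry from each sample dictionary in the question.
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
}

# The outer key returns the inner dictionary ...
region = search_regions['chr11:56694718-71838208']
variant = database_variants['chr11:56694718-56694718']

# ... and the inner key returns the field to compare.
same_chromosome = variant['Chr'] == region['Chr']
inside_region = region['Start'] <= variant['Start'] <= region['End']

print(same_chromosome and inside_region)  # True
```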
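Update
I tried to implement the solution suggested by bioinfoboy below. My first step was to convert the search_regions and database_variants dictionaries into defaultdict(list) objects, using the following functions: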
from collections import defaultdict

def search_region_converter(searchDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching
    with the database in a corresponding format'''
    search_regions = defaultdict(list)
    for i in search_regions.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        end = int(i.split(":")[1].split("-")[1])
        search_regions[chromosome].append((start, end))
    return search_regions  # lists of (start, end) tuples keyed by chromosome

def database_snps_converter(databaseDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching
    with the search regions in a corresponding format'''
    database_variants = defaultdict(list)
    for i in database_variants.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        database_variants[chromosome].append(start)
    return database_variants  # lists of variant positions keyed by chromosome
I then wrote a matching function based on bioinfoboy's code:
def region_to_variant_location_match(search_Regions, database_Variants):
    '''Take dictionaries for search_Regions and database_Variants as input.
    Match variants in database_Variants to regions within search_Regions.'''
    for key, values in database_Variants.items():
        for value in values:
            for search_area in search_Regions[key]:
                print(search_area)
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield (key, search_area)
However, the defaultdict functions return empty dictionaries, and I cannot work out what needs to change.
Any ideas?
2 Answers
1
You probably need to do something like this:
def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    return {
        record[0]: record[1]
        # zip pairs the i-th variant with the i-th region, so this relies on
        # both dictionaries listing their entries in corresponding order.
        for record, lookup in zip(
            database_Variants.items(),
            search_Variants.items()
        )
        if (
            record[1]['Chr'] == lookup[1]['Chr'] and
            lookup[1]['Start'] <= record[1]['Start'] <= lookup[1]['End']
        )
    }
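Note that if you are using Python 2.7 or earlier (rather than Python 3), you should use iteritems() instead of items() and itertools.izip() instead of zip. If you are on Python 2.6 or earlier, you also need to pass a generator expression to dict() instead of using a dict comprehension, since dict comprehensions were only added in 2.7.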
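Because zip pairs entries by position, the comprehension above only compares each variant with the region at the same position. If the two dictionaries are not aligned, or differ in length, one possible alternative is to test each variant against every region. The following is a sketch under that assumption, not a replacement endorsed by the question; the function name `match_variants_to_regions` and the extra out-of-range variant are made up for illustration.

```python
def match_variants_to_regions(search_regions, database_variants):
    """Return the variants whose Start lies inside any region on the same Chr."""
    return {
        name: variant
        for name, variant in database_variants.items()
        if any(
            variant['Chr'] == region['Chr']
            and region['Start'] <= variant['Start'] <= region['End']
            for region in search_regions.values()
        )
    }


# Regions deliberately listed in a different order from the variants.
search_regions = {
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
    # Hypothetical variant outside every region, added only for this example.
    'chr13:50000000-50000000': {'Chr': 'chr13', 'End': 50000000, 'Start': 50000000},
}

matches = match_variants_to_regions(search_regions, database_variants)
for name in matches:
    print(name)
```

This checks every variant against every region, so the cost grows with the product of the two sizes; for large inputs it may be worth grouping the regions by chromosome first.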
1
I think this might help you.
I am converting your search_regions and database_variants along the lines I mentioned in my comment.
from collections import defaultdict

_database_variants = defaultdict(list)
_search_regions = defaultdict(list)

# Group variant start positions by chromosome.
for i in database_variants.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _database_variants[_chromosome].append(_start)

# Group (start, end) search regions by chromosome.
for i in search_regions.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _end = int(i.split(":")[1].split("-")[1])
    _search_regions[_chromosome].append((_start, _end))

def _search(_database_variants, _search_regions):
    for key, values in _database_variants.items():
        for value in values:
            for search_area in _search_regions[key]:
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield (key, search_area)
I used yield, so the function returns a generator object that you can iterate over. With the data you originally gave in the question, I get the following output.
for i in _search(_database_variants, _search_regions):
    print(i)
The output is:
('chr11', (56694718, 71838208))
('chr13', (27185654, 39682032))
Isn't this what you are trying to achieve?
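Regarding the update in the question: the two converter functions appear to return empty dictionaries because each one loops over the defaultdict it has just created, which has no keys yet, instead of over the dictionary passed in as the argument. A minimal sketch of the converters with the loop pointed at the argument might look like this; the local names `regions` and `positions` are chosen here for illustration, and the key format is assumed to be always `chr:start-end` as in the sample data.

```python
from collections import defaultdict


def search_region_converter(searchDict):
    """Group (start, end) tuples by chromosome."""
    regions = defaultdict(list)
    for key in searchDict:  # iterate over the argument, not the new empty dict
        chromosome, span = key.split(":")
        start, end = (int(x) for x in span.split("-"))
        regions[chromosome].append((start, end))
    return regions


def database_snps_converter(databaseDict):
    """Group variant start positions by chromosome."""
    positions = defaultdict(list)
    for key in databaseDict:  # iterate over the argument here as well
        chromosome, span = key.split(":")
        positions[chromosome].append(int(span.split("-")[0]))
    return positions


# Sample data from the question.
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
}

converted_regions = search_region_converter(search_regions)
converted_variants = database_snps_converter(database_variants)
print(dict(converted_regions))
print(dict(converted_variants))
```

With the converters populated, the generator-based matching function from the update should then have something to iterate over.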