Matching an array of strings against a 2D array

longest_gene_name = len(max(gene_name_list, key=len))
ensembl_list = np.full((len(gene_name_list)), '', dtype='U{}'.format(longest_gene_name))
for idx, gene_name in enumerate(gene_name_list):
    for row in fully_split:
        if gene_name in row:
            ensembl_list[idx] = row[0]

2 Answers

User

Answer 1 · edited 2024-05-23 18:39:14

Execution time aside, I think the brute-force approach you posted does not match what you describe in words:

I need to find rows in another 2D array where each string of the first array is present.

At most, your code finds the rows of the 2D array in which at least one string of the 1D array is present.
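To make the difference concrete, here is a minimal sketch on made-up data. The lists and IDs below are hypothetical and only stand in for the arrays in the question. It contrasts "at least one name is present" with "every name is present".

```python
# Hypothetical toy data; plain lists stand in for the NumPy arrays.
gene_name_list = ['a', 'bb']
fully_split = [
    ['id1', 'a', 'bb'],   # contains both names
    ['id2', 'a', 'zz'],   # contains only 'a'
    ['id3', 'xx', 'yy'],  # contains neither name
]

# Rows where at least one name is present.
any_ids = [row[0] for row in fully_split
           if any(name in row for name in gene_name_list)]

# Rows where every name is present (what the quoted sentence asks for).
all_ids = [row[0] for row in fully_split
           if all(name in row for name in gene_name_list)]

print(any_ids)  # ['id1', 'id2']
print(all_ids)  # ['id1']
```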

The code below uses a regex to do what you ask for in words.

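import re
import numpy as np

# Sort the names so their order agrees with the sorted rows, then join with '.*'
pattern = r'.*'.join(map(re.escape, np.sort(gene_name_list)))
# Sort each row the same way and concatenate it into one string
rows = [''.join(np.sort(x)) for x in fully_split]
res = [re.search(pattern, r) for r in rows]

Because order is irrelevant, gene_name_list is sorted lexicographically and its strings are joined using the regex '.*' (any run of characters) as the delimiter. This is the pattern that will be searched for.
Then each row of the 2D array fully_split is likewise sorted lexicographically and its strings are concatenated into a single string. A regex search is run on each row to check whether there is a match.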

res is a list: rows where no match was found give None, and rows where a match was found give the corresponding match object.

This illustrates the concept. To get closer to the expected result (storing the first element of the row), replace the last line with:

res = [l[0] if re.search(pattern, r) else None for r, l in zip(rows, fully_split)]
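One caveat: the regex runs over concatenated strings, so a name can also match inside a longer string (a row containing 'abbc' could satisfy a pattern built from 'a', 'bb' and 'c'). If that matters, a set-based check is one possible alternative. The sketch below is not the approach described above, and the data in it is hypothetical.

```python
# Hypothetical data; plain lists stand in for the arrays in the question.
gene_name_list = ['a', 'bb', 'c']
fully_split = [
    ['id1', 'c', 'a', 'bb', 'x'],    # all three names present, in any order
    ['id2', 'a', 'bb', 'x', 'y'],    # 'c' is missing
    ['id3', 'abbc', 'x', 'y', 'z'],  # names occur only inside a longer string
]

required = set(gene_name_list)

# Keep the row's first element only when every required name
# is an element of the row; otherwise store None.
res = [row[0] if required.issubset(row) else None for row in fully_split]
print(res)  # ['id1', None, None]
```

Because membership is tested on whole elements, no sorting or delimiter is needed.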

User
Answer 2 · edited 2024-05-23 18:39:14

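Based on your description, I made a few assumptions:
- The 2D array is rectangular (i.e. not dtype=object); otherwise NumPy's performance will be of no use.
- len(fully_split) == len(gene_name_list), because your code sample has ensembl_list[idx] = row[0], where idx is derived from gene_name_list.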
>>> import numpy as np
>>> gene_name_list = np.array('a bb c d eee'.split())
>>> fully_split = np.array([
...     'id1 a bb c d eee'.split(),  # yes
...     'id2 f g hh iii j'.split(),
...     'id3 kk ll a nn o'.split(),  # yes
...     'id4 q rr c t eee'.split(),  # yes
...     'id5 v www xx y z'.split()
... ])
>>> longest_gene_name = len(max(gene_name_list, key=len))
>>> dtype = 'U{}'.format(longest_gene_name)
>>> ensembl_list = np.zeros_like(gene_name_list, dtype=dtype)
>>> mask = np.isin(fully_split, gene_name_list).any(axis=1)
>>> ensembl_list[mask] = fully_split[mask, 0]
>>> ensembl_list
array(['id1', '', 'id3', 'id4', ''], dtype='<U3')
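The mask above marks rows that contain at least one of the names. If the goal is rows where every name is present, one possible variant is sketched below on hypothetical data. It also takes the result dtype from fully_split rather than from the longest gene name: as far as I know, NumPy silently truncates strings assigned into a narrower 'U' array, so IDs longer than the longest gene name would be cut off.

```python
import numpy as np

# Hypothetical data with IDs that are longer than any gene name.
gene_name_list = np.array('a bb c'.split())
fully_split = np.array([
    'identifier1 a bb c d'.split(),  # all names present
    'identifier2 a bb x y'.split(),  # 'c' is missing
    'identifier3 c q bb a'.split(),  # all names present, different order
])

# For each row, check that every gene name occurs somewhere in that row.
mask = np.array([np.isin(gene_name_list, row).all() for row in fully_split])

# Reuse the string width of fully_split so the IDs fit without truncation.
ensembl_list = np.zeros(len(fully_split), dtype=fully_split.dtype)
ensembl_list[mask] = fully_split[mask, 0]
print(ensembl_list.tolist())  # ['identifier1', '', 'identifier3']
```

The per-row loop keeps the sketch easy to read. For large inputs a fully vectorized comparison may be faster, at the cost of more memory.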
