How to filter a DataFrame using partial matches from another DataFrame

import pandas as pd

promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())

AssociatedGeneName  B            C      D  E                                    F
plexB_1             NC_004353.3  64381  -  Drosophila melanogaster (Fruit fly)  region
ci_1                NC_004353.3  76925  -  Drosophila melanogaster (Fruit fly)  region
RS3A_1              NC_004353.3  87829  -  Drosophila melanogaster (Fruit fly)  region
pan_1               NC_004353.3  89986  +  Drosophila melanogaster (Fruit fly)  region
pan_2               NC_004353.3  90281  +  Drosophila melanogaster (Fruit fly)  region

data = pd.read_csv('FBgn with gene name.csv')
print(data.head())

Gene     AssociatedGeneName  FBgn Number  timepoint
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10006  CG10006             FBgn0036461  2

x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]

Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    x = promoter[promoter['AssociatedGeneName'].str.contains(data['Associated Gene Name'])]
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 1226, in contains
    na=na, regex=regex)
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 203, in str_contains
    regex = re.compile(pat, flags=flags)
  File "C:\Python34\lib\re.py", line 219, in compile
    return _compile(pattern, flags)
  File "C:\Python34\lib\re.py", line 278, in _compile
    return _cache[type(pattern), pattern, flags]
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 663, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed

2 Answers

User

Answer 1 · edited 2024-06-12 05:13:57

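str.contains takes a single string (by default interpreted as a regular expression) as its argument, checks whether that string occurs in each entry of promoter.AssociatedGeneName, and returns True or False for each index (row).

However, when you pass data.AssociatedGeneName to str.contains, you are passing a pandas.Series rather than a string, and that is why you get the error.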
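To make the difference concrete, here is a minimal sketch. The two small frames are made-up stand-ins for the CSV files (the values are illustrative, not taken from the real data), and joining the names into one regular expression is an alternative technique, not the approach this answer goes on to describe.

```python
import re
import pandas as pd

# Illustrative stand-ins for the two CSV files (made-up values).
promoter = pd.DataFrame({'AssociatedGeneName': ['plexB_1', 'ci_1', 'pan_1', 'pan_2']})
data = pd.DataFrame({'AssociatedGeneName': ['pan', 'pan', 'ci', 'fkh']})

# A single string works: one boolean per row of promoter.
single = promoter['AssociatedGeneName'].str.contains('pan', regex=False)

# To test many names in one call, join the escaped unique names into
# a single regex alternation such as 'pan|ci|fkh'.
names = data['AssociatedGeneName'].dropna().unique()
pattern = '|'.join(re.escape(name) for name in names)
matched = promoter[promoter['AssociatedGeneName'].str.contains(pattern)]
print(matched)
```

Escaping with re.escape matters because gene names may contain characters that have a special meaning in regular expressions.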

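If you only need the rows of promoter that have a partial match, you can do:

import numpy as np  # the bare `where` in the original is numpy.where

where_inds_par = [np.where(promoter.AssociatedGeneName.str.contains(partial))[0]
                  for partial in data.AssociatedGeneName]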

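Now each element of where_inds_par is itself an array of row positions with length >= 0. Because your data.AssociatedGeneName column contains repeated values, the same positions will show up more than once, but you can filter the duplicates out with a set and a list comprehension.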
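One way the deduplication step could be written is sketched below. This is not necessarily the answerer's own code; the sample frames and the names unique_inds and filtered are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the two CSV files (made-up values).
promoter = pd.DataFrame({'AssociatedGeneName': ['plexB_1', 'ci_1', 'pan_1', 'pan_2']})
data = pd.DataFrame({'AssociatedGeneName': ['pan', 'pan', 'ci', 'fkh']})

# One array of matching row positions per name in data.
where_inds_par = [
    np.where(promoter.AssociatedGeneName.str.contains(partial, regex=False))[0]
    for partial in data.AssociatedGeneName
]

# Flatten through a set comprehension to drop repeated positions,
# then sort so the rows keep their original order.
unique_inds = sorted({int(i) for inds in where_inds_par for i in inds})

# np.where returns positions, so select with iloc rather than loc.
filtered = promoter.iloc[unique_inds]
print(filtered)
```

Calling data.AssociatedGeneName.unique() before the loop would avoid most of the repeated work in the first place.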

User

Answer 2 · edited 2024-06-12 05:13:57

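First create a function that checks whether a value from promoter contains a value from data as a substring. It tests every value in data in turn:

def contain_partial(x, y=data.AssociatedGeneName):
    # x is one promoter name; y holds the gene names to look for.
    res = []
    for z in y:
        res.append(z in x)  # True if gene name z is a substring of x
    return res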

For each promoter value, the function returns a list of booleans with one entry per value in data.
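The name contains used in the final step is not defined anywhere in the text as it stands. Presumably it holds the result of applying contain_partial to every promoter name; the sketch below shows that under this assumption, again with made-up sample frames.

```python
import pandas as pd

# Illustrative stand-ins for the two CSV files (made-up values).
promoter = pd.DataFrame({'AssociatedGeneName': ['plexB_1', 'ci_1', 'pan_1', 'pan_2']})
data = pd.DataFrame({'AssociatedGeneName': ['pan', 'pan', 'ci', 'fkh']})

def contain_partial(x, y=data.AssociatedGeneName):
    # One boolean per gene name in y: is it a substring of x?
    return [z in x for z in y]

# Assumed definition of `contains`: a Series holding one list of
# booleans for each row of promoter.
contains = promoter.AssociatedGeneName.apply(contain_partial)
print(contains)
```

Each list has as many entries as data has rows, so for large inputs this builds a rows-by-rows grid of booleans and can become slow.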

Finally, check whether at least one value in each list is True, and use the resulting boolean mask to filter promoter:

promoter[contains.apply(any)]
