不使用数据库从表中获取项目列表

2024-04-19 10:13:10 发布

您现在位置:Python中文网/ 问答频道 /正文

我想在一个表上做一些简单的数据库操作而不必使用数据库软件,例如,我可以使用GitHub的“filo”包来做一些类似于“groupby”函数的事情。我想知道有没有类似的东西来实现一些简单的“连接”功能?或者我可以用Python或Bash来实现它?具体来说,我有一张这样的桌子:

Col5a2  NM_007737   chr1    -   45447828    45447829
Slc40a1 NM_016917   chr1    -   45870140    45870141
Gm3852  NM_001177356    chr1    -   45956809    45956810
Slc39a10    NM_172653   chr1    -   46798055    46798056
Obfc2a  NM_028696   chr1    -   51422944    51422945
Myo1b   NM_001161817,NM_010863  chr1    -   51860519    51860520
.
.
.

我有一张单子

Slc40a1
Myo1b
Col5a2
Obfc2a
.
.
.

我想从表中提取列表中的项目,以便得到:

Slc39a10    NM_172653   chr1    -   46798055    46798056
Myo1b   NM_001161817,NM_010863  chr1    -   51860519    51860520
Col5a2  NM_007737   chr1    -   45447828    45447829
Obfc2a  NM_028696   chr1    -   51422944    51422945
.
.
.

Tags: 函数功能github数据库软件事情groupbychr1
3条回答

如果只对表的第一列进行非常简单的查找,那么pythondict可能就足够了。你知道吗

像这样建造:

table = {}
with open(table_file) as f:
    for line in f:
        row = line.split()
        table[row[0]] = row

然后,您可以使用列表来“加入”此词典:

results = [table[key] for key in keys_list]

或者,如果第二个列表也是一个数据文件,则可以执行以下操作:

with open(second_file) as f:
    results = [table[line.strip] for line in f]

使用两个标准unix工具join(1)sort(1)确实可以实现这一点:

$ join <(sort table) <(sort list)

Col5a2 NM_007737 chr1 - 45447828 45447829
Myo1b NM_001161817,NM_010863 chr1 - 51860519 51860520
Obfc2a NM_028696 chr1 - 51422944 51422945
Slc40a1 NM_016917 chr1 - 45870140 45870141

需要调用sort,因为(从join man page):

Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use 'sort -k 1b,1' if 'join' has no options. Note, comparisons honor the rules specified by 'LC_COLLATE'. If the input is not sorted and some lines cannot be joined, a warning message will be given.

更新:this answer启发的解决方案,保持秩序:

$ join -1 2 <(cat -n list | sort -k2,2) <(sort table) | sort -nk2,2 | cut -d\  -f1,3-

这里有一种使用awk的方法:

awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list

或使用格式:

awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list | column -t

结果:

Slc40a1  NM_016917               chr1  -  45870140  45870141
Myo1b    NM_001161817,NM_010863  chr1  -  51860519  51860520
Col5a2   NM_007737               chr1  -  45447828  45447829
Obfc2a   NM_028696               chr1  -  51422944  51422945

说明:

'FNR==NR { ... }' is a conditional that is only true for the first file in the argument list.

So for each line in the file called 'table', the first column ($1) is added to an array (called 'a') and this is assigned the value of the whole line ($0). 'next' then skips to remainder of the code and jumps to the next line of input, until all lines in the 'table' file have been processed.

“$1 in a”是另一个条件。你知道吗

This is asking if column one of the 'list' file is a key in the array. If it is, then print out the value of column one that we just stored in the array (a[$1]).

相关问题 更多 >