不使用数据库从表中获取项目列表

Col5a2 NM_007737 chr1 - 45447828 45447829 Slc40a1 NM_016917 chr1 - 45870140 45870141 Gm3852 NM_001177356 chr1 - 45956809 45956810 Slc39a10 NM_172653 chr1 - 46798055 46798056 Obfc2a NM_028696 chr1 - 51422944 51422945 Myo1b NM_001161817,NM_010863 chr1 - 51860519 51860520 . . .

Slc39a10 NM_172653 chr1 - 46798055 46798056 Myo1b NM_001161817,NM_010863 chr1 - 51860519 51860520 Col5a2 NM_007737 chr1 - 45447828 45447829 Obfc2a NM_028696 chr1 - 51422944 51422945 . . .

3条回答

网友

1楼 · 编辑于 2024-04-19 10:13:10

如果只对表的第一列进行非常简单的查找，那么pythondict可能就足够了。你知道吗

像这样建造：

table = {}
with open(table_file) as f:
    for line in f:
        row = line.split()
        table[row[0]] = row

然后，您可以使用列表来“加入”此词典：

results = [table[key] for key in keys_list]

或者，如果第二个列表也是一个数据文件，则可以执行以下操作：

with open(second_file) as f:
    results = [table[line.strip] for line in f]

网友

2楼 · 编辑于 2024-04-19 10:13:10

使用两个标准unix工具join(1)和sort(1)确实可以实现这一点：

$ join <(sort table) <(sort list)

Col5a2 NM_007737 chr1 - 45447828 45447829
Myo1b NM_001161817,NM_010863 chr1 - 51860519 51860520
Obfc2a NM_028696 chr1 - 51422944 51422945
Slc40a1 NM_016917 chr1 - 45870140 45870141

需要调用sort，因为（从join man page）：

Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use 'sort -k 1b,1' if 'join' has no options. Note, comparisons honor the rules specified by 'LC_COLLATE'. If the input is not sorted and some lines cannot be joined, a warning message will be given.

更新：受this answer启发的解决方案，保持秩序：

$ join -1 2 <(cat -n list | sort -k2,2) <(sort table) | sort -nk2,2 | cut -d\  -f1,3-

网友

3楼 · 编辑于 2024-04-19 10:13:10

这里有一种使用awk的方法：

awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list

或使用格式：

awk 'FNR==NR { a[$1]=$0; next } $1 in a { print a[$1] }' table list | column -t

结果：

Slc40a1  NM_016917               chr1  -  45870140  45870141
Myo1b    NM_001161817,NM_010863  chr1  -  51860519  51860520
Col5a2   NM_007737               chr1  -  45447828  45447829
Obfc2a   NM_028696               chr1  -  51422944  51422945

说明：

'FNR==NR { ... }' is a conditional that is only true for the first file in the argument list.
So for each line in the file called 'table', the first column ($1) is added to an array (called 'a') and this is assigned the value of the whole line ($0). 'next' then skips to remainder of the code and jumps to the next line of input, until all lines in the 'table' file have been processed.
“$1 in a”是另一个条件。你知道吗
This is asking if column one of the 'list' file is a key in the array. If it is, then print out the value of column one that we just stored in the array (a[$1]).

相关问题更多 >

编程相关推荐

热门问题

热门文章