Counting occurrences of a word in a column of a TSV file with Python
This is a question from a Python beginner! I have a TSV file that looks roughly like this:
WHI5 YOR083W CDC28 YBR160W physical interactions 19823668
WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
WHI5 YOR083W SWI4 YER111C physical interactions 15210110
WHI5 YOR083W SWI4 YER111C physical interactions 15210111
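I want to count all rows that share the same word in column 3, keep only the row where that word first appears, and put the number of occurrences in a new column: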
WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
So far I have tried combining 'csv' with 'Counter', and 'pandas' with 'Counter', but neither worked...
1 Answer
Using the pandas library:
>>> import pandas as pd
>>> from io import StringIO  # the sample is a str, so StringIO is needed on Python 3
>>> # sep=r"\s+" splits on runs of whitespace; the quotes keep "physical interactions" in one field
>>> df = pd.read_table(StringIO("""\
... col1 col2 col3 col4 col5 col6
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 19823668
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 21658602
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 24186061
... WHI5 YOR083W RPD3 YNL330C "physical interactions" 19823668
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210110
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210111"""),
... sep=r"\s+")
The pandas DataFrame then looks like this:
>>> df
col1 col2 col3 col4 col5 col6
0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668
1 WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
2 WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110
5 WHI5 YOR083W SWI4 YER111C physical interactions 15210111
[6 rows x 6 columns]
To get the counts, group by col3 and compute the length of each group:
>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
col1 col2 col3 col4 col5 col6 cnt
0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
1 WHI5 YOR083W CDC28 YBR160W physical interactions 21658602 3
2 WHI5 YOR083W CDC28 YBR160W physical interactions 24186061 3
3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
5 WHI5 YOR083W SWI4 YER111C physical interactions 15210111 2
[6 rows x 7 columns]
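As a cross-check on the per-row counts, here is a minimal sketch of another way to get the same column without groupby. It assumes a small stand-in frame that contains only col3; the idea is that value_counts() gives one count per distinct value, and map() looks that count up for every row.

```python
import pandas as pd

# Stand-in for the frame built above; only col3 matters for the count.
df = pd.DataFrame({"col3": ["CDC28", "CDC28", "CDC28", "RPD3", "SWI4", "SWI4"]})

# value_counts() returns a Series indexed by the distinct col3 values,
# so map() can look up the count that belongs to each row.
df["cnt"] = df["col3"].map(df["col3"].value_counts())

print(df["cnt"].tolist())  # [3, 3, 3, 1, 2, 2]
```

Either form leaves the original row order untouched, which matters for picking the first row of each group afterwards.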
To select the first row of each group:
>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
col1 col2 col3 col4 col5 col6 cnt
col3
CDC28 0 WHI5 YOR083W CDC28 YBR160W physical interactions 19823668 3
RPD3 3 WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
SWI4 4 WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2
[3 rows x 7 columns]
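Putting the two steps together for a real tab-separated file might look like the sketch below. It rests on a few assumptions that are not in the question: the file has no header row (so pandas numbers the columns 0 to 5 and the third column is 2), the fields are separated by single tabs, and an in-memory string stands in for the file. It also swaps groupby().apply(head) for drop_duplicates, which keeps the first row per value without adding the extra col3 index level seen above.

```python
from io import StringIO

import pandas as pd

# Stand-in for the real file; pass the file path to read_csv instead.
tsv = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t21658602\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t24186061\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210110\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210111\n"
)

# header=None: the sample has no header row, so columns are labelled 0..5.
# Splitting on tabs keeps "physical interactions" as a single field.
df = pd.read_csv(StringIO(tsv), sep="\t", header=None)

# Number of rows sharing each value of the third column (label 2).
df["cnt"] = df.groupby(2)[2].transform("size")

# Keep only the first row for each distinct value of the third column.
first = df.drop_duplicates(subset=2, keep="first")

# Serialise back to tab-separated text without index or header.
out = first.to_csv(sep="\t", index=False, header=False)
print(out)
```

Passing a file name as the first argument of to_csv would write the result to disk instead of returning it as a string.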