Merge two text files into a new text file that summarizes them

Posted 2024-04-23 18:58:50

I have two text files like the examples below. I named one of them first (comma separated) and the other second (tab separated).

first

chr1,105000000,105310000,2,1,3,2
chr1,5310000,5960000,2,1,5,4
chr1,1580000,1180000,4,1,5,3
chr19,107180000,107680000,1,1,5,4
chr1,7680000,8300000,3,1,1,2
chr1,109220000,110070000,4,2,3,3
chr1,11060000,12070000,6,2,7,4

second

AKAP8L  chr19   107180100   107650000   transcript
AKAP8L  chr19   15514130    15529799    transcript
AKIRIN2 chr6    88384790    88411927    transcript
AKIRIN2 chr6    88410228    88411243    transcript
AKT3    chr1    105002000   105010000   transcript
AKT3    chr1    243663021   244006886   transcript
AKT3    chr1    243665065   244013430   transcript

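In file first, columns 2 and 3 are start and end. In file second, columns 3 and 4 are start and end, respectively. I want to create a new text file from the first and second files. In the new file, I want to count the number of rows in file second that match each row in file first, based on the following 3 conditions: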

1- the 1st column in file first is equal to the 2nd column in file second.
2- the 3rd column in file second is greater than the 2nd column in file first and smaller than the 3rd column in file first.
3- the 4th column in file second is also greater than the 2nd column in file first and smaller than the 3rd column in file first.
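
The three conditions can be read as a single predicate over one row from each file. A minimal sketch, assuming the fields have already been split into lists of strings; the function name row_matches is hypothetical, and the coordinates are converted to int so that they compare numerically rather than as text:

```python
def row_matches(first_fields, second_fields):
    """Return True if a row of `second` lies strictly inside a row of `first`.

    first_fields:  [chrom, start, end, ...] from the comma-separated file
    second_fields: [name, chrom, start, end, ...] from the tab-separated file
    (`row_matches` is an illustrative name, not part of the original post.)
    """
    chrom, start, end = first_fields[0], int(first_fields[1]), int(first_fields[2])
    s_chrom, s_start, s_end = second_fields[1], int(second_fields[2]), int(second_fields[3])
    return (
        chrom == s_chrom            # condition 1: same value in the chromosome columns
        and start < s_start < end   # condition 2: start of second inside first
        and start < s_end < end     # condition 3: end of second inside first
    )


first_row = "chr19,107180000,107680000,1,1,5,4".split(",")
inside = "AKAP8L\tchr19\t107180100\t107650000\ttranscript".split()
outside = "AKAP8L\tchr19\t15514130\t15529799\ttranscript".split()
print(row_matches(first_row, inside), row_matches(first_row, outside))
```

With the sample rows from the question, the first AKAP8L row satisfies all three conditions and the second one does not.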

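In fact, the output should look like the expected output below. The first 7 columns come directly from file first, and the 9th column is the number of rows in file second that match each row in file first (based on the 3 criteria above). The 8th column is the first column of the row from file second that first matches that particular row of file first.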
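
The two derived columns (the first matching name, then the match count) can be computed one row of first at a time. A rough sketch, assuming both files are already parsed into lists of fields; summarize_row is a hypothetical helper, and rows of first without any match are skipped, which is what the expected output suggests:

```python
def summarize_row(first_fields, second_rows):
    """Return first_fields + [first matching name, match count], or None.

    `summarize_row` is an illustrative helper, not code from the original post.
    """
    start, end = int(first_fields[1]), int(first_fields[2])
    # Names of all rows in `second` that satisfy the three conditions, in file order.
    names = [
        row[0]
        for row in second_rows
        if row[1] == first_fields[0]
        and start < int(row[2]) < end
        and start < int(row[3]) < end
    ]
    if not names:
        return None  # assumption: unmatched rows are left out of the output
    return first_fields + [names[0], str(len(names))]


second_rows = [
    ["AKT3", "chr1", "105002000", "105010000", "transcript"],
    ["AKT3", "chr1", "243663021", "244006886", "transcript"],
]
first_fields = "chr1,105000000,105310000,2,1,3,2".split(",")
print(",".join(summarize_row(first_fields, second_rows)))
# prints chr1,105000000,105310000,2,1,3,2,AKT3,1
```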

expected output

chr19,107180000,107680000,1,1,5,4,AKAP8L,1
chr1,105000000,105310000,2,1,3,2,AKT3,1

I am trying to do this in Python and wrote the code below, but it does not return what I want.

first = open('first.csv', 'rb')
second = open('second.txt', 'rb')
first_file = []
for line in first:
    first_file.append(line.split(','))

second_file = []
for line2 in second:
    second_file.append(line.split())

count=0
final = []
for i in range(len(first_file)):
    for j in range(len(second_file)):
        first_row = first_file[i]
        second_row = second_file[j]
        first_col = first_row.split()
        second_col = second_row.split()
        if first_col[0] == second_col[1] and first_col[1] < second_col[2] < first_col[2] and first_col[1] < second_col[3] < first_col[2]
            count+=1
            final.append(first_col[i]+second_col[0]+count)

2 Answers

Since your files have no column names, this looks rather clumsy, but it works and uses pandas:

import pandas as pd

first = 'first.csv'
second = 'second.txt'

df1 = pd.read_csv(first, header=None)
df2 = pd.read_csv(second, sep=r'\s+', header=None)

merged = df1.merge(df2, left_on=[0], right_on=[1], suffixes=('first', 'second'))
a, b, c, d = merged['2second'], merged['1first'], merged['2first'], merged['3second']

cleaned = merged[(c>a)&(a>b)&(c>d)&(d>b)]

counted = cleaned.groupby(['0first', '1first', '2first', '3first', '4first', 5, 6, '0second'])['4second'].count().reset_index()

counted.to_csv('result.csv', index=False, header=False)

This generates result.csv with the following contents:

chr1,105000000,105310000,2,1,3,2,AKT3,1
chr19,107180000,107680000,1,1,5,4,AKAP8L,1

Keeping your original setup, it will work if you do it as follows. The coordinates are compared as integers, and the count is reset for every row of first.

# Read both files; strip newlines so rows can be re-joined cleanly.
with open('first.csv', 'r') as first:
    first_file = [line.strip() for line in first if line.strip()]
with open('second.txt', 'r') as second:
    second_file = [line.split() for line in second if line.strip()]
final = []
for first_row in first_file:
    first_col = first_row.split(',')
    start, end = int(first_col[1]), int(first_col[2])
    count = 0    # reset for every row of first
    name = None  # first column of the first matching row in second
    for second_col in second_file:
        # Compare as integers; comparing the raw strings would be lexicographic.
        if (first_col[0] == second_col[1]) and (start < int(second_col[2]) < end) and (start < int(second_col[3]) < end):
            count = count + 1
            if name is None:
                name = second_col[0]
    if count:
        final.append(first_row + ',' + name + ',' + str(count))
print(final)

This produces the result you want.

['chr1,105000000,105310000,2,1,3,2,AKT3,1', 'chr19,107180000,107680000,1,1,5,4,AKAP8L,1']
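
The nested loop above compares every row of first with every row of second. For larger inputs, one option is to group second by its chromosome column first, so each row of first is only checked against rows with the same chromosome, and to write the result to a file with the csv module instead of printing it. A sketch under those assumptions; the function name summarize_files is made up for illustration, and rows of first without a match are not written:

```python
import csv
from collections import defaultdict


def summarize_files(first_path, second_path, out_path):
    """Write one line per matched row of `first`: its columns, a name, a count.

    `summarize_files` is an illustrative name, not code from the original post.
    """
    # Group the rows of `second` by chromosome (its 2nd column).
    by_chrom = defaultdict(list)
    with open(second_path) as second:
        for line in second:
            fields = line.split()
            if fields:
                by_chrom[fields[1]].append((fields[0], int(fields[2]), int(fields[3])))

    with open(first_path, newline="") as first, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(first):
            if not row:
                continue  # skip blank lines
            start, end = int(row[1]), int(row[2])
            # Only rows of `second` on the same chromosome are examined.
            names = [
                name
                for name, s_start, s_end in by_chrom.get(row[0], [])
                if start < s_start < end and start < s_end < end
            ]
            if names:  # assumption: rows without a match are left out
                writer.writerow(row + [names[0], len(names)])
```

With the file names from the question, a call could look like summarize_files('first.csv', 'second.txt', 'result.csv'), where result.csv is an assumed output name. The output keeps the row order of first.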
