Python: conditionally loop over columns in a DataFrame and compare them with a reference dataset

Posted 2024-05-16 09:51:00


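What I want to do

I am fairly new to Python and have limited experience with the pandas library. I have been trying to modify the following script so that the program reads the contents of 3 CSV files, creates two new variables from the data in the first and second files, and then joins them into a single variable called Pred_arg. This is the reference data frame that the test results are compared against.

The third CSV file contains the test results and is loaded into the variable df.

Next, I am trying to write a script that scans each column of the variable and returns True or False (in an output table), depending on whether each cluster group contains at least one value from ABCPred and BCEPred. The goal is to print the results to a summary table with a True or False value for each cluster: if at least one value in a cluster's results is True, the cluster is marked True.

What I am aiming for: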

Cluster Number  Status 
clu1            True          
clu2            True
clu3            False
...             ...
clu57           True
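
One possible way to arrive at a table like this, as a minimal sketch on made-up toy data: flag each peptide by whether it appears in each reference list, then reduce per cluster with `groupby(...).any()`. The sequences, the `abc_seqs` and `bce_seqs` names, and the reading of the condition (a cluster is True only when it contains at least one ABCPred hit and at least one BCEPred hit) are assumptions for illustration, not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical sequences).
df = pd.DataFrame({
    "Cluster Number": [1, 1, 2, 2, 3],
    "Peptide": ["AAAA", "BBBB", "AAAA", "CCCC", "DDDD"],
})
abc_seqs = pd.Series(["AAAA", "EEEE"])  # stands in for ABCPred['Seq']
bce_seqs = pd.Series(["BBBB", "FFFF"])  # stands in for BCEPred['Seq']

# Row-level flags: is this peptide present in each reference list?
in_abc = df["Peptide"].isin(abc_seqs)
in_bce = df["Peptide"].isin(bce_seqs)

# Cluster-level flags: does the cluster contain at least one hit?
has_abc = in_abc.groupby(df["Cluster Number"]).any()
has_bce = in_bce.groupby(df["Cluster Number"]).any()

# Assumed rule: a cluster passes only if it has both kinds of hit.
status = (has_abc & has_bce).rename("Status").reset_index()
print(status)
```

If the intended rule is "at least one hit from either list", the `&` would become `|`.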

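Later I can use the groupby function to sort the groups and count all rows that are True and all rows that are False. Ultimately I need to remove every row that returns False, but I can manage that part.

What I have done so far

Step 1: read the ABCPred results and tidy them up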
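
The counting and filtering described here could look like the following sketch. The `status` table and `df` below are small invented examples with hypothetical values, and `df_kept` is a name introduced only for illustration.

```python
import pandas as pd

# Hypothetical per-cluster summary and peptide table.
status = pd.DataFrame({
    "Cluster Number": [1, 2, 3],
    "Status": [True, False, True],
})
df = pd.DataFrame({
    "Cluster Number": [1, 1, 2, 3],
    "Peptide": ["AAAA", "BBBB", "CCCC", "DDDD"],
})

# How many clusters are True and how many are False.
counts = status["Status"].value_counts()
print(counts)

# Keep only the peptides whose cluster is marked True.
passing = status.loc[status["Status"], "Cluster Number"]
df_kept = df[df["Cluster Number"].isin(passing)]
print(df_kept)
```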

import pandas as pd

ABCPred = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\Output data\ABCPred_res(254).csv")
ABCPred.columns = ['Seq','drop1','drop2','drop3','drop4']
ABCPred = ABCPred[ABCPred['Seq'].notna()]
ABCPred = ABCPred.drop(columns = ['drop1','drop2','drop3','drop4'])
print(ABCPred)


                      Seq
0        AGAAAYYVGYLQPRTF
1          AGCLIGAEHVNNSY
2        AGTITSGWTFGAGAAL
3    AGTITSGWTFGAGAALQIPF
4            ALEPLVDLPIGI
..                    ...
248    YQTQTNSPRRARSVASQS
249    YSSANNCTFEYVSQPFLM
250  YSSANNCTFEYVSQPFLMDL
251    YTSALLAGTITSGWTFGA
252      YVGYLQPRTFLLKYNE

Step 2: read the BCEPred results and tidy them up

BCEPred = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\Output data\BCEPred_res_cor.csv")
print(BCEPred)

            Seq
0       IHVSGTNGT
1        VYFASTEK
2        TTLDSKTQ
3         VYYHKNN
4         MDLEGKQ
5       SYLTPGDSS
6         DPLSETK
7        YAWNRKRI
8         QIAPGQT
9        NNLDSKVG
10       RLFRKSNL
11     ATVCGPKKST
12       GVLTESNK
13      VITPGTNTS
14        RVYSTGS
15  ASYQTQTNSPRRA
16        LPVSMTK
17       ICGDSTEC
18      IAVEQDKNT
19   QILPDPSKPSKR
20       GKIQDSLS
21        TLVKQLS
22      ECVLGQSKR
23        EVAKNLN
24       CKFDEDDS

Step 3: I combine these data frames into a new frame called Pred_arg

Pred_arg = ABCPred.assign(ABCSeq = ABCPred['Seq'],BCEPred = BCEPred['Seq']).reset_index()
Pred_arg = Pred_arg.drop(columns = ['index','Seq'])
print(Pred_arg)

                   ABCSeq    BCEPred
0        AGAAAYYVGYLQPRTF  IHVSGTNGT
1          AGCLIGAEHVNNSY   VYFASTEK
2        AGTITSGWTFGAGAAL   TTLDSKTQ
3    AGTITSGWTFGAGAALQIPF    VYYHKNN
4            ALEPLVDLPIGI    MDLEGKQ
..                    ...        ...
248    YQTQTNSPRRARSVASQS        NaN
249    YSSANNCTFEYVSQPFLM        NaN
250  YSSANNCTFEYVSQPFLMDL        NaN
251    YTSALLAGTITSGWTFGA        NaN
252      YVGYLQPRTFLLKYNE        NaN
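
The NaN values in the BCEPred column come from index alignment: `assign` matches rows by index label, and BCEPred has only 25 rows against 253 in ABCPred, so the remaining rows are padded. A small sketch with made-up sequences shows the same padding with `pd.concat`, and how to drop it again when only the values matter. The names `abc`, `bce` and `bce_values` are hypothetical.

```python
import pandas as pd

# Hypothetical short lists standing in for the two prediction tables.
abc = pd.Series(["AAAA", "BBBB", "CCCC"], name="ABCSeq")
bce = pd.Series(["XXXX"], name="BCEPred")

# Side-by-side join on the index; the shorter column is padded with NaN.
pred_arg = pd.concat([abc, bce], axis=1)
print(pred_arg)

# When only the values matter, drop the padding before comparing.
bce_values = pred_arg["BCEPred"].dropna()
print(bce_values.tolist())
```

Because the two lists are independent predictions, the row pairing in the combined frame carries no meaning of its own.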

Now that I have created the reference data frame, I want to compare against it.

Step 4: import the test results for comparison

df = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\IEDB_dataset_run1.csv")
df = df.drop(columns = ['Alignment','Position','Description'])
df = df.drop(df[df.Peptide == '-'].index)  # remove all rows where '-' exists in the Peptide column
df = df.drop(df[df['Peptide Number'] == 'Singleton'].index)  # remove singletons
print(df)

     Cluster Number Peptide Number               Peptide
1                 1              1  QDVNCTEVPVAIHADQLTPT
2                 1              2  DVNCTEVPVAIHADQLTPTW
3                 1              3  EVPVAIHADQLTPTWRVYST
4                 1              4  PVAIHADQLTPTWRVYSTGS
5                 1              5      DQLTPTWRVYSTGSNV
..              ...            ...                   ...
307              55              2    TQRNFYEPQIITTDNTFV
309              56              1      CCSCGSCCKFDEDDSE
310              56              2              CKFDEDDS
312              57              1    CCSCLKGCCSCGSCCKFD
313              57              2      CCSCLKGCCSCGSCCK
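
The two `drop` calls can also be written as a single boolean mask, which some readers find easier to follow. This is a sketch on invented rows; the column names are the ones shown above, and it assumes 'Peptide Number' holds strings such as 'Singleton'.

```python
import pandas as pd

# Invented rows using the same column names as the real file.
df = pd.DataFrame({
    "Cluster Number": [1, 1, 2, 3],
    "Peptide Number": ["1", "2", "Singleton", "1"],
    "Peptide": ["AAAA", "-", "BBBB", "CCCC"],
})

# Keep rows that are neither placeholder peptides nor singletons.
keep = (df["Peptide"] != "-") & (df["Peptide Number"] != "Singleton")
df = df[keep]
print(df)
```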

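This is where I am stuck

I have tried grouping by cluster, using the data from step 4. That displays everything, with the cluster number as the index running from 1 to 57, but I cannot use that grouping to check whether ABCPred and BCEPred sequences are present in clu1.

If I try to use isin for just one condition (the ABCPred results), it returns False for everything.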

df_groups = df.groupby(["Cluster Number"])["Peptide"].apply(list)
df_groups.columns = ['Cluster Number', 'Seq(s)']  # no effect on the data: df_groups is a Series, not a DataFrame
print(df_groups)

Cluster Number
1     [QDVNCTEVPVAIHADQLTPT, DVNCTEVPVAIHADQLTPTW, E...
2     [ISVTTEILPVSMTKTSVDCT, EILPVSMTKTSVDCTMYI, ILP...
3     [STEKSNIIRGWIFGTTLD, KSNIIRGWIFGTTLDS, IRGWIFG...
4     [YQPYRVVVLSFELLHAPATV, SFELLHAPATVCGP, FELLHAP...
5     [LHRSYLTPGDSSSG, HRSYLTPGDSSSGWTA, SYLTPGDSSSG...
6     [VYSSANNCTFEYVSQPFL, YSSANNCTFEYVSQPFLMDL, YSS...
7     [QIPFAMQMAYRFNG, PFAMQMAYRFNGIGVT, FAMQMAYRFNG...
8     [ASYQTQTNSPRRA, YQTQTNSPRRARSVASQS, YQTQTNSPRR...
9     [EMIAQYTSALLAGTITSG, YTSALLAGTITSGWTFGA, LAGTI...
10    [TPCSFGGVSVITPGTNTSNQ, PCSFGGVSVITPGTNTSNQV, P...
11    [RGVYYPDKVFRSSVLHSTQD, GVYYPDKVFRSSVLHSTQ, KVF...
12    [YNENGTITDAVDCA, NENGTITDAVDCALDP, ENGTITDAVDC...
13    [GVSPTKLNDLCFTNVYADSF, TKLNDLCFTNVYADSFVI, NDL...
14    [GVYYHKNNKSWMESEFRV, VYYHKNNKSWMESEFRVYSS, VYY...
15    [PFGEVFNATRFASVYAWNRK, TRFASVYAWNRKRI, RFASVYA...
16    [AGCLIGAEHVNNSY, GCLIGAEHVNNSYECD, LIGAEHVNNSY...
17    [TEIYQAGSTPCNGVEG, YQAGSTPCNGVEGFNC, QAGSTPCNG...
18    [QQFGRDIADTTDAVRDPQTL, QQFGRDIADTTDAV, QFGRDIA...
19    [YFPLQSYGFQ, LQSYGFQPTNGVGYQP, YGFQPTNGVGYQPYR...
20    [IHVSGTNGTKRFDNPVLPFN, IHVSGTNGT, VSGTNGTKRFDN...
21    [NLREFVFKNIDGYFKIYS, EFVFKNIDGYFKIYSKHT, FKNID...
22    [IAVEQDKNT, AVEQDKNTQEVFAQ, VEQDKNTQEVFAQV, QD...
23    [DKVEAEVQIDRLITGRLQSL, EAEVQIDRLITGRLQSLQTY, Q...
24    [DSLSSTASALGKLQDV, LSSTASALGKLQDVVNQN, LSSTASA...
25    [PGQTGKIADYNYKLPD, GQTGKIADYNYKLP, TGKIADYNYKL...
26    [YEQYIKWPWYIWLGFIAG, YEQYIKWPWYIWLGFI, YIKWPWY...
27    [TVEKGIYQTSNFRVQP, EKGIYQTSNFRVQPTE, KGIYQTSNF...
28    [KSNLKPFERDISTEIYQA, SNLKPFERDISTEIYQAGST, FER...
29     [VLYNSASFSTFKCYGVSP, FSTFKCYGVSPTKL, STFKCYGVSP]
30    [HGVVFLHVTYVPAQEK, GVVFLHVTYVPAQEKNFT, HVTYVPA...
31    [PGTNTSNQVAVLYQDV, GTNTSNQVAVLYQDVNCT, TSNQVAV...
32    [KQIYKTPPIKDFGGFN, KTPPIKDFGGFN, TPPIKDFGGFNFS...
33    [VTQQLIRAAEIRASANLAAT, VTQQLIRAAEIRASANLA, TQQ...
34     [GCVIAWNSNNLDSKVGGNYN, CVIAWNSNNLDSKV, NNLDSKVG]
35         [GNYNYLYRLFRKSNLKPF, NYLYRLFRKSNL, RLFRKSNL]
36    [GGFNFSQILPDPSKPSKR, SQILPDPSKPSKRSFI, QILPDPS...
37    [SSNFGAISSVLNDI, SNFGAISSVLNDILSRLD, ISSVLNDIL...
38      [QKEIDRLNEVAKNLNE, KEIDRLNEVAKNLNESLI, EVAKNLN]
39    [FPNITNLCPFGEVFNA, PNITNLCPFGEVFN, NITNLCPFGEV...
40                           [LTGTGVLTESNKKF, GVLTESNK]
41                           [VLPFNDGVYFASTE, VYFASTEK]
42                   [ECSNLLLQYGSFCTQLNRAL, LQYGSFCTQL]
43                          [EVRQIAPGQTGKIADY, QIAPGQT]
44                       [QLPPAYTNSFTR, PPAYTNSFTRGVYY]
45         [VTLADAGFIKQYGDCLGDIA, GFIKQYGDCLGDIAARDLIC]
46                            [TLVKQLS, LVKQLSSNFGAISS]
47                         [IGKIQDSLSSTASALG, GKIQDSLS]
48             [TNVVIKVCEFQFCNDP, VVIKVCEFQFCNDPFLGVYY]
49               [ESLIDLQELGKYEQYI, DLQELGKYEQYIKWPWYI]
50               [GDIAARDLICAQKFNGLT, RDLICAQKFNGLTVLP]
51                   [PQGFSALEPLVDLPIGIN, ALEPLVDLPIGI]
52               [VVIGIVNNTVYDPLQPEL, VIGIVNNTVYDPLQPE]
53                   [EILDITPCSFGGVSVI, EILDITPCSFGGVS]
54               [NFRVQPTESIVRFPNITN, VQPTESIVRFPNITNL]
55                 [WFVTQRNFYEPQII, TQRNFYEPQIITTDNTFV]
56                         [CCSCGSCCKFDEDDSE, CKFDEDDS]
57               [CCSCLKGCCSCGSCCKFD, CCSCLKGCCSCGSCCK]


rslt_df = Pred_arg['ABCSeq'].isin(df_groups)
print(rslt_df.describe())  # comparison comes back all False

count       253
unique        1
top       False
freq        253
Name: ABCSeq, dtype: object
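
A likely reason for the all-False result: each element of `df_groups` is a whole list of peptides, and `isin` compares every sequence against those list objects rather than against the strings inside them. A string never equals a list, so nothing matches. The sketch below, on made-up data, flattens the lists with `Series.explode` before the membership test; `abc` and `flat` are hypothetical names.

```python
import pandas as pd

# Made-up grouped peptides: one list per cluster, as groupby(...).apply(list) returns.
df_groups = pd.Series(
    [["AAAA", "BBBB"], ["CCCC"]],
    index=pd.Index([1, 2], name="Cluster Number"),
)
abc = pd.Series(["AAAA", "ZZZZ"], name="ABCSeq")

# A string compared with a list is always unequal.
print("AAAA" == ["AAAA", "BBBB"])

# explode() turns each list element into its own row, keeping the cluster index.
flat = df_groups.explode()
print(abc.isin(flat))
```

Testing against the ungrouped `df['Peptide']` column has the same effect, since it already holds one string per row.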

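I know I am missing something that is most likely very simple, but I think a fresh perspective and some guidance would go a long way towards improving my practice.

Update

I seem to be able to compare the cell contents using the approach below, although it is rather crude.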

#comparing group to pred_arg
rslt_df1 = Pred_arg['ABCSeq'].isin(df['Peptide'])
rslt_df2 = Pred_arg['BCEPred'].isin(df['Peptide'])
rslt_df = df.assign(ABCSeq = rslt_df1, BCEPred = rslt_df2).reset_index()
concencus = Pred_arg['ABCSeq'].isin(df['Peptide']) & Pred_arg['BCEPred'].isin(df['Peptide'])

print(concencus.describe()) # working better

count       253
unique        2
top       False
freq        232
dtype: object
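
One caveat with the `&` above, offered as an observation rather than a definitive diagnosis: it combines the two columns row by row, so it is True only where the ABCPred sequence and the BCEPred sequence that happen to share a row index both match. If the two lists are unrelated, that pairing is arbitrary. A toy sketch with invented sequences:

```python
import pandas as pd

# Invented reference frame and peptide list.
pred_arg = pd.DataFrame({
    "ABCSeq": ["AAAA", "QQQQ"],
    "BCEPred": ["RRRR", "BBBB"],
})
peptides = pd.Series(["AAAA", "BBBB"])

abc_hit = pred_arg["ABCSeq"].isin(peptides)   # row 0 matches
bce_hit = pred_arg["BCEPred"].isin(peptides)  # row 1 matches

# Row-wise AND: no row has both hits, although each column has one.
print((abc_hit & bce_hit).any())
# Column-wise check: each reference list has at least one match.
print(abc_hit.any() and bce_hit.any())
```

Evaluating each reference list separately, and only then combining the results per cluster, avoids depending on how the rows happen to line up.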

Thanks :)

