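What I am trying to do
I am fairly new to Python and have limited experience with the pandas library. I have been trying to write a program that reads the contents of 3 CSV files, creates two new variables from the data in the first and second data frames, and then joins them into a single variable called Pred_arg. This is the reference data frame that results can be compared against.
The third CSV file holds the test results and is loaded into the variable df.
Next, I am trying to write a script that scans each column of that variable and returns True or False (in an output table), on the condition that each cluster group contains at least one value from ABCPred and BCEPred. The aim is to print the results in a summary table with one True or False value per cluster: if at least one value in a cluster's results is True, that cluster is marked True.
My goal is: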
Cluster Number Status
clu1 True
clu2 True
clu3 False
... ...
clu57 True
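Later I can use the group by function to sort the groups and count all the rows that are True and all the rows that are False. Eventually I need to remove every row that returns False, but I can manage that part.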
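The counting and filtering step described above could look roughly like the following. This is a minimal sketch on made-up data: the `status` and `peptides` frames and their values are hypothetical stand-ins, not the real results.

```python
import pandas as pd

# Hypothetical summary table: one row per cluster with a True/False status.
status = pd.DataFrame(
    {"Cluster Number": [1, 2, 3, 4], "Status": [True, True, False, True]}
)

# Hypothetical peptide table keyed by the same cluster numbers.
peptides = pd.DataFrame(
    {
        "Cluster Number": [1, 1, 2, 3, 3, 4],
        "Peptide": ["AAA", "AAB", "CCC", "DDD", "DDE", "FFF"],
    }
)

# Count how many clusters are True and how many are False.
counts = status.groupby("Status").size()

# Keep only the clusters flagged True, then keep only their peptide rows.
kept_clusters = status.loc[status["Status"], "Cluster Number"]
filtered = peptides[peptides["Cluster Number"].isin(kept_clusters)]

print(counts)
print(filtered)
```

Filtering with `isin` on the cluster numbers removes every peptide row that belongs to a False cluster in one pass, without looping over groups.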
What I have done so far
Step 1 - Read the ABCPred results and tidy them
import pandas as pd

ABCPred = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\Output data\ABCPred_res(254).csv")
ABCPred.columns = ['Seq','drop1','drop2','drop3','drop4']
ABCPred = ABCPred[ABCPred['Seq'].notna()]
ABCPred = ABCPred.drop(columns = ['drop1','drop2','drop3','drop4'])
print(ABCPred)
Seq
0 AGAAAYYVGYLQPRTF
1 AGCLIGAEHVNNSY
2 AGTITSGWTFGAGAAL
3 AGTITSGWTFGAGAALQIPF
4 ALEPLVDLPIGI
.. ...
248 YQTQTNSPRRARSVASQS
249 YSSANNCTFEYVSQPFLM
250 YSSANNCTFEYVSQPFLMDL
251 YTSALLAGTITSGWTFGA
252 YVGYLQPRTFLLKYNE
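The same tidy-up can probably be done with fewer steps by reading only the first column. A minimal sketch, assuming the file has five columns and that only the first one matters; the in-memory CSV and its header names (`a` to `e`) are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the ABCPred CSV: five columns, one blank sequence.
raw_csv = io.StringIO(
    "a,b,c,d,e\n"
    "AGAAAYYVGYLQPRTF,1,2,3,4\n"
    ",5,6,7,8\n"
    "AGCLIGAEHVNNSY,9,10,11,12\n"
)

# Read only the first column, so there is nothing to drop afterwards.
abc_pred = pd.read_csv(raw_csv, usecols=[0])
abc_pred.columns = ["Seq"]

# Remove rows without a sequence and renumber the remaining rows.
abc_pred = abc_pred.dropna(subset=["Seq"]).reset_index(drop=True)

print(abc_pred)
```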
Step 2 - Read the BCEPred results and tidy them
BCEPred = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\Output data\BCEPred_res_cor.csv")
print(BCEPred)
Seq
0 IHVSGTNGT
1 VYFASTEK
2 TTLDSKTQ
3 VYYHKNN
4 MDLEGKQ
5 SYLTPGDSS
6 DPLSETK
7 YAWNRKRI
8 QIAPGQT
9 NNLDSKVG
10 RLFRKSNL
11 ATVCGPKKST
12 GVLTESNK
13 VITPGTNTS
14 RVYSTGS
15 ASYQTQTNSPRRA
16 LPVSMTK
17 ICGDSTEC
18 IAVEQDKNT
19 QILPDPSKPSKR
20 GKIQDSLS
21 TLVKQLS
22 ECVLGQSKR
23 EVAKNLN
24 CKFDEDDS
Step 3 - I combined these data frames into a new frame called Pred_arg
Pred_arg = ABCPred.assign(ABCSeq = ABCPred['Seq'],BCEPred = BCEPred['Seq']).reset_index()
Pred_arg = Pred_arg.drop(columns = ['index','Seq'])
print(Pred_arg)
ABCSeq BCEPred
0 AGAAAYYVGYLQPRTF IHVSGTNGT
1 AGCLIGAEHVNNSY VYFASTEK
2 AGTITSGWTFGAGAAL TTLDSKTQ
3 AGTITSGWTFGAGAALQIPF VYYHKNN
4 ALEPLVDLPIGI MDLEGKQ
.. ... ...
248 YQTQTNSPRRARSVASQS NaN
249 YSSANNCTFEYVSQPFLM NaN
250 YSSANNCTFEYVSQPFLMDL NaN
251 YTSALLAGTITSGWTFGA NaN
252 YVGYLQPRTFLLKYNE NaN
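The NaN values in the BCEPred column appear because the two prediction tables have different lengths and are aligned on the row index, so the pairing of rows carries no meaning. A minimal sketch of the same side-by-side layout using `pd.concat`, plus plain sets as an alternative for membership checks; the miniature tables and the names `abc_set` and `bce_set` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two prediction tables.
abc_pred = pd.DataFrame({"Seq": ["AAAA", "BBBB", "CCCC"]})
bce_pred = pd.DataFrame({"Seq": ["XXXX"]})

# Side-by-side frame: concat aligns on the row index, so the shorter
# column is padded with NaN, giving the same shape as Pred_arg.
pred_arg = pd.concat(
    [abc_pred["Seq"].rename("ABCSeq"), bce_pred["Seq"].rename("BCEPred")],
    axis=1,
)

# For membership tests, plain sets avoid the NaN padding altogether.
abc_set = set(abc_pred["Seq"])
bce_set = set(bce_pred["Seq"].dropna())

print(pred_arg)
```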
Now that I have created the reference data frame, I want to compare against it
Step 4 - Import the test results for comparison
df = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\IEDB_dataset_run1.csv")
df = df.drop(columns = ['Alignment','Position','Description'])
df = df.drop(df[df.Peptide == '-'].index) #removes all rows where '-' exists in the peptide column
df = df.drop(df[df['Peptide Number'] == 'Singleton'].index) #remove singletons
Cluster Number Peptide Number Peptide
1 1 1 QDVNCTEVPVAIHADQLTPT
2 1 2 DVNCTEVPVAIHADQLTPTW
3 1 3 EVPVAIHADQLTPTWRVYST
4 1 4 PVAIHADQLTPTWRVYSTGS
5 1 5 DQLTPTWRVYSTGSNV
.. ... ... ...
307 55 2 TQRNFYEPQIITTDNTFV
309 56 1 CCSCGSCCKFDEDDSE
310 56 2 CKFDEDDS
312 57 1 CCSCLKGCCSCGSCCKFD
313 57 2 CCSCLKGCCSCGSCCK
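The two `drop` calls in Step 4 can also be written as a single boolean mask. A minimal sketch, assuming the columns are read as strings; the miniature table below is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the cluster table after the unused columns are dropped.
df = pd.DataFrame(
    {
        "Cluster Number": ["1", "1", "2", "3"],
        "Peptide Number": ["1", "2", "Singleton", "1"],
        "Peptide": ["-", "QDVN", "ABCD", "EVPV"],
    }
)

# Keep rows that have a real peptide and are not singletons.
mask = (df["Peptide"] != "-") & (df["Peptide Number"] != "Singleton")
clean = df[mask]

print(clean)
```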
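This is where I am stuck
I have tried grouping by the clusters from Step 4. That shows everything, with the cluster number as an index running from 0 to 57, but I cannot use the grouped result to check whether ABCPred and BCEPred are in clu1.
If I try isin on a single condition (i.e. the ABCPred results), it returns False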
df_groups = df.groupby(["Cluster Number"])["Peptide"].apply(list)
df_groups.columns = ['Cluster Number', 'Seq(s)']
print(df_groups)
Cluster Number
1 [QDVNCTEVPVAIHADQLTPT, DVNCTEVPVAIHADQLTPTW, E...
2 [ISVTTEILPVSMTKTSVDCT, EILPVSMTKTSVDCTMYI, ILP...
3 [STEKSNIIRGWIFGTTLD, KSNIIRGWIFGTTLDS, IRGWIFG...
4 [YQPYRVVVLSFELLHAPATV, SFELLHAPATVCGP, FELLHAP...
5 [LHRSYLTPGDSSSG, HRSYLTPGDSSSGWTA, SYLTPGDSSSG...
6 [VYSSANNCTFEYVSQPFL, YSSANNCTFEYVSQPFLMDL, YSS...
7 [QIPFAMQMAYRFNG, PFAMQMAYRFNGIGVT, FAMQMAYRFNG...
8 [ASYQTQTNSPRRA, YQTQTNSPRRARSVASQS, YQTQTNSPRR...
9 [EMIAQYTSALLAGTITSG, YTSALLAGTITSGWTFGA, LAGTI...
10 [TPCSFGGVSVITPGTNTSNQ, PCSFGGVSVITPGTNTSNQV, P...
11 [RGVYYPDKVFRSSVLHSTQD, GVYYPDKVFRSSVLHSTQ, KVF...
12 [YNENGTITDAVDCA, NENGTITDAVDCALDP, ENGTITDAVDC...
13 [GVSPTKLNDLCFTNVYADSF, TKLNDLCFTNVYADSFVI, NDL...
14 [GVYYHKNNKSWMESEFRV, VYYHKNNKSWMESEFRVYSS, VYY...
15 [PFGEVFNATRFASVYAWNRK, TRFASVYAWNRKRI, RFASVYA...
16 [AGCLIGAEHVNNSY, GCLIGAEHVNNSYECD, LIGAEHVNNSY...
17 [TEIYQAGSTPCNGVEG, YQAGSTPCNGVEGFNC, QAGSTPCNG...
18 [QQFGRDIADTTDAVRDPQTL, QQFGRDIADTTDAV, QFGRDIA...
19 [YFPLQSYGFQ, LQSYGFQPTNGVGYQP, YGFQPTNGVGYQPYR...
20 [IHVSGTNGTKRFDNPVLPFN, IHVSGTNGT, VSGTNGTKRFDN...
21 [NLREFVFKNIDGYFKIYS, EFVFKNIDGYFKIYSKHT, FKNID...
22 [IAVEQDKNT, AVEQDKNTQEVFAQ, VEQDKNTQEVFAQV, QD...
23 [DKVEAEVQIDRLITGRLQSL, EAEVQIDRLITGRLQSLQTY, Q...
24 [DSLSSTASALGKLQDV, LSSTASALGKLQDVVNQN, LSSTASA...
25 [PGQTGKIADYNYKLPD, GQTGKIADYNYKLP, TGKIADYNYKL...
26 [YEQYIKWPWYIWLGFIAG, YEQYIKWPWYIWLGFI, YIKWPWY...
27 [TVEKGIYQTSNFRVQP, EKGIYQTSNFRVQPTE, KGIYQTSNF...
28 [KSNLKPFERDISTEIYQA, SNLKPFERDISTEIYQAGST, FER...
29 [VLYNSASFSTFKCYGVSP, FSTFKCYGVSPTKL, STFKCYGVSP]
30 [HGVVFLHVTYVPAQEK, GVVFLHVTYVPAQEKNFT, HVTYVPA...
31 [PGTNTSNQVAVLYQDV, GTNTSNQVAVLYQDVNCT, TSNQVAV...
32 [KQIYKTPPIKDFGGFN, KTPPIKDFGGFN, TPPIKDFGGFNFS...
33 [VTQQLIRAAEIRASANLAAT, VTQQLIRAAEIRASANLA, TQQ...
34 [GCVIAWNSNNLDSKVGGNYN, CVIAWNSNNLDSKV, NNLDSKVG]
35 [GNYNYLYRLFRKSNLKPF, NYLYRLFRKSNL, RLFRKSNL]
36 [GGFNFSQILPDPSKPSKR, SQILPDPSKPSKRSFI, QILPDPS...
37 [SSNFGAISSVLNDI, SNFGAISSVLNDILSRLD, ISSVLNDIL...
38 [QKEIDRLNEVAKNLNE, KEIDRLNEVAKNLNESLI, EVAKNLN]
39 [FPNITNLCPFGEVFNA, PNITNLCPFGEVFN, NITNLCPFGEV...
40 [LTGTGVLTESNKKF, GVLTESNK]
41 [VLPFNDGVYFASTE, VYFASTEK]
42 [ECSNLLLQYGSFCTQLNRAL, LQYGSFCTQL]
43 [EVRQIAPGQTGKIADY, QIAPGQT]
44 [QLPPAYTNSFTR, PPAYTNSFTRGVYY]
45 [VTLADAGFIKQYGDCLGDIA, GFIKQYGDCLGDIAARDLIC]
46 [TLVKQLS, LVKQLSSNFGAISS]
47 [IGKIQDSLSSTASALG, GKIQDSLS]
48 [TNVVIKVCEFQFCNDP, VVIKVCEFQFCNDPFLGVYY]
49 [ESLIDLQELGKYEQYI, DLQELGKYEQYIKWPWYI]
50 [GDIAARDLICAQKFNGLT, RDLICAQKFNGLTVLP]
51 [PQGFSALEPLVDLPIGIN, ALEPLVDLPIGI]
52 [VVIGIVNNTVYDPLQPEL, VIGIVNNTVYDPLQPE]
53 [EILDITPCSFGGVSVI, EILDITPCSFGGVS]
54 [NFRVQPTESIVRFPNITN, VQPTESIVRFPNITNL]
55 [WFVTQRNFYEPQII, TQRNFYEPQIITTDNTFV]
56 [CCSCGSCCKFDEDDSE, CKFDEDDS]
57 [CCSCLKGCCSCGSCCKFD, CCSCLKGCCSCGSCCK]
rslt_df = Pred_arg['ABCSeq'].isin(df_groups)
print(rslt_df.describe()) ## comparison coming back all false !!!!!!!
count 253
unique 1
top False
freq 253
Name: ABCSeq, dtype: object
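One likely explanation for the all-False result is that every value in df_groups is a whole Python list, so a single sequence string never equals any of them. A minimal sketch of one way around this, using `Series.explode` to turn the lists back into one peptide per row before calling `isin`; the miniature data and the names `abc_seq`, `flat` and `hits` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the cluster table and the ABCPred column.
df = pd.DataFrame(
    {
        "Cluster Number": [1, 1, 2],
        "Peptide": ["AGCLIGAEHVNNSY", "GCLIGAEHVNNSYECD", "TLVKQLS"],
    }
)
abc_seq = pd.Series(["AGCLIGAEHVNNSY", "YVGYLQPRTFLLKYNE"], name="ABCSeq")

# Grouping into lists gives one list object per cluster.
df_groups = df.groupby("Cluster Number")["Peptide"].apply(list)

# explode() puts each list element on its own row, keeping the cluster
# number as the index, so isin() compares string with string.
flat = df_groups.explode()
hits = abc_seq.isin(flat)

print(hits.tolist())
```

Checking against `df['Peptide']` directly should give the same matches, since exploding the grouped lists simply recovers that column.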
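I know I am missing something that is most likely very simple, but I think a fresh perspective and some guidance would go a long way toward improving my practice
Update
I seem to be able to compare the cell contents using the approach below, although it is rather crude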
#comparing group to pred_arg
rslt_df1 = Pred_arg['ABCSeq'].isin(df['Peptide'])
rslt_df2 = Pred_arg['BCEPred'].isin(df['Peptide'])
rslt_df = df.assign(ABCSeq = rslt_df1, BCEPred = rslt_df2).reset_index()
concencus = Pred_arg['ABCSeq'].isin(df['Peptide']) & Pred_arg['BCEPred'].isin(df['Peptide'])
print(concencus.describe()) # working better
count 253
unique 2
top False
freq 232
dtype: object
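To get from row-level matches to the per-cluster table, it may help to flip the direction of the check so that it runs over df['Peptide'], which keeps the cluster number attached to each result. A minimal sketch, assuming a cluster counts as True only when it contains at least one ABCPred sequence and at least one BCEPred sequence; that rule is my reading of the goal, and the miniature data and the names `abc_set`, `bce_set`, `flags` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature inputs; the real ones come from the CSV files.
abc_set = {"AGCLIGAEHVNNSY", "ALEPLVDLPIGI"}
bce_set = {"TLVKQLS", "CKFDEDDS"}
df = pd.DataFrame(
    {
        "Cluster Number": [1, 1, 2, 2, 3],
        "Peptide": [
            "AGCLIGAEHVNNSY",    # in abc_set
            "TLVKQLS",           # in bce_set
            "ALEPLVDLPIGI",      # in abc_set
            "LVKQLSSNFGAISS",    # in neither set
            "CCSCLKGCCSCGSCCK",  # in neither set
        ],
    }
)

# Flag each peptide row, then ask per cluster whether any row is flagged.
flags = pd.DataFrame(
    {
        "Cluster Number": df["Cluster Number"],
        "in_abc": df["Peptide"].isin(abc_set),
        "in_bce": df["Peptide"].isin(bce_set),
    }
)
per_cluster = flags.groupby("Cluster Number")[["in_abc", "in_bce"]].any()

# Assumed rule: a cluster is True only if both predictors hit it.
# Swap & for | if a hit from either predictor should be enough.
summary = (
    (per_cluster["in_abc"] & per_cluster["in_bce"])
    .rename("Status")
    .reset_index()
)

print(summary)
```

The resulting `summary` frame has the same two columns as the target table (Cluster Number and Status), so it can be counted or used to filter df afterwards.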
Thanks :)