Creating uniquely named data frames from a larger data frame

Posted 2024-05-13 08:52:04


I am brand new to working with data frames and loops, and I am looking for an answer in either Python or R. I have a data frame structured like the one below.

        TP1.v1  | TP1.v2 | TP1.v3 | TP2.v1 | TP2.v2 | TP2.v3 |... TPn.v1
 Gene A|  7     |6       |7       |6       |4       |1       |... 9    
 Gene B|  3     |4       |4       |4       |5       |3       |... 3    
 Gene n|  6     |1       |1       |5       |7       |7       |... 8     

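I want to create a new data frame for each of TP1, TP2, and so on. Each TP (time point) has 3 columns associated with it. I would also like to do this in a loop, because I have multiple files with a similar structure. Finally, I want the loop to give each new data frame a new, unique name.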

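I have been able to do this in R without a loop, simply by calling base functions repeatedly to manipulate the data frame. That is slow and laborious, so I would like to do it in a loop instead.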

The ideal output is n uniquely named data frames, each with 3 columns and each keeping the row names and column names of the original data frame.

Below is the output of dput(head(df)) from R:

structure(list(D1.log2fc = c(-0.453086, -0.1828075, 0.105551500000001, 
0.368134000000001, 0.194800000000001, -0.327664499999999), D1.AveExp = c(4.9001385, 
5.59887075, 9.35607416666667, 9.466082, 9.28132575, 5.43070783333333    
), D1.adjPval = c(0.158162310733078, 0.680539779380169, 0.798318133631351, 
0.368809197240543, 0.588741274410125, 0.363696882398466), D3.log2fc = c(-0.5979695, 
-0.510921500000001, 0.544158999999999, 0.354766, 0.631701999999999, 
-0.365363499999998), D3.AveExp = c(4.9001385, 5.59887075, 9.35607416666667, 
9.466082, 9.28132575, 5.43070783333333), D3.adjPval =  c(0.0354796268783931, 
0.104426887750224, 0.0342979093938487, 0.318289098430963, 0.0318404713171763, 
0.231275103023615), D6.log2fc = c(-0.349413, -0.854375500000001, 
0.7416965, 0.5901225, 0.821465500000002, -0.578061499999999), 
D6.AveExp = c(4.9001385, 5.59887075, 9.35607416666667, 9.466082, 
9.28132575, 5.43070783333333), D6.adjPval = c(0.151181193217808, 
0.00788722811936, 0.00487109163210043, 0.0635131764099792, 
0.00547087529420614, 0.0423872835135151), D10.log2fc =      c(-0.528707499999999, 
-0.431807000000002, 0.454508000000001, 0.628860999999999, 
0.379918500000002, -0.195571999999999), D10.AveExp = c(4.9001385, 
5.59887075, 9.35607416666667, 9.466082, 9.28132575, 5.43070783333333
), D10.adjPval = c(0.0360033103086792, 0.125511404231851, 
0.0445352483558512, 0.0499786423872913, 0.126969394135026, 
0.517590415583245), D14.log2fc = c(-0.517372, -0.379950000000001, 
0.596869, 0.7255935, 0.6545535, -0.205755499999999), D14.AveExp = c(4.9001385, 
5.59887075, 9.35607416666667, 9.466082, 9.28132575, 5.43070783333333
), D14.adjPval = c(0.039311630129941, 0.172677856404577, 
0.0124695746689562, 0.0265985268105264, 0.0152333310246979, 
0.452405710914221)), row.names = c("hsa-let-7a-2", "hsa-let-7b", 
"hsa-let-7d", "hsa-let-7e", "hsa-let-7f", "hsa-let-7f1"), class = "data.frame")

2 Answers

There are two ways to do this in R.

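# assuming you know the prefix and the time points (e.g. "D" and 1, 3, 6, 10, 14)
tp <- c(1, 3, 6, 10, 14)
prefix <- "D"

# for loop
for (i in tp) {
  common <- paste0(prefix, i) # create common name e.g. D1, D3, D6 etc.
  # anchor the pattern with "^" so only columns that start with e.g. "D1." match
  cols <- grep(paste0("^", common, "\\."), colnames(df), ignore.case = TRUE)
  # assign columns to its unique df; drop = FALSE keeps a data frame even for one column
  assign(common, df[, cols, drop = FALSE])
}

# using lapply (could be a bit faster than for loop)
# invisible() suppresses printing of the list that lapply returns
invisible(lapply(tp, function(i) {
  common <- paste0(prefix, i) # create common name e.g. D1, D3, D6 etc.
  cols <- grep(paste0("^", common, "\\."), colnames(df), ignore.case = TRUE)
  # assign into the global environment, because we are inside a function
  assign(common, df[, cols, drop = FALSE], envir = .GlobalEnv)
}))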

Edit: in my benchmark, lapply was actually much faster than the for loop. Here are the microbenchmark results:

Unit: microseconds
        expr      min       lq      mean    median       uq      max neval
    for.loop 3045.718 3167.800 3549.2943 3284.6260 3424.485 79971.27  1000
 lapply.call  170.647  184.086  204.4465  192.4345  200.538  4123.52  1000

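I am not sure what you mean by uniquely named data frames. The following creates a dictionary that holds each data frame, keyed by time point. Hope it helps.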

import pandas as pd
import numpy as np

# Sample Data
df = pd.DataFrame(np.random.rand(50,3*10), 
                  columns = ['TP%d.v%d'%(i, j) for i in range(1,11) for j in range(1,4)])

# Construct dictionary:
dd = {}
for name in df.columns.str.split('.').str[0].unique():
    # match on name + '.' so that 'TP1' does not also pick up the 'TP10.*' columns
    dd[name] = df[df.columns[df.columns.str.startswith(name + '.')]].copy()
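Since the question's real columns look like D1.log2fc, D10.AveExp and so on, here is a minimal sketch of the same dictionary idea wrapped in a function. The name split_by_prefix and the small sample frame are my own, made up for illustration; the sketch assumes the time point is everything before the first '.' in each column name.

```python
import pandas as pd


def split_by_prefix(df, sep="."):
    """Return {prefix: sub-frame}, where prefix is the text before the first `sep`.

    `split_by_prefix` is a hypothetical helper name, not a pandas function.
    """
    # Prefix of every column, e.g. 'D1.log2fc' -> 'D1'
    prefixes = df.columns.str.split(sep).str[0]
    # Compare whole prefixes, so 'D1' never matches 'D10'
    return {
        prefix: df.loc[:, prefixes == prefix].copy()
        for prefix in prefixes.unique()
    }


# Tiny made-up sample shaped like the dput output in the question
df = pd.DataFrame(
    {
        "D1.log2fc": [-0.45, -0.18],
        "D1.AveExp": [4.90, 5.60],
        "D10.log2fc": [-0.53, -0.43],
        "D10.AveExp": [4.90, 5.60],
    },
    index=["hsa-let-7a-2", "hsa-let-7b"],
)

frames = split_by_prefix(df)
print(list(frames))                    # the time-point names found
print(frames["D1"].columns.tolist())   # columns kept for D1
```

Each sub-frame keeps the original row index and column names. For several similarly structured files, the same function could be called once per file inside a loop, storing each result under the file name.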

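Alternatively, you could work with a MultiIndex data frame. The solution below simply redefines the columns of the current data frame. It can be more complicated to work with, but it is more efficient: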

# MultiIndex Solution
df.columns = df.columns.str.split('.', expand=True)
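To give an idea of what working with the MultiIndex columns looks like, here is a small sketch on a made-up 2 x 6 frame (the gene labels and values are only illustrative). Selecting by the first level gives all columns for one time point, and xs pulls one variable across all time points.

```python
import numpy as np
import pandas as pd

# Made-up sample: 2 genes, 2 time points, 3 variables each
df = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    index=["Gene A", "Gene B"],
    columns=["TP1.v1", "TP1.v2", "TP1.v3", "TP2.v1", "TP2.v2", "TP2.v3"],
)

# Split 'TP1.v1' into the two column levels ('TP1', 'v1')
df.columns = df.columns.str.split(".", expand=True)

# All columns belonging to time point TP1 (the first level is dropped)
tp1 = df["TP1"]

# The v1 column of every time point (the second level is dropped)
v1 = df.xs("v1", axis=1, level=1)

print(tp1)
print(v1)
```

Nothing is copied into separate variables here; each time point is just a view-like selection on the one data frame, which is why no unique names are needed.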
