Python数据分析,将一个表中的多个列与另一个表中的单个列合并

2024-06-01 02:07:42 发布

您现在位置:Python中文网/ 问答频道 /正文

假设我有这样的桌子:

+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country  | 1          | NAN      | NAN        | NAN      | NAN        | 100   |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 20    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 80    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 4          | sex      | 6          | 40    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 5          | sex      | 7          | 60    |
+----------+------------+----------+------------+----------+------------+-------+

我想用panadas中LEFT JOIN列的下表来a_name

+----+---------+-------+-------+-------+
| id | a_name  | c01   | c02   | c03   |
+----+---------+-------+-------+-------+
| 1  | country | dtr1  | dtr2  | dtr3  |
+----+---------+-------+-------+-------+
| 2  | region  | dtc1  | dtc2  | dtc3  |
+----+---------+-------+-------+-------+
| 3  | city    | dta1  | dta2  | dta3  |
+----+---------+-------+-------+-------+
| 4  | age     | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5  | sex     | dts1  | dts2  | dts3  |
+----+---------+-------+-------+-------+

我想将列c01, c02 and c03添加到第一个表中列a_name_0, a_name_1 and a_name_2中出现的每个值(country ,region, city, age,sex)。你知道吗

显然,对于出现在a_name_0, a_name_1 and a_name_2列中的每个值,我需要添加三个新列,否则我的表将有不同的行数。行的其余值应该是空的,或者NA或NAN..随便什么。你知道吗

预期输出:

+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c01_0 | c01_0 | id_qname_0 | a_name_1 | c01_1 | c01_1 | c01_1 | id_qname_1 | a_name_2 | c01_2 | c01_2 | c01_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country  | dtCo1 | dtCo2 | dtCo3 | 1          | NAN      | NAN   | NAN   | NAN   | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 70    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   | dtr1  | dtr2  | dtr2  | 2          | city     | dtc1  | dtc2  | dtc3  | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 2          | city     |       |       |       | NAN        | NAN      |       |       |       | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 4          | sex      |       |       |       | 6          | 40    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 5          | sex      |       |       |       | 7          | 60    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+

说明:

我正在建立数据仓库表,将用于数据分析的目的。报价表(第一个表)应该填充各种项目报价信息(表2),需要直观地表示。你知道吗


Tags: andnameidcityagecountnancountry
2条回答

使用外部联接合并数据帧,并指定要联接表的列(来自每个数据帧)。你知道吗

# Sample data
>>> A
  name_1 name_2  values
0      a      b       1
1      b      c       2
2      c      b       3
3      d      a       4

>>> B
  name  values
0    a       1
1    b       2
2    c       3

>>> C
  name  values
0    a      10
1    b      20
2    c      30

使用^{}方法,可以指定要合并数据帧的列。将how参数设置为outer指定外部联接,这将用NaN填充不匹配的数据点。你知道吗

# Merging
>>> merge1 = A.merge(B, left_on='name_1', right_on='name', how='outer')
>>> merge1
  name_1 name_2  values_x name  values_y
0      a      b         1    a       1.0
1      b      c         2    b       2.0
2      c      b         3    c       3.0
3      d      a         4  NaN       NaN

>>> merge = merge1.merge(C, left_on='name_2', right_on='name', how='outer')
>>> merge
  name_1 name_2  values_x name_x  values_y name_y  values
0      a      b         1      a       1.0      b      20
1      c      b         3      c       3.0      b      20
2      b      c         2      b       2.0      c      30
3      d      a         4    NaN       NaN      a      10

用途:

#convert count column to index for possible processing all another cols by groups
df1 = df1.set_index('count')
#groups by last value after last _
c = df1.columns.str.rsplit('_').str[-1]
#removed unnecessary id column from df2
df2 = df2.drop('id', axis=1)

#for list of DataFrames
dfs = []
#iterate groups
for i, x in df1.groupby(c, axis=1):
    #change columns names for match and for avoid duplicated columns names
    df2.columns = [ f'a_name_{i}'] + (df2.columns + f'_{i}').tolist()[1:]
    #left join
    x = x.merge(df2, on=f'a_name_{i}', how='left')
    #convert duplicates by a_name columns to NaNs
    m = x.duplicated(subset=[x.columns[0]])
    x.iloc[m.to_numpy(), 2:] = np.nan
    #convert id_qname columns to end
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    #append to list
    dfs.append(x)

#join together and last add count column from index 
df = pd.concat(dfs, axis=1).assign(count=df1.index)

print (df)
  a_name_0 c01_0 c02_0 c03_0  id_qname_0 a_name_1 c01_0_1 c02_0_1 c03_0_1  \
0  country  dtr1  dtr2  dtr3           1      NaN     NaN     NaN     NaN   
1   region  dtc1  dtc2  dtc3           2     city    dta1    dta2    dta3   
2   region   NaN   NaN   NaN           2     city     NaN     NaN     NaN   
3   region   NaN   NaN   NaN           3      age   dtCo1   dtCo2   dtCo3   
4   region   NaN   NaN   NaN           3      age     NaN     NaN     NaN   

   id_qname_1 a_name_2 c01_0_1_2 c02_0_1_2 c03_0_1_2  id_qname_2  count  
0         NaN      NaN       NaN       NaN       NaN         NaN    100  
1         NaN      NaN       NaN       NaN       NaN         NaN     20  
2         NaN      NaN       NaN       NaN       NaN         NaN     80  
3         4.0      sex      dts1      dts2      dts3         6.0     40  
4         5.0      sex       NaN       NaN       NaN         7.0     60  

相关问题 更多 >