Check the format of multiple columns and append the results to a single column of the table

Posted 2024-05-13 02:54:36

Given a toy dataset as follows:

   id    room   area           situation
0   1   A-102  world  under construction
1   2     NaN     24  under construction
2   3    B309    NaN                 NaN
3   4   C·102     25    under decoration
4   5  E_1089  hello    under decoration
5   6      27    NaN          under plan
6   7      27    NaN                 NaN
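For anyone who wants to reproduce the table, here is a minimal sketch of how it could be built. It assumes that room, area and situation are all stored as text (object dtype) with np.nan in the missing cells; the real data may be typed differently.

```python
import numpy as np
import pandas as pd

# Assumption: every column except id holds strings, and missing cells are np.nan.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})
print(df)
```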

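I need to check three columns, room, area and situation, based on the following conditions:

(1) If the room name is not made up of digits, letters and - (NaN is also considered invalid), return "incorrect room name" as check;

(2) If area is neither a number nor NaN, return "area is not numbers" and append it to the existing check;

(3) If situation contains "under decoration", return "decoration is in the content" and append it to the existing check.

Note that the real data has other columns to check as well, so each new check result needs to be appended with the separator ;.

How can I get an expected result like this?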

   id    room   area           situation                                              check
0   1   A-102  world  under construction                                area is not numbers
1   2     NaN     24  under construction                                                incorrect room name
2   3    B309    NaN                 NaN                                                NaN
3   4   C·102     25    under decoration  incorrect room name; decoration is in the content
4   5  E_1089  hello    under decoration  incorrect room name; area is not numbers; decoration is in the content
5   6      27    NaN          under plan                                                NaN
6   7      27    NaN                 NaN                                                NaN

My code so far:

Room name check:

df['check'] = np.where(df.room.str.match('^[a-zA-Z\d\-]*$'), np.NaN, 'incorrect room name')

Output:

   id    room   area           situation                check
0   1   A-102  world  under construction                  nan
1   2     NaN     24  under construction                  nan
2   3    B309    NaN                 NaN                  nan
3   4   C·102     25    under decoration  incorrect room name
4   5  E_1089  hello    under decoration  incorrect room name
5   6      27    NaN          under plan                  nan
6   7      27    NaN                 NaN                  nan

Area check:

df['check'] = df['check'].where(df.area.str.contains('^\d+$', na = True),
                                'area is not a numbers') 

Output:

   id    room   area           situation                  check
0   1   A-102  world  under construction  area is not a numbers
1   2     NaN     24  under construction                    nan
2   3    B309    NaN                 NaN                    nan
3   4   C·102     25    under decoration    incorrect room name
4   5  E_1089  hello    under decoration  area is not a numbers
5   6      27    NaN          under plan                    nan
6   7      27    NaN                 NaN                    nan

Situation check:

df['check'] = df['check'].where(df.situation.str.contains('under decoration', na = True),
                                'decoration is in the content') 

Output:

   id    room   area           situation                         check
0   1   A-102  world  under construction  decoration is in the content
1   2     NaN     24  under construction  decoration is in the content
2   3    B309    NaN                 NaN                           nan
3   4   C·102     25    under decoration           incorrect room name
4   5  E_1089  hello    under decoration         area is not a numbers
5   6      27    NaN          under plan  decoration is in the content
6   7      27    NaN                 NaN                           nan

Thanks.


3 Answers

I modified your conditions slightly so that the result is closer to your expected output:

a = np.where(df.room.str.match('^[a-zA-Z\d\-]*$').notnull(), pd.NA, 'incorrect room name')
b = np.where(df["area"].str.isnumeric() & df["area"].notnull(), pd.NA, 'area is not a numbers')
c = np.where(df.situation.str.contains('under decoration', na = False), 'decoration is in the content', pd.NA)

s = (pd.concat([pd.Series(i, index=df.index) for i in (a, b, c)], axis = 1)
       .stack().groupby(level = 0).agg("; ".join))

print(df.assign(check=s))

   id    room   area           situation                                              check
0   1   A-102  world  under construction                              area is not a numbers
1   2     NaN     24  under construction                                incorrect room name
2   3    B309    NaN                 NaN  area is not a numbers; decoration is in the co...
3   4   C·102     25    under decoration                       decoration is in the content
4   5  E_1089  hello    under decoration  area is not a numbers; decoration is in the co...
5   6      27    NaN          under plan                              area is not a numbers
6   7      27    NaN                 NaN  area is not a numbers; decoration is in the co...
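The output printed above still differs from the expected result in several rows (row 3, for example, lacks "incorrect room name"). Below is a hedged variant of the same concat/stack/groupby idea that seems to reproduce the expected table. It is only a sketch: it assumes the three columns hold strings, it treats a missing room as invalid and a missing area as valid, and it uses the message wording from the question. The names room_ok, area_ok, decorated and message are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Assumption: the toy data, with text columns and np.nan for missing cells.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# The na argument decides how a missing cell is treated by each check.
room_ok = df["room"].str.fullmatch(r"[a-zA-Z\d\-]*", na=False)  # NaN room is invalid
area_ok = df["area"].str.fullmatch(r"\d+", na=True)             # NaN area is accepted
decorated = df["situation"].str.contains("under decoration", na=False)


def message(text):
    """Return a column that repeats text on every row (hypothetical helper)."""
    return pd.Series(text, index=df.index)


# One column per check: the message where the check fires, NaN elsewhere.
messages = pd.concat(
    [
        message("incorrect room name").mask(room_ok),
        message("area is not numbers").mask(area_ok),
        message("decoration is in the content").where(decorated),
    ],
    axis=1,
)

# Stack to long form, drop the empty cells, and join what is left per row.
s = messages.stack().dropna().groupby(level=0).agg("; ".join)

# Rows without any message are absent from s, so they become NaN here.
result = df.assign(check=s)
print(result)
```

The explicit dropna() is there because whether stack() discards missing values by itself depends on the pandas version.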

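First change the output of each test so that it is either a message or None, then zip the arrays and apply a custom function that joins the values that are not missing (and returns NaN when all of them are missing):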

a = np.where(df.room.str.match('^[a-zA-Z\d\-]*$', na = False), None,
                               'incorrect room name')
b = np.where(df.area.str.contains('^\d+$', na = True), None,
                                 'area is not a numbers')  
c = np.where(df.situation.str.contains('under decoration', na = False),
                                      'decoration is in the content', None) 


f = (lambda x: ';'.join(y for y in x if pd.notna(y)) 
                if any(pd.notna(np.array(x))) else np.nan )
df['check'] = [f(x) for x in zip(a,b,c)]
print(df)
   id    room   area           situation  \
0   1   A-102  world  under construction   
1   2     NaN     24  under construction   
2   3    B309    NaN                 NaN   
3   4   C·102     25    under decoration   
4   5  E_1089  hello    under decoration   
5   6      27    NaN          under plan   
6   7      27    NaN                 NaN   

                                               check  
0                              area is not a numbers  
1                                incorrect room name  
2                                                NaN  
3   incorrect room name;decoration is in the content  
4  incorrect room name;area is not a numbers;deco...  
5                                                NaN  
6                                                NaN  
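Since the question mentions that the real data has more columns to check, the same idea can be arranged so that adding a check means adding one entry. The sketch below is one possible layout, not the method from the answer above: it keeps a boolean column per message in a frame called flags (True means the message applies) and joins the labels of the True cells with "; ". The names flags and join_flags are hypothetical, and the columns are assumed to hold strings.

```python
import numpy as np
import pandas as pd

# Assumption: the toy data, with text columns and np.nan for missing cells.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
    "situation": ["under construction", "under construction", np.nan,
                  "under decoration", "under decoration", "under plan", np.nan],
})

# One boolean column per message; True means the message applies to the row.
# A further check on another column would be one more entry in this dict.
flags = pd.DataFrame({
    "incorrect room name":
        ~df["room"].str.fullmatch(r"[a-zA-Z\d\-]*", na=False),
    "area is not numbers":
        ~df["area"].str.fullmatch(r"\d+", na=True),
    "decoration is in the content":
        df["situation"].str.contains("under decoration", na=False),
})


def join_flags(flags, sep="; "):
    """Join the column labels of the True cells in each row; NaN if none."""
    labels = flags.columns.to_numpy()
    rows = flags.to_numpy(dtype=bool)
    # An empty join yields "", which is falsy, so "or" turns it into NaN.
    joined = [sep.join(labels[row]) or np.nan for row in rows]
    return pd.Series(joined, index=flags.index, dtype=object)


df["check"] = join_flags(flags)
print(df)
```

Keeping the flags as a separate boolean frame also makes it easy to count how often each check fires, for example with flags.sum().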

You can try the following:

import glob
import os

import pandas as pd

os.chdir(r"C:\Users\Rameez PC\Desktop\python data files 2")

extension = 'xlsx'
all_filenames = glob.glob('*.{}'.format(extension))

# combine all files in the list
combined_xlsx1 = pd.concat([pd.read_excel(f) for f in all_filenames])
# export to Excel
combined_xlsx1.to_excel("combined.xlsx", index=False)
