Select rows where at least one column has a negative value

Posted 2024-06-16 10:24:07


Given a DataFrame:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
    {'AgeAtMedStart': {1: -46.47, 2: 46.47, 3: 46.8, 4: 51.5, 5: 51.5},
     'AgeAtMedStop': {1: 46.8, 2: 46.8, 3: nan, 4: -51.9, 5: 51.81},
     'MedContinuing': {1: 'No', 2: 'No', 3: 'Yes', 4: 'No', 5: 'No'},
     'Medication': {1: 'Med1', 2: 'Med2', 3: 'Med3', 4: 'Med4', 5: 'Med4'},
     'YearOfMedStart': {1: 2016.0, 2: 2016.0, 3: 2016.0, 4: 2016.0, 5: 2016.0}}
)

df 
   AgeAtMedStart  AgeAtMedStop MedContinuing Medication  YearOfMedStart
1         -46.47         46.80            No       Med1          2016.0
2          46.47         46.80            No       Med2          2016.0
3          46.80           NaN           Yes       Med3          2016.0
4          51.50        -51.90            No       Med4          2016.0
5          51.50         51.81            No       Med4          2016.0

I want to filter it so that I keep only the rows where any of the "AgeAt*" columns holds a negative value.

The output I expect is the row at index 1, because "AgeAtMedStart" is -46.47, and the row at index 4, because "AgeAtMedStop" is -51.9, so the result would be:

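   AgeAtMedStart  AgeAtMedStop MedContinuing Medication  YearOfMedStart
1         -46.47          46.8            No       Med1          2016.0
4          51.50         -51.9            No       Med4          2016.0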

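Edit 1:

I tried the different answers provided so far, but they all return an empty DataFrame. I believe part of the problem is that I have other columns named AgeAtMedStartFlag (and AgeAtMedStopFlag) that contain strings. So for this sample CSV: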

RecordKey   Medication  CancerSiteForTreatment  CancerSiteForTreatmentCode  TreatmentLineCodeKey    AgeAtMedStart   AgeAtMedStartFlag   YearOfMedStart  MedContinuing   AgeAtMedStop    AgeAtMedStopFlag    ChangeOfTreatment
1   Drug1   Site1   C1.0    First   -46.47  Year And Month Are Known But Day Is Missing And Coded To 15 2016    No  46.8    Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1   Drug2   Site2   C1.1    First   46.47   Year And Month Are Known But Day Is Missing And Coded To 15 2016    No  46.8    Year And Month Are Known But Day Is Missing And Coded To 15 Yes
1   Drug3   Site3   C1.2    First   46.8    Year And Month Are Known But Day Is Missing And Coded To 15 2016    Yes         Yes
2   Drug4   Site4   C1.3    First   51.5        2016    No  51.9        Yes
2   Drug5   Site5   C1.4    First   51.5        2016    No  -51.81      Yes
3   Drug6   Site6   C1.5    First   73.93       2016    No  74.42       Yes
3   Drug7   Site7   C1.6    First   73.93       2016    No  74.42       Yes
4   Drug8   Site8   C1.7    First   36.66       2015    No  37.24       Yes
4   Drug9   Site9   C1.8    First   36.66       2015    No  37.24       Yes
4   Drug10  Site10  C1.9    First   36.66       2015    No  37.24       Yes
9   Drug11  Site11  C1.10   First   43.55       2016    No  43.68       Yes
9   Drug12  Site12  C1.11   First   43.22       2016    No  43.49       Yes
9   Drug13  Site13  C1.12   First   43.55       2016    No  43.68       Yes
9   Drug14  Site14  C1.13   First   43.22       2016    No  43.49       Yes
10  Drug15  Site15  C1.14   First   74.42       2016    No  74.84       Yes
10  Drug16  Site16  C1.15   First   73.56       2015    No  73.98       Yes
10  Drug17  Site17  C1.16   First   73.56       2015    No  73.98       No
10  Drug18  Site18  C1.17   First   74.42       2016    No  74.84       No
10  Drug19  Site19  C1.18   First   73.56       2015    No  73.98       No
10  Drug20  Site20  C1.19   First   74.42       2016    No  74.84       No
11  Drug21  Site21  C1.20   First   70.72       2013    No  72.76       No
11  Drug22  Site22  C1.21   First   68.76       2011    No  70.62       No
11  Drug23  Site23  C1.22   First   73.43       2016    No  73.96       No
11  Drug24  Site24  C1.23   First   72.76       2015    No  73.43       No

I made the following change to my script:

age_df = df.columns[(df.columns.str.startswith('AgeAt')) & (~df.columns.str.endswith('Flag'))]

df[df[age_df] < 0].to_excel('invalid.xlsx', 'Benjamin_Button')

It returns:

RecordKey   Medication  CancerSiteForTreatment  CancerSiteForTreatmentCode  TreatmentLineCodeKey    AgeAtMedStart   AgeAtMedStartFlag   YearOfMedStart  MedContinuing   AgeAtMedStop    AgeAtMedStopFlag    ChangeOfTreatment
1                   -46.47                     
1                                          
1                                          
2                                          
2                                   -51.81     
3                                          
3                                          
4                                          
4                                          
4                                          
9                                          
9                                          
9                                          
9                                          
10                                         
10                                         
10                                         
10                                         
10                                         
10                                         
11                                         
11                                         
11                                         
11

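Can I modify this implementation so that it returns only the rows that contain a negative number and, if possible, the remaining values of those rows? Or better still, just the negative age and the RecordKey of that row.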


2 answers

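Here is a simple one-liner for you. If you need to determine programmatically whether the columns are numeric, see coldspeed's answer. But if you are fine with referencing the columns explicitly, a simple approach like this will do.

Note that I also fill NaN with 0; this satisfies your requirement even when data is missing. NaN can be handled in other ways, but this is sufficient here. Be aware that fillna(0) fills every column, so missing values in other columns you want to keep will also show up as 0 in the result (for simplicity I have not handled that here).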

myData = df.fillna(0).query('AgeAtMedStart < 0 or AgeAtMedStop < 0')

Returns:

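   AgeAtMedStart  AgeAtMedStop MedContinuing Medication  YearOfMedStart
1         -46.47          46.8            No       Med1          2016.0
4          51.50         -51.9            No       Med4          2016.0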

The native pandas query method is very handy for simple filter expressions.

For more information, see the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
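If you would rather keep the NaN values in the result instead of replacing them with 0, a plain boolean mask may work as well, because a comparison against NaN evaluates to False. A minimal sketch, assuming the sample frame from the question (the names mask and negative_rows are only illustrative):

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    {'AgeAtMedStart': {1: -46.47, 2: 46.47, 3: 46.8, 4: 51.5, 5: 51.5},
     'AgeAtMedStop': {1: 46.8, 2: 46.8, 3: np.nan, 4: -51.9, 5: 51.81},
     'MedContinuing': {1: 'No', 2: 'No', 3: 'Yes', 4: 'No', 5: 'No'},
     'Medication': {1: 'Med1', 2: 'Med2', 3: 'Med3', 4: 'Med4', 5: 'Med4'},
     'YearOfMedStart': {1: 2016.0, 2: 2016.0, 3: 2016.0, 4: 2016.0, 5: 2016.0}}
)

# NaN < 0 is False, so a missing age never selects a row
# and the original NaN values are left untouched.
mask = (df['AgeAtMedStart'] < 0) | (df['AgeAtMedStop'] < 0)
negative_rows = df[mask]

print(negative_rows)
```

The trade-off is that the column names are still spelled out by hand, just as in the query version.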

First get the columns of interest:

cols = [col for col in df if col.lower().startswith('ageat')]

Then get a DataFrame with just those columns:

df_wanted = df[cols]

Then get the rows:

x = df_wanted[df_wanted < 0]

Of course, when you look at several columns, the cells that are not negative will show up as NaN.
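To get whole rows back rather than a frame full of NaN, the cell-wise comparison can be reduced to one boolean per row with any(axis=1). The sketch below is only an illustration: it assumes a small made-up frame shaped like the CSV in Edit 1 (the 'Day missing' flag text and all variable names are hypothetical), and it leaves out the string-valued *Flag columns, since comparing strings with 0 may raise a TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the CSV from Edit 1 (flag text shortened).
df = pd.DataFrame({
    'RecordKey': [1, 1, 1, 2, 2],
    'Medication': ['Drug1', 'Drug2', 'Drug3', 'Drug4', 'Drug5'],
    'AgeAtMedStart': [-46.47, 46.47, 46.8, 51.5, 51.5],
    'AgeAtMedStartFlag': ['Day missing', 'Day missing', 'Day missing', None, None],
    'AgeAtMedStop': [46.8, 46.8, np.nan, 51.9, -51.81],
    'AgeAtMedStopFlag': ['Day missing', 'Day missing', None, None, None],
})

# Age columns only: names start with 'AgeAt' but do not end with 'Flag'.
age_cols = [c for c in df.columns
            if c.startswith('AgeAt') and not c.endswith('Flag')]

# One boolean per row: True if any age in that row is negative.
row_mask = (df[age_cols] < 0).any(axis=1)

# Full rows, with every other column intact.
bad_rows = df[row_mask]

# Long format: RecordKey plus only the negative ages.
long_ages = df.melt(id_vars='RecordKey', value_vars=age_cols,
                    var_name='AgeColumn', value_name='Age')
negative_ages = long_ages[long_ages['Age'] < 0]

print(bad_rows)
print(negative_ages)
```

Here bad_rows keeps the remaining values of each offending row, while negative_ages holds just the RecordKey, the name of the age column, and the negative value, which is closer to what Edit 1 asks for.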
