Sum values by string name

ID  Zone1  CHC1  Value1  Zone2  CHC2  Value2  Zone3  CHC3  Value3  total  total1
1   R5B    100   10      C2     0     20      R10A   2     5       35     0
1   C2     95    20      M2-6   5     6       R5B    7     3       23     6
3   C2     40    4       C4     60    6       0      6     0       10     0
3   C1     100   8       0      0     0       0      100   0       8      0
5   M1-5   10    6       M2-6   86    15      0      0     0       0      21

2 answers

User

Answer #1 · edited 2024-05-28 18:19:04

To get separate DataFrames for the Zone and Value columns, use DataFrame.filter:

z = df.filter(like='Zone')
v = df.filter(like='Value')
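
As a minimal sketch of what the two filter calls select, the following builds a small frame and prints the resulting column names. The two-row frame is made up for illustration and only borrows column names from the question. Note that like= is a plain substring match on the column labels, so a hypothetical column such as "ZoneInfo" would be picked up too; the regex= parameter is a stricter alternative.

```python
import pandas as pd

# Illustrative frame; only the column names follow the question
df = pd.DataFrame({
    "ID": [1, 1],
    "Zone1": ["R5B", "C2"],
    "CHC1": [100, 95],
    "Value1": [10, 20],
    "Zone2": ["C2", "M2-6"],
    "CHC2": [0, 5],
    "Value2": [20, 6],
})

# like= keeps every column whose label contains the given substring
z = df.filter(like="Zone")
v = df.filter(like="Value")

# regex= can anchor the match, e.g. only Zone followed by digits
z_strict = df.filter(regex=r"^Zone\d+$")

print(z.columns.tolist())
print(v.columns.tolist())
```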

If you need to check for substrings, create boolean DataFrames with str.contains and apply:

m1 = z.apply(lambda x: x.str.contains('R|C'))
m2 = z.apply(lambda x: x.str.contains('M'))

# to match exact strings instead:
#m1 = z == 'R2'
#m2 = z.isin(['C1', 'C4'])
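
To make the difference between substring and exact matching concrete, here is a small sketch on a made-up Zone frame (the values are borrowed from the question, the frame itself is an assumption). str.contains treats its pattern as a regular expression by default, so 'R|C' means "contains R or C anywhere in the string", whereas isin only matches whole cell values.

```python
import pandas as pd

# Illustrative Zone columns; "0" is stored as a string here
z = pd.DataFrame({
    "Zone1": ["R5B", "C2", "M1-5"],
    "Zone2": ["C2", "M2-6", "0"],
})

# Substring (regex) match, applied column by column
m1 = z.apply(lambda col: col.str.contains("R|C"))
m2 = z.apply(lambda col: col.str.contains("M"))

# Exact match on whole cell values
exact = z.isin(["C2"])

print(m1)
print(m2)
print(exact)
```

If only a leading letter should count, anchoring the pattern (for example '^(R|C)') is one option.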

Finally, filter v with DataFrame.where and sum per row:

df['t'] = v.where(m1.values).sum(axis=1).astype(int)
df['t1'] = v.where(m2.values).sum(axis=1).astype(int)

print (df)
   ID Zone1  CHC1  Value1 Zone2  CHC2  Value2 Zone3  CHC3  Value3   t  t1
0   1   R5B   100      10    C2     0      20  R10A     2       5  35   0
1   1    C2    95      20  M2-6     5       6   R5B     7       3  23   6
2   3    C2    40       4    C4    60       6     0     6       0  10   0
3   3    C1   100       8     0     0       0     0   100       0   8   0
4   5  M1-5    10       6  M2-6    86      15     0     0       0   0  21
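
For anyone who wants to reproduce the output, here is a self-contained sketch that rebuilds the question's table (assuming the empty zones are stored as the string "0") and runs the same three steps. The mask is passed as a plain array because the mask columns are labelled Zone1..Zone3 while the values are labelled Value1..Value3; passing the boolean DataFrame itself would, as far as I can tell, be aligned by label and not line up with v.

```python
import pandas as pd

# Rebuild the question's data; "0" zones are assumed to be strings
df = pd.DataFrame({
    "ID": [1, 1, 3, 3, 5],
    "Zone1": ["R5B", "C2", "C2", "C1", "M1-5"],
    "CHC1": [100, 95, 40, 100, 10],
    "Value1": [10, 20, 4, 8, 6],
    "Zone2": ["C2", "M2-6", "C4", "0", "M2-6"],
    "CHC2": [0, 5, 60, 0, 86],
    "Value2": [20, 6, 6, 0, 15],
    "Zone3": ["R10A", "R5B", "0", "0", "0"],
    "CHC3": [2, 7, 6, 100, 0],
    "Value3": [5, 3, 0, 0, 0],
})

z = df.filter(like="Zone")
v = df.filter(like="Value")

# Boolean masks with the same shape as v
m1 = z.apply(lambda col: col.str.contains("R|C"))
m2 = z.apply(lambda col: col.str.contains("M"))

# Positional masking: non-matching cells become NaN and are skipped by sum
df["t"] = v.where(m1.to_numpy()).sum(axis=1).astype(int)
df["t1"] = v.where(m2.to_numpy()).sum(axis=1).astype(int)

print(df[["ID", "t", "t1"]])
```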

User

Answer #2 · edited 2024-05-28 18:19:04

Solution 1 (simpler code, but slower and less flexible)

total = []
total1 = []

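for i in range(df.shape[0]):
    temp = df.iloc[i].tolist()
    # temp.index(...) + 1 reads the cell right after the matched zone,
    # so this assumes each Value column directly follows its Zone column
    if "R2" in temp:
        total.append(temp[temp.index("R2")+1])
    else:
        total.append(0)
    if "C1" in temp and "C4" in temp: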
        total1.append(temp[temp.index("C1")+1] + temp[temp.index("C4")+1])
    else:
        total1.append(0)

df["Total"] = total
df["Total1"] = total1

Solution 2 (faster than Solution 1 and easier to customize, but may use a lot of memory)

# columns to use
cols = df.columns.tolist()
zones = [x for x in cols if x.startswith('Zone')]
vals = [x for x in cols if x.startswith('Value')]

# you can customize here
bucket1 = ['R2']
bucket2 = ['C1', 'C4']
thresh = 2 # "OR": 1, "AND": 2

original = df.copy()

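# bucket1 check: zero the column right after each non-matching zone
# (assumes each Value column directly follows its Zone column)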
for zone in zones:
    df.loc[~df[zone].isin(bucket1), cols[cols.index(zone)+1]] = 0

original['Total'] = df[vals].sum(axis=1)
df = original.copy()

# bucket2 check
for zone in zones:
    df.loc[~df[zone].isin(bucket2), cols[cols.index(zone)+1]] = 0

df['Check_Bucket'] = df[zones].stack().reset_index().groupby('level_0')[0].apply(list)
df['Check_Bucket'] = df['Check_Bucket'].apply(lambda x: len([y for y in x if y in bucket2]))
df['Total1'] = df[vals].sum(axis=1)
df.loc[df.Check_Bucket < thresh, 'Total1'] = 0
df.drop('Check_Bucket', axis=1, inplace=True)

When I scaled the original DataFrame up to 100k rows, Solution 1 took 11.4 s ± 82.1 ms per loop, while Solution 2 took 3.53 s ± 29.8 ms per loop. The difference is that Solution 2 does not loop over the rows.
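
The bucket idea from Solution 2 can also be sketched without modifying df in place. This is a rough alternative, not the answer's own code: bucket_total is a hypothetical helper name, the three-row frame is made up, and it assumes the Zone and Value columns appear in the same order so that the n-th Zone pairs with the n-th Value. As in Solution 2, thresh counts matching zone cells per row (1 behaves like "OR", 2 like "AND" for a two-item bucket).

```python
import pandas as pd

def bucket_total(df, bucket, thresh=1):
    """Sum each row's Value cells whose paired Zone cell is in `bucket`.

    Rows with fewer than `thresh` matching zones get a total of 0.
    Hypothetical helper; pairs Zone and Value columns by position.
    """
    zones = df.filter(regex=r"^Zone\d+$")
    vals = df.filter(regex=r"^Value\d+$")
    mask = zones.isin(bucket).to_numpy()
    # Non-matching cells become NaN and are ignored by sum
    total = vals.where(mask).sum(axis=1)
    # Keep the total only where enough zones matched
    total = total.where(mask.sum(axis=1) >= thresh, 0)
    return total.astype(int)

# Illustrative data, not taken from the question
df = pd.DataFrame({
    "Zone1": ["C1", "C1", "R2"],
    "Value1": [8, 3, 5],
    "Zone2": ["C4", "0", "C1"],
    "Value2": [6, 0, 2],
})

df["Total"] = bucket_total(df, ["R2"])
df["Total1"] = bucket_total(df, ["C1", "C4"], thresh=2)
print(df)
```

Because the helper selects columns by name rather than by offset, it does not depend on whether other columns (such as CHC) sit between a Zone and its Value.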
