How to delete rows containing more than 3 non-ASCII characters

Posted 2024-06-02 08:25:24


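I want to delete all records/rows whose sms column contains more than 3 junk (non-ASCII) characters. Put simply, I want to delete rows 4 and 5 from the data frame below.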

id    city    department    sms                    category
01    khi      revenue      quk respns.                1
02    lhr      revenue      good.                      1
03    lhr      revenue      greatœ1øið                 0
04    isb      accounts     ?xœ1øiûüð÷üœç8i            0
05    isb      accounts     %â¡ã‘ã¸$ãªã±t%rã«ãÿã©â£    0

Expected data frame:

id city department        sms   category
1  khi    revenue  quk respns.         1
2  lhr    revenue        good.         1
3  lhr    revenue   greatœ1øið         0

3 Answers

The ASCII table only goes up to 127, so if ord(<character>) returns a value greater than 127, that character is not a valid ASCII character.
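
As a minimal sketch of that check, the helper below counts the characters in a string whose code point is above 127. The name count_non_ascii is made up for illustration; the sample strings are taken from the sms column in the question.

```python
def count_non_ascii(text: str) -> int:
    """Count characters whose code point is above 127 (outside ASCII)."""
    return sum(ord(ch) > 127 for ch in text)


# Sample values copied from the sms column in the question
samples = ["good.", "greatœ1øið", "?xœ1øiûüð÷üœç8i"]
for s in samples:
    print(repr(s), count_non_ascii(s))
```

A row would be kept when this count is 3 or less and dropped otherwise.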

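Using this, we can count the non-ASCII characters in each sms value and return True only for rows that contain more than 3 of them, so that those rows can be dropped: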

df = df.drop(df.loc[df["sms"].apply(lambda x: sum(ord(i) > 127 for i in x) > 3)].index)

Output:

   id city department          sms  category
0   1  khi    revenue  quk respns.         1
1   2  lhr    revenue        good.         1
2   3  lhr    revenue   greatœ1øið         0

import string

# Printable ASCII characters, including the space that a hand-typed set easily omits
ascii_chars = set(string.printable)
for r, row in df.iterrows():
    for k, data in row.items():  # Series.iteritems() was removed in pandas 2.0
        if not isinstance(data, str):
            continue  # skip numeric columns such as id and category
        non_ascii_count = sum(c not in ascii_chars for c in data)
        if non_ascii_count > 3:
            # remove row
            df = df.drop([r])
            break
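
The loop above inspects every column of each row, not just sms. A possible vectorized sketch of the same idea follows; the two-row frame and the text_cols list are hypothetical and only serve to show a junk value sitting in a column other than sms.

```python
import pandas as pd

# Hypothetical data: the junk value is in city, not in sms
df = pd.DataFrame({
    "id": [1, 2],
    "city": ["khi", "ûüð÷ü"],
    "sms": ["good.", "ok"],
})

# Assumption: the caller knows which columns hold text
text_cols = ["city", "sms"]

# Per-cell count of characters with a code point above 127
counts = df[text_cols].apply(
    lambda col: col.map(lambda s: sum(ord(ch) > 127 for ch in s))
)

# Flag a row if any of its text columns has more than 3 such characters
too_noisy = counts.gt(3).any(axis=1)
cleaned = df[~too_noisy]
print(cleaned)
```

Filtering with a boolean mask avoids reassigning df inside a loop, at the cost of having to name the text columns up front.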

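We can use Series.str.count to count, for each string in the sms column, the occurrences of the regex pattern [^\x00-\x7F] (which matches a single non-ASCII character), then use Series.gt to create a boolean mask and filter the rows with that mask: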

m = df['sms'].str.count(r'[^\x00-\x7F]').gt(3)
df = df[~m]

Result:

   id city department          sms  category
0   1  khi    revenue  quk respns.         1
1   2  lhr    revenue        good.         1
2   3  lhr    revenue   greatœ1øið         0
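
For anyone who wants to try this end to end, here is a self-contained sketch that rebuilds the question's data frame by hand and applies the same mask. The id and category values are assumed to be integers, as the printed output suggests; the name cleaned is just for illustration.

```python
import pandas as pd

# Data frame reconstructed from the table in the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "city": ["khi", "lhr", "lhr", "isb", "isb"],
    "department": ["revenue", "revenue", "revenue", "accounts", "accounts"],
    "sms": [
        "quk respns.",
        "good.",
        "greatœ1øið",
        "?xœ1øiûüð÷üœç8i",
        "%â¡ã‘ã¸$ãªã±t%rã«ãÿã©â£",
    ],
    "category": [1, 1, 0, 0, 0],
})

# True for rows whose sms value has more than 3 non-ASCII characters
m = df["sms"].str.count(r"[^\x00-\x7F]").gt(3)

# Keep only the rows that are not flagged
cleaned = df[~m]
print(cleaned)
```

The third row survives because it contains exactly 3 non-ASCII characters, which does not exceed the threshold.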
