Cleaning messy strings with Python

Posted 2024-04-29 07:11:51


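I have a messy list (about 10K entries) to clean up, and I have some questions about using regular expressions in Python to do it. Here is a small sample of my list: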

product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
              "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss", 
              "(Archived)S.O.S. Steel Wool Soap Pads", 
              "(ARCHIVED) HTH Spa pH Increaser",
              "****GLUE STICKS",
              "-20°F Splash Windshield Washer Fluid",
              "01127 – Fing’rs Mighty Drop, 3g",
              "10-01130-Brush On Nail Glue (Three Bond TB1743)",
              "Aveeno® Continuous Protection Sunblock Spray Products"]

Ideally, I want to remove symbols such as #, *, ®, –, and °F, numbers such as 101, 10-01130-, and 01127, and the words in parentheses, such as (Archived) and (Three Bond TB1743). The final output should be:

product_pool=["BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
              "Cell phone, Triangle wand 5 sections lip gloss", 
              "S.O.S. Steel Wool Soap Pads", 
              "HTH Spa pH Increaser",
              "GLUE STICKS",
              "Splash Windshield Washer Fluid",
              "Fing'rs Mighty Drop",
              "Brush On Nail Glue",
              "Aveeno Continuous Protection Sunblock Spray Products"]

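My approach was to split each product on the symbols I don't want to keep and then keep only the purely alphabetic pieces. But this doesn't seem to work very well, so I would appreciate your suggestions!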

import re

for product in product_pool:
    product_split = re.split(r' |, |[) |* |-]', product)
    print(' '.join(ch for ch in product_split if ch.isalpha()))

The output is as follows:

BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA
Cell phone Triangle wand sections lip gloss
Steel Wool Soap Pads (S.O.S. is missing)
HTH Spa pH Increaser
GLUE STICKS
Splash Windshield Washer Fluid
Mighty Drop (Fing'rs is missing)
Brush On Nail Glue Bond
Continuous Protection Sunblock Spray Products (Aveeno is missing)
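
As an illustration (not part of the original post), a quick check suggests why those tokens disappear: str.isalpha() returns True only when every character in the token is a letter, so any token containing a period, an apostrophe, or ® is filtered out as a whole. The token list below is hand-picked from the sample data.

```python
# Tokens taken from the sample list; only the last one is purely alphabetic.
tokens = ["S.O.S.", "Fing’rs", "Aveeno®", "Steel"]

# isalpha() is all-or-nothing: one non-letter character rejects the whole token.
results = {tok: tok.isalpha() for tok in tokens}

for tok, ok in results.items():
    print(tok, ok)
```

This is why filtering whole tokens loses more than intended; filtering character by character avoids that.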

2 Answers

There are still some extra spaces left over, but this might be one approach.

import string
goodChars = string.ascii_letters + '.' + ' '
cleaned = [''.join(i for i in word if i in goodChars) for word in product_pool]

>>> cleaned
[' BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA',
 'WCS  Cell phone Triangle wand   sections lip gloss',
 'ArchivedS.O.S. Steel Wool Soap Pads',
 'ARCHIVED HTH Spa pH Increaser',
 'GLUE STICKS',
 'F Splash Windshield Washer Fluid',
 '  Fingrs Mighty Drop g',
 'Brush On Nail Glue Three Bond TB',
 'Aveeno Continuous Protection Sunblock Spray Products']

You can play around with which characters you keep; see string.punctuation and string.ascii_letters under string constants.
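
The leftover runs of spaces can be collapsed in a second pass. Below is a minimal sketch that assumes the same character whitelist as above; the helper name squeeze_spaces is made up for illustration, and the two sample strings come from the question.

```python
import string

# Same whitelist idea as above: letters, period, and space.
good_chars = string.ascii_letters + ". "

def squeeze_spaces(text):
    # str.split() with no arguments splits on any run of whitespace and
    # discards empty pieces, so re-joining leaves single spaces only.
    return " ".join(text.split())

sample = [
    "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
    "01127 – Fing’rs Mighty Drop, 3g",
]

cleaned_sample = [
    squeeze_spaces("".join(c for c in s if c in good_chars)) for s in sample
]
print(cleaned_sample)
```

This only tidies the whitespace; stray fragments such as WCS or g still remain, because the whitelist strips the digits out of the codes but keeps their letters.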

You can do this with a regex substitution via re.sub.

import re

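# (?i:...) limits case-insensitivity to 'archived'; a bare (?i) in mid-pattern is an error on Python 3.11+
pattern = r'[^a-zA-Z\s]|(?i:archived)'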
results = [re.sub(pattern, '', s).strip() for s in product_pool]
# ['BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA',
#  'WCS  Cell phone Triangle wand   sections lip gloss',
#  'SOS Steel Wool Soap Pads',
#  'HTH Spa pH Increaser',
#  'GLUE STICKS',
#  'F Splash Windshield Washer Fluid',
#  'Fingrs Mighty Drop g',
#  'Brush On Nail Glue Three Bond TB',
#  'Aveeno Continuous Protection Sunblock Spray Products']

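The regex pattern [^...] matches anything that is not in the ... set. re.sub then replaces all of those matches with an empty string, which effectively deletes them. The second alternative in the pattern matches the archived blocks, and the (?i) flag tells it to ignore case when matching them.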
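
Neither answer quite reaches the output the question asks for (S.O.S. loses its periods, and fragments of the codes survive). One possible refinement is to remove whole units in stages rather than single characters. This is a sketch under assumptions, not a general solution: the function name clean_name and every rule in it are guesses fitted to the nine sample strings. In particular, it assumes that leading or trailing tokens containing a digit are always codes or sizes, which would wrongly strip a product name that really starts with such a token.

```python
import re

product_pool = [
    "#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
    "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
    "(Archived)S.O.S. Steel Wool Soap Pads",
    "(ARCHIVED) HTH Spa pH Increaser",
    "****GLUE STICKS",
    "-20°F Splash Windshield Washer Fluid",
    "01127 – Fing’rs Mighty Drop, 3g",
    "10-01130-Brush On Nail Glue (Three Bond TB1743)",
    "Aveeno® Continuous Protection Sunblock Spray Products",
]

def clean_name(name):
    # 1. Drop anything in parentheses, e.g. "(Archived)".
    name = re.sub(r"\([^)]*\)", " ", name)
    # 2. Drop temperature units such as "°F" so the "F" is not left behind.
    name = re.sub(r"°[CF]\b", " ", name)
    # 3. Drop leading symbols and leading tokens that contain a digit
    #    (assumed to be product codes), e.g. "#W65066CS - " or "10-01130-".
    name = re.sub(r"^\W*(?:\w*\d\w*\W+)*", "", name)
    # 4. Drop a trailing token that contains a digit (assumed to be a size),
    #    e.g. ", 3g".
    name = re.sub(r"[,\s]+\w*\d\w*\s*$", "", name)
    # 5. Normalise the curly apostrophe, then blank out any character that
    #    is not a letter, digit, period, comma, apostrophe, or space.
    name = name.replace("’", "'")
    name = re.sub(r"[^A-Za-z0-9.,' ]", " ", name)
    # 6. Collapse runs of whitespace and trim the ends.
    return " ".join(name.split())

cleaned = [clean_name(p) for p in product_pool]
for item in cleaned:
    print(item)
```

On a real 10K-entry list, the rules would probably need adjusting after spot-checking the results, since the samples here cover only a few of the patterns that may occur.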
