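I have a messy list (about 10K items) to clean up, and I have some questions about using regular expressions in Python to do it. Here is a small sample of my list: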
product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
"#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
"(Archived)S.O.S. Steel Wool Soap Pads",
"(ARCHIVED) HTH Spa pH Increaser",
"****GLUE STICKS",
"-20°F Splash Windshield Washer Fluid",
"01127 – Fing’rs Mighty Drop, 3g",
"10-01130-Brush On Nail Glue (Three Bond TB1743)",
"Aveeno® Continuous Protection Sunblock Spray Products"]
Ideally, I would like to remove symbols such as #, *, ®, –, °F, numbers such as 101, 10-01130-, 01127, and words in parentheses such as (Archived) and (Three Bond TB1743). The desired final output is:
product_pool=["BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
"Cell phone, Triangle wand 5 sections lip gloss",
"S.O.S. Steel Wool Soap Pads",
"HTH Spa pH Increaser",
"GLUE STICKS",
"Splash Windshield Washer Fluid",
"Fing'rs Mighty Drop",
"Brush On Nail Glue",
"Aveeno Continuous Protection Sunblock Spray Products"]
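My approach is to split each product on the symbols I don't want to keep, and then keep only the purely alphabetic pieces. But this doesn't seem to work very well, so I would appreciate your suggestions!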
import re

for product in product_pool:
    product_split = re.split(' |, |[) |* |-]', product)
    print(' '.join(ch for ch in product_split if ch.isalpha()))
The output is as follows:
BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA
Cell phone Triangle wand sections lip gloss
Steel Wool Soap Pads (S.O.S. is missing)
HTH Spa pH Increaser
GLUE STICKS
Splash Windshield Washer Fluid
Mighty Drop (Fing'rs is missing)
Brush On Nail Glue Bond
Continuous Protection Sunblock Spray Products (Aveeno is missing)
There is still some extra whitespace left over, but this could be one approach.
You can play with the set of characters you want to keep; see the string constants, such as string.punctuation and string.ascii_letters, which can be substituted into the regex.
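As a rough illustration of how changing the kept characters changes the result, the sketch below builds a negated character class from the string constants. The helper name build_pattern and the two character selections are assumptions made for this example, not part of the original answer.

```python
import re
import string

def build_pattern(keep):
    """Compile a pattern that matches every character NOT in `keep`."""
    # re.escape keeps characters such as '.' literal inside the class.
    return re.compile("[^{}]".format(re.escape(keep)))

# Two hypothetical selections of characters to keep.
letters_only = build_pattern(string.ascii_letters + " ")
letters_and_punct = build_pattern(string.ascii_letters + " .,'")

sample = "(Archived)S.O.S. Steel Wool Soap Pads"
print(letters_only.sub("", sample))       # ArchivedSOS Steel Wool Soap Pads
print(letters_and_punct.sub("", sample))  # ArchivedS.O.S. Steel Wool Soap Pads
```

Note that the word Archived survives in both cases, because only the parentheses fall outside the kept set; the parenthesized block needs its own term in the pattern.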
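The regex pattern [^...] matches anything that is not in .... You can then use re.sub to replace all of those matches with an empty string, effectively removing them. The second term of the pattern matches the archived block, and (?i) tells it to ignore case for that block.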
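A minimal sketch of the approach described above, under a few assumptions: the set of kept characters (KEEP) and the function name clean are chosen here for illustration, and the case-insensitive term is written with the scoped form (?i:...) because recent Python versions (3.11 and later) reject a global (?i) that is not at the very start of the pattern.

```python
import re
import string

# Characters to keep; this particular selection is an assumption.
KEEP = string.ascii_letters + " .,'"

# First alternative: a literal "(archived)" block, matched case-insensitively
# through the scoped inline flag (?i:...).
# Second alternative: any single character that is not in KEEP.
PATTERN = re.compile(r"\((?i:archived)\)|[^{}]".format(re.escape(KEEP)))

def clean(product):
    """Delete every match, then collapse the leftover runs of whitespace."""
    stripped = PATTERN.sub("", product)
    return " ".join(stripped.split())

product_pool = [
    "(Archived)S.O.S. Steel Wool Soap Pads",
    "(ARCHIVED) HTH Spa pH Increaser",
    "****GLUE STICKS",
]
for product in product_pool:
    print(clean(product))
```

This does not reproduce the desired output for every item. For example, the F of "-20°F" is a letter and is kept, and "(Three Bond TB1743)" only loses its parentheses and digits, leaving "Three Bond TB". Widening the first alternative to any parenthesized block, for instance \([^)]*\), would be one possible way to handle the latter.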