Cleaning tweet text with a regex

Posted 2024-04-25 04:46:49


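As part of an information retrieval project in Python (building a small search engine), I want to clean the text of downloaded tweets (a CSV dataset of tweets, 27,000 of them to be exact). A tweet looks like this: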

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL

or

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX

I want to use a regex to remove the unnecessary parts of each tweet, such as URLs, punctuation and so on.

So the results should be:

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

and

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

I tried pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but the result is not what I want: for example, parts of the URL are still present in the output.
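To illustrate why the URL fragments survive: the pattern matches every run of letters wherever it occurs, so the letters inside the URL are picked up like ordinary words. Below is a minimal sketch that uses re.findall as a stand-in for RegexpTokenizer (an assumption: the tokenizer, when not in gaps mode, returns the pattern's matches in much the same way).

```python
import re

tweet = '"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX'

# Same pattern as in the question; re.findall stands in for RegexpTokenizer (assumption).
pattern = r'[A-Za-z]+|^[0-9]'
tokens = re.findall(pattern, tweet)

# The last six tokens are the letter runs cut out of the URL.
print(tokens[-6:])
```

Note also that ^[0-9] can only match a single digit at the very start of the string, so that branch never fires for these tweets.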

Please help me find a regex pattern that does what I want.


1 answer

Answer 1 · posted 2024-04-25 04:46:49

This might help.

Demo:

import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""    

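def cleanString(text):
    res = []
    for i in text.strip().split():
        if re.match(r"https?://", i):   # Skip URLs. Note: only catches tokens that start with http:// or https://.
            continue
        # Drop everything that is not an ASCII letter or a dot, then turn dots into spaces
        # so that "dignity...these" becomes two words. split() discards empty leftovers such as "1234."
        res.extend(re.sub(r"[^A-Za-z.]", "", i).replace(".", " ").split())
    return " ".join(res)   # Words joined by exactly one space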

print(cleanString(s1))
print(cleanString(s2))
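Since the question asks for a regex pattern, here is an alternative sketch that uses two re.sub passes instead of a token loop. The names URL_RE, NON_LETTER_RE and clean_tweet are made up for illustration, and the URL pattern assumes that every link starts with http:// or https:// and ends at the next whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+")        # a URL runs until the next whitespace
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")   # any run of characters that are not ASCII letters

def clean_tweet(text):
    """Remove URLs first, then replace each run of non-letters with one space."""
    no_urls = URL_RE.sub(" ", text)
    letters_only = NON_LETTER_RE.sub(" ", no_urls)
    return letters_only.strip()

s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""
print(clean_tweet(s2))
```

The order matters: if the non-letter pass ran first, the URL would already be broken into letter runs and could no longer be recognized. One difference from the loop version is that apostrophes become spaces here, so "don't" turns into "don t". Accented and other non-ASCII letters are dropped as well.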
