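As part of an information retrieval project in Python (building a small search engine), I want to keep only the clean text of the tweets I downloaded (a CSV dataset of tweets, 27,000 tweets to be exact). A tweet looks like: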
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX
I want to use a regex to remove the unnecessary parts of each tweet, such as URLs, punctuation, and so on.
So the results would be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
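I tried: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]')
but the results are not what I want; for example, parts of the URL still remain in the output.
Please help me find a regex pattern that does what I want.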
This might help.
Demo:
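A minimal sketch of one way to do this, assuming only the standard-library `re` module. The names `URL_RE`, `NON_LETTER_RE`, and `clean_tweet` are made up for illustration. The idea is to strip URLs first, so that no URL fragments survive, and only then replace everything that is not an ASCII letter or whitespace. Digits are dropped too, since the expected output above omits "1234".

```python
import re

# Hypothetical helper names, for illustration only.
URL_RE = re.compile(r'https?://\S+|www\.\S+')   # whole URL up to the next whitespace
NON_LETTER_RE = re.compile(r'[^A-Za-z\s]+')     # punctuation, digits, @, dashes, quotes

def clean_tweet(text):
    """Remove URLs, then anything that is not an ASCII letter, then extra spaces."""
    text = URL_RE.sub(' ', text)         # must run first, or URL pieces would remain
    text = NON_LETTER_RE.sub(' ', text)  # replace with a space so words do not merge
    return ' '.join(text.split())        # collapse runs of whitespace

# "\u2014" is the dash that precedes @POTUS in the sample tweets.
tweet = ('"Democracy...allows us to peacefully work through our differences, '
         'and move closer to our ideals" \u2014@POTUS in Greece '
         'https://twitter.com/PIO9dG2qjX')
print(clean_tweet(tweet))
# Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece
```

Replacing with a space rather than an empty string matters for text like "dignity...these", which would otherwise collapse into a single word. Note that this pattern also discards non-ASCII letters, so it would need adjusting if the dataset contains tweets in other languages.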