Remove all unwanted characters from a Python string

Posted 2024-04-19 16:08:13


I am using the Python module newspaper3k to extract an article summary from the article's web URL. For example:

from newspaper import Article
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
print (text)

gives

[output omitted]

I just want to remove all the unwanted characters, such as \xe2\x80\x99s. I would like to avoid chaining multiple replace calls. I want something like this:

Often hailed as Hollywood long standing, commercially successful filmmaker, 
Spielberg lifetime gross, if you include his productions, reaches a 
mammoth $17.2 billion unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide), 
which released in 1993, remains Spielberg highest grossing film.
Ready Player One,currently advancing at a running total of $476.1 million, 
has become Spielberg seventh highest grossing film of his career.
It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan) 
if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC

3 Answers

First, use .encode('ascii', errors='ignore') to drop all non-ASCII characters.

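If you need this text for some kind of sentiment analysis, you may also want to remove special characters such as \n, \r, and so on. This can be done by first escaping the escape characters and then replacing them with a regex.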

from newspaper import Article
import re
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
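text = text.encode('ascii', errors='ignore')  # drop non-ASCII characters; returns bytes
text = str(text)[2:-1]  # Python 3: str() of bytes shows `\n` as `\\n`; the slice strips the b'...' wrapper
text = re.sub(r'\\.', '', text)  # remove each backslash together with the character after it
print(text)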
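The two steps above can be tried without fetching an article. The following is a minimal sketch, assuming Python 3; the helper name `to_plain_ascii` and the sample string are made up for illustration and are not part of newspaper3k. It decodes the bytes back to a str and replaces whitespace with a space rather than going through the escaped representation.

```python
import re


def to_plain_ascii(text):
    """Drop non-ASCII characters and collapse whitespace (hypothetical helper)."""
    # encode() with errors='ignore' silently discards anything outside ASCII;
    # decode() turns the resulting bytes back into a str.
    ascii_text = text.encode('ascii', errors='ignore').decode('ascii')
    # Replace each run of whitespace (newlines, tabs, spaces) with one space.
    return re.sub(r'\s+', ' ', ascii_text).strip()


# Invented sample containing U+2019 (right single quotation mark) and line breaks.
sample = "Spielberg\u2019s lifetime gross\nreaches $17.2 billion.\r\n"
cleaned = to_plain_ascii(sample)
print(cleaned)
```

Note that this only removes the apostrophe, so "Spielberg’s" becomes "Spielbergs" rather than "Spielberg" as in the desired output of the question.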

You can use Python's encode/decode to remove all non-Latin characters:

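data = text.decode('utf-8')  # assumes text is a UTF-8 encoded bytes object
text = data.encode('latin-1', 'ignore')  # drop characters outside Latin-1; the result is bytes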
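As a rough illustration of what the Latin-1 round trip keeps and what it drops, here is a sketch assuming Python 3, where the text is already a str and no initial decode is needed. The sample string and the variable names are invented.

```python
# Made-up sample: U+2019 (right single quote) is outside Latin-1,
# while U+00E9 (e with acute accent) is inside it.
sample = "Spielberg\u2019s caf\u00e9"

# errors='ignore' drops characters that Latin-1 cannot represent;
# decoding with the same codec returns a str instead of bytes.
latin_only = sample.encode('latin-1', 'ignore').decode('latin-1')
print(latin_only)
```

Accented letters survive this filter, so it is less strict than the ASCII-only approach.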

Try using a regular expression:

import re
clear_str = re.sub(r'[\xe2\x80\x99s]', '', your_input)

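re.sub replaces every occurrence of the pattern in your_input with the second argument. A character class such as [abc] matches any single one of the characters a, b, or c. Note that the class [\xe2\x80\x99s] therefore also matches every plain s on its own, not just the sequence \xe2\x80\x99s.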
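To make the difference between a character class and a literal sequence concrete, here is a small sketch. It assumes Python 3 and a str in which the apostrophe is the single character U+2019 (the bytes \xe2\x80\x99 are its UTF-8 encoding). The sample string is invented.

```python
import re

sample = "Spielberg\u2019s seventh highest grossing film"

# Character class: removes every U+2019 and also every lowercase "s".
as_class = re.sub("[\u2019s]", "", sample)

# Literal sequence: removes only U+2019 immediately followed by "s".
as_sequence = re.sub("\u2019s", "", sample)

print(as_class)
print(as_sequence)
```

The second pattern produces the form shown in the question's desired output, while the first also strips letters from ordinary words.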
