Remove all potentially unwanted characters from a Python string

Often hailed as Hollywood long standing, commercially successful filmmaker, Spielberg lifetime gross, if you include his productions, reaches a mammoth $17.2 billion unadjusted for inflation. The original Jurassic Park ($983.8 million worldwide), which released in 1993, remains Spielberg highest grossing film. Ready Player One,currently advancing at a running total of $476.1 million, has become Spielberg seventh highest grossing film of his career. It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan) if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC

3 Answers

User

#1 · Edited 2024-04-19 16:08:13

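First, use .encode('ascii', errors='ignore') to drop all non-ASCII characters.

If you need this text for something like sentiment analysis, you may also want to remove special characters such as \n and \r. Decode the result back to a str first (encode() returns bytes, and calling str() on bytes would leave a b'...' wrapper and escaped sequences in the text), then replace those characters with a regex.

from newspaper import Article
import re

article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
# Drop non-ASCII characters, then decode the bytes back to str
text = text.encode('ascii', errors='ignore').decode('ascii')
# Replace runs of newlines, carriage returns and tabs with a single space
text = re.sub(r'[\r\n\t]+', ' ', text)
print(text)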
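
The same cleanup can be tried without downloading an article. The following is a minimal sketch, assuming the input is already a Python 3 str; the helper name strip_non_ascii and the sample sentence are made up for illustration and are not part of the original answer.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Drop non-ASCII characters, then collapse whitespace runs into one space."""
    # encode() returns bytes, so decode back to str before using str regexes
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # \s+ covers \n, \r, \t and ordinary spaces
    return re.sub(r"\s+", " ", ascii_text).strip()


# Hypothetical sample containing a curly apostrophe (U+2019) and line breaks
sample = "Spielberg\u2019s lifetime gross,\nif you include his productions\r\n"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Note that the curly apostrophe is removed rather than replaced, so "Spielberg’s" becomes "Spielbergs". If that matters for your analysis, map such characters to ASCII equivalents before encoding.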

User

#2 · Edited 2024-04-19 16:08:13

You can use Python's encode/decode to remove all non-Latin characters:

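# Python 3: text is a str; if you start from raw bytes, call raw.decode('utf-8') first
# Characters outside Latin-1 are dropped, then the bytes are decoded back to str
text = text.encode('latin-1', 'ignore').decode('latin-1')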
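
To see what the Latin-1 round trip keeps and what it drops, here is a small sketch, assuming Python 3 and a str input. The helper name keep_latin1 and the sample string are hypothetical.

```python
def keep_latin1(text: str) -> str:
    """Keep only characters representable in Latin-1 (U+0000 to U+00FF)."""
    return text.encode("latin-1", errors="ignore").decode("latin-1")


# Hypothetical sample: an accented letter (kept), a curly apostrophe (U+2019, dropped)
# and a CJK character (dropped)
sample = "Caf\u00e9 owner\u2019s note: 1.29 billion yuan (\u5143)"
result = keep_latin1(sample)
print(result)
```

Unlike the ASCII approach, accented Western European letters such as é survive, which may or may not be what you want.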

User

#3 · Edited 2024-04-19 16:08:13

Try using a regular expression:

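import re
# No `s` inside the class: a character class matches single characters,
# so including `s` would delete every letter s in the input
clear_str = re.sub(r'[\xe2\x80\x99]', '', your_input)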

re.sub replaces every occurrence of the pattern in your_input with the second argument. A character class such as [abc] matches a, b, or c.
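
Because a character class matches one character at a time rather than a sequence, any extra letter placed inside the brackets is removed everywhere in the input. The sketch below illustrates this, assuming a Python 3 str that contains a real curly apostrophe (U+2019); the sample sentence and variable names are made up for illustration. It also shows a negated class, which is a common way to strip everything outside ASCII in one pass.

```python
import re

your_input = "Spielberg\u2019s seventh highest grossing film"

# A class containing both the apostrophe and `s` deletes every letter s as well
too_greedy = re.sub(r"[\u2019s]", "", your_input)

# A class containing only the apostrophe removes just that character
only_quote = re.sub(r"[\u2019]", "", your_input)

# Negated class: remove any character outside the ASCII range
ascii_only = re.sub(r"[^\x00-\x7F]", "", your_input)

print(too_greedy)
print(only_quote)
print(ascii_only)
```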
