I am using the Python module newspaper3k to extract an article summary from its web URL:
from newspaper import Article

# Download and parse the article, then run the NLP step,
# which populates article.summary.
article = Article('https://www.abcd....vnn.com/dhdhd')
article.download()
article.parse()
article.nlp()
text = article.summary
print(text)
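This gives a summary that contains unwanted characters such as \xe2\x80\x99s. I just want to remove all of them, and I want to avoid using multiple replace calls. I just want something like this: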
Often hailed as Hollywood long standing, commercially successful filmmaker,
Spielberg lifetime gross, if you include his productions, reaches a
mammoth $17.2 billion unadjusted for inflation.
The original Jurassic Park ($983.8 million worldwide),
which released in 1993, remains Spielberg highest grossing film.
Ready Player One,currently advancing at a running total of $476.1 million,
has become Spielberg seventh highest grossing film of his career.
It will eventually supplant Aamir 2017 blockbuster Dangal (1.29 billion yuan)
if it achieves the Maoyan lifetime forecast of 1.31 billion yuan ($208 million) in the PRC
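First, use .encode('ascii', errors='ignore') to drop all non-ASCII characters.

If you need this text for some kind of sentiment analysis, you may also want to remove special characters such as \n, \r, and so on. This can be done by first escaping the escape characters and then replacing them with the help of a regex.

You can use Python's encode/decode to remove all non-Latin characters.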
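As a rough illustration of the encode/decode suggestion, here is a minimal sketch. The helper names to_ascii and collapse_whitespace and the sample string are made up for this example; they do not come from newspaper3k. Note that instead of escaping the escape characters first, the sketch simply matches whitespace with \s+, which is a simplification on my part and assumes the text holds real newline characters rather than a literal backslash followed by n.

```python
import re


def to_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards unencodable characters;
    # decoding turns the resulting bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace (newlines, carriage returns, tabs) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample containing a right single quotation mark (U+2019).
sample = "Spielberg\u2019s lifetime gross\r\nreaches a mammoth $17.2 billion."
cleaned = collapse_whitespace(to_ascii(sample))
print(cleaned)
```

Be aware that this only drops the quotation mark itself, so "Spielberg’s" becomes "Spielbergs" rather than "Spielberg". If the trailing "s" should go as well, that needs a separate rule.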
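Alternatively, try a regular expression. re.sub replaces every occurrence of the pattern in your_input with the second argument. A character class such as [abc] matches a, b, or c.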