Keeping formatting (newlines) when removing stop words with NLTK

Posted 2024-04-25 12:49:40


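I am using NLTK to remove stopwords from a file. The file is a series of tweets separated by newlines. Stop word removal works, but it also strips the newline characters, so the output is no longer one tweet per line. Here is my code: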

import codecs

stuff = codecs.open("/Users/user/Desktop/ngrms/Nonsrcstic.txt", "r", encoding="utf-8")
word_list = stuff.readlines()
[x.encode('utf-8') for x in word_list]

f = open('english')
stops = f.read()

for line in word_list:
    for w in line.split('\n'):
        if w.lower() not in stops:
            with open("nostops_Nonsrcstic.txt", "a") as tweetsNoStops:
                tweetsNoStops.write(w.encode('utf-8') + " ")

The input file looks like this:

 Baby boomers are now at the age where "work or retire" is frequently considered choice. 
 There's a few people I miss but the truth of the matter is, my name probably hasn't crossed their minds or they don't give a shit about me 
 What you must remember is, I do yarn shows with the help of a Fiat Panda and Tatiana, the trailer, which is small #itfitsbehindaPanda  
 @BetBright The AP boost won't work lads says try again later is there a problem with the site?

The output looks like this:

Baby boomers age "work retire" frequently considered choice.  There's people miss truth matter is, name probably hasn't crossed minds don't give shit must remember is, yarn shows help Fiat Panda Tatiana, trailer, small #itfitsbehindaPanda @BetBright AP boost won't work lads says try later problem site?
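
For reference, here is a minimal sketch of one way to keep one tweet per line. It is an illustration under stated assumptions, not a confirmed fix for the code above. The idea is to filter the words of each line separately, join the surviving words with spaces, and write a single newline after each tweet. The names `STOPS`, `strip_stopwords`, and `filter_tweets` are hypothetical. The small stop word set is a stand-in so the example runs without downloads; with NLTK's stopwords corpus installed, `set(nltk.corpus.stopwords.words("english"))` could be used instead. A set is used rather than the raw string returned by `f.read()`, because `in` on a string is a substring test and would also match fragments of longer words.

```python
# Hypothetical stand-in for the real stop word list (assumption: the real
# list would come from the 'english' file or from NLTK's stopwords corpus).
STOPS = {"a", "are", "at", "is", "now", "or", "the", "where"}


def strip_stopwords(line, stops):
    """Return one tweet with its stop words removed.

    str.split() with no argument splits on any whitespace, so the
    trailing newline is discarded here and re-added by the caller.
    """
    kept = [w for w in line.split() if w.lower() not in stops]
    return " ".join(kept)


def filter_tweets(lines, stops):
    """Yield one filtered tweet per input line."""
    for line in lines:
        yield strip_stopwords(line, stops)


# Two of the sample tweets from the question, as readlines() would return them.
tweets = [
    ' Baby boomers are now at the age where "work or retire" is frequently considered choice. \n',
    " @BetBright The AP boost won't work lads says try again later is there a problem with the site?\n",
]

# One newline per tweet, so the output stays one tweet per line.
result = "\n".join(filter_tweets(tweets, STOPS)) + "\n"
print(result, end="")
```

When writing to a file, the same approach would open the output file once, before the loop, and write `strip_stopwords(line, stops) + "\n"` for each input line, rather than reopening the file in append mode for every word.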
