Stop word removal does not work correctly

Posted 2024-04-20 07:58:25


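Any idea why my stop word removal does not work correctly? It replaces content in the wrong places: for example, removing a also strips the a out of an or say, and it's is not treated as a single word.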

import re

# Load the stop word list, one word per line
stop_words=open("stopwords.txt")
stop_words=stop_words.read().split("\n")
print stop_words
for line in splitted_tweets:
    #print line
    #print "***************************************"
    if "text='" in line:
        start_index=line.index("text='")+6
        end_index=line.index("',", start_index)
        tweet=line[start_index:end_index]
        print tweet
        print "**********"
        tweet_words = re.sub("[^\w]", " " , tweet).split()
        print tweet_words
        for word in stop_words:
            if word in tweet_words:
                print word
                tweet=tweet.replace(word, "")

        print "?????????????????????????"
        print tweet

Here is some sample output:

['RT', 'sayingsforgirls', 'Do', 'not', 'touch', 'MY', 'iPhone', 'It', 's', 'not', 'an', 'usPhone', 'it', 's', 'not', 'a', 'wePhone', 'it', 's', 'not', 'an', 'ourPhone', 'it', 's', 'an', 'iPhone']
a
an
it
not
?????????????????????????
RT @syingsforgirls: Do  touch MY iPhone. It's  n usPhone, 's   wePhone, 's  n ourPhone, 's n iPhone.
Do not touch MY iPhone. It's not an usPhone, it's not a wePhone, it's not an ourPhone, it's an iPhone.
**********
['Do', 'not', 'touch', 'MY', 'iPhone', 'It', 's', 'not', 'an', 'usPhone', 'it', 's', 'not', 'a', 'wePhone', 'it', 's', 'not', 'an', 'ourPhone', 'it', 's', 'an', 'iPhone']
a
an
it
not
?????????????????????????
Do  touch MY iPhone. It's  n usPhone, 's   wePhone, 's  n ourPhone, 's n iPhone.
RT @BrianaaSymonee: she says imma dog, but it takes one to know one...
**********
['RT', 'BrianaaSymonee', 'she', 'says', 'imma', 'dog', 'but', 'it', 'takes', 'one', 'to', 'know', 'one']
but
it
she
to
?????????????????????????
RT @BrianaaSymonee:  says imma dog,   takes one  know one...
she says imma dog, but it takes one to know one...
**********
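
The output above follows from three things in the posted code. First, tweet.replace(word, "") removes every occurrence of the substring, not just whole words, so removing a also turns an into n and sayingsforgirls into syingsforgirls. Second, re.sub("[^\w]", " ", tweet) replaces the apostrophe with a space, which splits it's into it and s. Third, the comparison is case-sensitive, so It is never matched by it. One possible way around all three is to tokenize once, keep contractions together, and filter whole tokens. The following is only a minimal sketch, written for Python 3 (the question uses Python 2 print statements). STOP_WORDS, TOKEN_RE and remove_stop_words are hypothetical names, and the word set is a stand-in because the contents of stopwords.txt are not shown.

```python
import re

# Hypothetical stop word set for illustration; the question loads its own
# list from stopwords.txt, whose contents are not shown.
STOP_WORDS = {"a", "an", "it", "it's", "not", "but", "she", "to"}

# A token is a run of word characters, optionally followed by an
# apostrophe and more word characters, so "it's" stays one token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def remove_stop_words(tweet, stop_words=STOP_WORDS):
    """Return the tweet's tokens with stop words dropped (case-insensitive)."""
    tokens = TOKEN_RE.findall(tweet)
    # Compare whole tokens, so "a" can never match inside "an" or "say".
    return [tok for tok in tokens if tok.lower() not in stop_words]


tweet = "Do not touch MY iPhone. It's not an usPhone, it's an iPhone."
print(" ".join(remove_stop_words(tweet)))
# prints: Do touch MY iPhone usPhone iPhone
```

Note that this version discards punctuation and the @ prefix of mentions, because it rebuilds the text from tokens. If the original punctuation has to survive, a regular expression with \b word boundaries around each stop word is another option.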
