Removing stop words without using the nltk library

Asked on 2025-04-18 12:15

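I want to remove common, low-information words (stop words) from a text file without using the nltk library. I have three text files: f1, f2, and f3. f1 contains text, one line at a time; f2 contains the list of stop words; f3 is an empty file.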

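My idea is to read f1 line by line and check each word against the stop word list in f2. If a word is not in the stop word list, it gets written to f3. In the end, f3 should contain the same text as f1, but with the stop words removed from every line.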

f1 = open("file1.txt","r")
f2 = open("stop.txt","r")
f3 = open("file2.txt","w")

for line in f1:
    words = line.split()
    for word in words:
        t=word

for line in f2:
    w = line.split()
    for word in w:
        t1=w
        if t!=t1:
            f3.write(word)

f1.close()
f2.close()
f3.close()

This code is wrong. Can anyone modify it so that it does the job?

Thanks in advance.

Tags: text processing, text files, file writing, stop words, line-by-line reading, custom algorithm, word checking

3 Answers

What I would do is loop over the stop word list (f2) and add each word to a list in your script. For example:

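stoplist = []
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
file3 = open('f3.txt', 'a')  # append mode: output is added to the end of the file
# Collect every stop word from file2
for line in file2:
    w = line.split()
    for word in w:
        stoplist.append(word)
# Copy file1 to file3 line by line, skipping stop words
for line in file1:
    kept = []
    for word in line.split():
        if word in stoplist:
            continue
        kept.append(word)
    # Put the spaces and the line break back so the words do not run together
    file3.write(' '.join(kept) + '\n')
file1.close()
file2.close()
file3.close()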
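
The same idea can also be written with a set and `with` blocks. This is only a sketch under a few assumptions: the helper names (`load_stop_words`, `filter_line`, `remove_stop_words`) are made up for illustration, the files are assumed to be UTF-8, and matching is exact, so "The" or "the," would not match the stop word "the".

```python
def load_stop_words(path):
    """Read a whitespace-separated stop word file into a set."""
    with open(path, "r", encoding="utf-8") as handle:
        return set(handle.read().split())


def filter_line(line, stop_words):
    """Return the line with every stop word dropped, joined by single spaces."""
    return " ".join(word for word in line.split() if word not in stop_words)


def remove_stop_words(text_path, stop_path, out_path):
    """Copy text_path to out_path line by line, omitting stop words."""
    stop_words = load_stop_words(stop_path)
    # The with block closes both files even if an error occurs midway
    with open(text_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(filter_line(line, stop_words) + "\n")
```

With the file names from the question, the call would be `remove_stop_words("f1.txt", "f2.txt", "f3.txt")`. A set makes each membership check roughly constant time, which matters once the stop word list grows to a few hundred entries.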

Answered on 2025-04-18 by Python大师


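Your first loop is wrong: the statement for word in words: t=word assigns each word to t in turn, so t ends up holding only the last word instead of all of them. words is already a list that you can work with directly. Also, if your file has more than one line, you need to collect the words from every line, otherwise your list will not contain all of them. Something like this: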

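f1 = open("a.txt", "r")
f2 = open("b.txt", "r")
f3 = open("c.txt", "w")
first_words = []
second_words = []
# Collect every word from the text file
for line in f1:
    words = line.split()
    for w in words:
        first_words.append(w)

# Collect every stop word
for line in f2:
    w = line.split()
    for i in w:
        second_words.append(i)

# Build a new list rather than removing items from first_words
# while looping over it, which would skip elements
kept_words = []
for word1 in first_words:
    if word1 not in second_words:
        kept_words.append(word1)

# Write the remaining words separated by spaces
for word in kept_words:
    f3.write(word)
    f3.write(' ')

f1.close()
f2.close()
f3.close()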
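
One thing to watch for when filtering a list like this: calling remove() on a list while looping over that same list shifts the later items to the left, so the element right after each removal is skipped. A small illustration with made-up sample words (the variable names here are only for the demonstration):

```python
words = ["a", "the", "the", "cat"]
stop_words = ["the"]

# Pitfall: removing from the list being iterated shifts later items left,
# so the loop skips the element that follows each removal.
buggy = list(words)
for word in buggy:
    if word in stop_words:
        buggy.remove(word)

# Safer: build a new list and leave the original untouched.
kept = [word for word in words if word not in stop_words]

print(buggy)  # ['a', 'the', 'cat'] (one stop word survives)
print(kept)   # ['a', 'cat']
```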

Answered on 2025-04-18 by Python大师


You can use the sed tool on Linux to remove stop words.

sed -f <(sed 's/.*/s|\\\<&\\\>||g/' stopwords.txt) all_lo.txt > all_remove1.txt
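
For comparison, a rough Python approximation of that command, using the standard `re` module with `\b` word boundaries in place of sed's `\<` and `\>`. This is a sketch rather than an exact port: the function names are invented for illustration, and unlike the sed version it escapes regex metacharacters in the stop words. Like the sed command, it deletes the words but leaves the surrounding spaces behind.

```python
import re


def build_stop_pattern(stop_words):
    """Compile one regex that matches any stop word as a whole word."""
    # Longest words first, so a longer stop word wins over a shorter prefix
    ordered = sorted(stop_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def strip_stop_words(text, pattern):
    """Delete every whole-word match; surrounding whitespace is kept."""
    return pattern.sub("", text)


pattern = build_stop_pattern({"the", "on"})
result = strip_stop_words("the cat sat on the mat", pattern)
# Collapse the leftover runs of spaces if a tidy line is wanted
tidy = " ".join(result.split())
```

Because the match is anchored at word boundaries, words that merely contain a stop word, such as "other" or "theme", are left alone.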

Answered on 2025-04-18 by Python大师

