Filtering lines in a file with stopwords in Python

filtercase = ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')
out = []
ins = open("data.txt", "r")
for line in ins:
    for k in filtercase:
        if(not(line.startswith(k))):
            out.append(line)

3 answers

User

Answer 1 · edited 2024-04-24 07:17:02

The following approach should work.

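with open('data.txt', 'r') as ins:
    # Keep only the lines that start with none of the entries in filtercase.
    # A list comprehension yields a real list; in Python 3, filter() returns a lazy iterator.
    out = [line for line in ins if not any(line.startswith(sw) for sw in filtercase)]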
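Because `str.startswith` also accepts a tuple of prefixes, the same filter can be sketched without an inner loop. The helper name `drop_prefixed` and the sample lines below are made up for illustration; `filtercase` is rebuilt here to mirror the tuple from the question.

```python
import string

# Same 26 lowercase letters as the question's filtercase tuple.
filtercase = tuple(string.ascii_lowercase)

def drop_prefixed(lines, prefixes):
    """Return the lines that do not start with any of the given prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [line for line in lines if not line.startswith(prefixes)]

# Illustrative input standing in for the contents of data.txt.
sample = ["apple\n", "Banana\n", "cherry\n", "42 is a number\n"]
kept = drop_prefixed(sample, filtercase)
```

With a real file, the open file object can be passed directly, for example `drop_prefixed(ins, filtercase)` inside the `with` block. Note that `prefixes` must be a tuple (not a list) for `startswith` to accept it.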

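User

Answer 2 · edited 2024-04-24 07:17:02

This solution uses regular expressions and keeps only lines that start with an uppercase letter and contain none of the words in stopwords. Note that the stopword check is a substring test: if one of the stop words is 'me', for example, a line containing 'messenger' will be rejected as well.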

import re

out = []
stopwords = ['no', 'please', 'dont']
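# A line that starts with an uppercase letter cannot start with a lowercase one,
# so a single pattern is enough.
upper = re.compile('^[A-Z]')
with open('data.txt') as ifile:
    for line in ifile:
        # Keep lines that start with an uppercase letter and contain no stopword
        if upper.match(line) and not any(word in line for word in stopwords):
            out.append(line)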
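The substring caveat can be checked directly. The sketch below contrasts the `word in line` test with a whole-word match built on `\b`; the names `stopword_re`, `has_stopword_substring`, and `has_stopword_word`, as well as the sample line, are illustrative assumptions and not part of the answer.

```python
import re

stopwords = ['no', 'please', 'dont']

# Whole-word alternative: \b anchors each stopword at word boundaries.
stopword_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b')

def has_stopword_substring(line):
    # Substring test: true even when a stopword is part of a longer word.
    return any(word in line for word in stopwords)

def has_stopword_word(line):
    # Stricter test: true only when a stopword appears as a separate word.
    return stopword_re.search(line) is not None

line = "Nothing to know here"
substring_hit = has_stopword_substring(line)  # 'no' occurs inside 'know'
word_hit = has_stopword_word(line)            # no standalone stopword in the line
```

Whether the whole-word behavior is actually wanted depends on the data; both tests are case-sensitive as written.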

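User

Answer 3 · edited 2024-04-24 07:17:02

The original code iterates over every letter in filtercase and, for each letter, appends the line to the output list if the line does not start with that letter. As a result, every line is appended many times: for a line not to be appended to out at all, it would have to start with 'a' and 'b' and 'c' and every other letter in the filter list at once, which is impossible.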
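The duplication is easy to confirm by counting. This is a small sketch that reproduces the question's loop on two made-up lines; the function name `buggy_filter` and the sample input are hypothetical.

```python
import string

filtercase = tuple(string.ascii_lowercase)

def buggy_filter(lines):
    """Reproduce the question's loop: one append per non-matching letter."""
    out = []
    for line in lines:
        for k in filtercase:
            if not line.startswith(k):
                out.append(line)
    return out

result = buggy_filter(["apple\n", "Banana\n"])
apple_count = result.count("apple\n")    # matches only 'a', so 25 appends
banana_count = result.count("Banana\n")  # matches no lowercase letter, so 26 appends
```

So a line that should have been dropped still shows up 25 times, and a line that should be kept shows up 26 times.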

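Instead, you need to iterate over filtercase and look for one instance of k for which line.startswith(k) is true. If the line starts with any of the entries in filtercase, do not append it; if the loop gets through the whole list without the line starting with any entry, append it.

Python's for-else syntax is very useful for this kind of check over a list of elements: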

out = []

with open('data.txt', 'r') as ins:
    for line in ins:
        for k in filtercase:
            if line.startswith(k): # If line starts with any of the filter words
                break # Else block isn't executed.
        else: # Line doesn't start with any filter word, so append it to out
            out.append(line)
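The for-else loop amounts to asking whether any prefix matches. As a rough sketch, the two forms can be compared side by side; the function names `filter_for_else` and `filter_any`, the shortened `prefixes` tuple, and the sample lines are all illustrative.

```python
def filter_for_else(lines, prefixes):
    out = []
    for line in lines:
        for k in prefixes:
            if line.startswith(k):
                break          # a prefix matched: the else block is skipped
        else:
            out.append(line)   # loop finished without break: keep the line
    return out

def filter_any(lines, prefixes):
    # any() stops at the first matching prefix, much like the break above.
    return [line for line in lines
            if not any(line.startswith(k) for k in prefixes)]

prefixes = ('a', 'b', 'c')
sample = ["apple\n", "Zebra\n", "cat\n", "dog\n"]
kept_loop = filter_for_else(sample, prefixes)
kept_any = filter_any(sample, prefixes)
```

Both versions keep each surviving line exactly once; which one reads better is largely a matter of taste.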
