Comparing a word list against itself by the first 4 characters

from collections import Counter
import pandas as pd

mylist = list()
with open('test.txt', 'r') as f:
    for i in f.readlines():
        mylist.append(i[:4])
myn = Counter(mylist)
mys = pd.Series(myn)
myindex = list(mys[mys > 2].index)
newlist = list()
for x in myindex:
    with open('test.txt', 'r') as f:
        for i in f.readlines():
            if x == i[:4]:
                newlist.append(i)

3 Answers

User

#1 · Edited 2024-05-23 22:21:51

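The main bottleneck in your code is that it reads the file repeatedly: once to count the prefixes, and then once more for every prefix in myindex. For a large file, this takes at least twice as long as a single pass.

If you can hold the entire contents of the file in memory, I would do something like the following (a previous answer already suggested this, but using defaultdict):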

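Words = dict()
with open('test.txt', 'r') as File:
    for line in File:
        key = line[:4]  # group lines by their first 4 characters
        if key in Words:  # membership test avoids a KeyError on a new key
            Words[key].append(line)
        else:
            Words[key] = [line]
# Keep only the groups that contain more than 2 lines
Output = []
for key, items in Words.items():
    if len(items) > 2:
        Output.extend(items)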
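
For comparison, here is a minimal sketch of the defaultdict variant mentioned above. The function name `group_by_prefix` and the parameters `prefix_len` and `min_count` are hypothetical, chosen only for illustration; it accepts any iterable of lines, so an open file object can be passed directly.

```python
from collections import defaultdict

def group_by_prefix(lines, prefix_len=4, min_count=3):
    """Return the lines whose prefix is shared by at least min_count lines."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for line in lines:
        groups[line[:prefix_len]].append(line)
    output = []
    for items in groups.values():
        if len(items) >= min_count:  # same condition as len(items) > 2
            output.extend(items)
    return output
```

With the file from the question, this could be called as `group_by_prefix(open('test.txt'))`, ideally inside a `with` block so the file is closed afterwards.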

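If you cannot hold the contents in memory, you will be forced to read the file a second time. One option is to store the line numbers in the dictionary and, on the second read, output only the lines whose numbers were stored: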

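Words = dict()
with open('test.txt', 'r') as File:
    for i, line in enumerate(File):
        key = line[:4]
        if key in Words:  # membership test avoids a KeyError on a new key
            Words[key].append(i)  # store the line number, not the line
        else:
            Words[key] = [i]
# Collect the line numbers of every group with more than 2 lines
LineNumbers = set()
for key, items in Words.items():
    if len(items) > 2:
        LineNumbers.update(items)
# Second pass: keep only the lines whose numbers were stored
Output = []
with open('test.txt', 'r') as File:
    for i, line in enumerate(File):
        if i in LineNumbers:
            Output.append(line)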
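
A different two-pass technique, not the one shown above, is to keep only a count per prefix instead of a list of line numbers. The sketch below assumes the number of distinct prefixes is small enough to fit in memory; the function name `filter_by_prefix_two_pass` and its parameters are hypothetical.

```python
from collections import Counter

def filter_by_prefix_two_pass(path, prefix_len=4, min_count=3):
    """Return lines whose prefix occurs in at least min_count lines of the file."""
    # First pass: count prefixes only; no lines or line numbers are stored.
    with open(path, 'r') as f:
        counts = Counter(line[:prefix_len] for line in f)
    # Second pass: keep the lines whose prefix was frequent enough.
    output = []
    with open(path, 'r') as f:
        for line in f:
            if counts[line[:prefix_len]] >= min_count:
                output.append(line)
    return output
```

The result keeps the lines in file order. For a very large file, the second loop could write each line out immediately instead of appending it to a list.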

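Note: if you call File.readlines(), the whole file is held in memory as a list for the lifetime of the for loop that iterates over it. If you iterate line by line with "for line in File" instead, I believe each line is read on demand as the loop advances.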
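
The difference can be illustrated with a small sketch. Here `io.StringIO` stands in for an open text file so that the example is self-contained; the sample text is made up.

```python
import io

# io.StringIO behaves like an open text file for this purpose.
sample = io.StringIO("test\ntested\ntests\n")

all_lines = sample.readlines()  # builds one list holding every line
sample.seek(0)                  # rewind so the data can be read again

first_line = next(sample)       # iteration hands back one line at a time
remaining = sample.readlines()  # only the lines not yet consumed
```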

User

#2 · Edited 2024-05-23 22:21:51

awk '
{
     n = substr($0,1,4);
     c[n]++;
     w[n] = (length(w[n]) ? w[n]"\n" : "") $0
}
END{ for (n in c) if (c[n] > 2) print w[n] }'

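n = substr(...) - extract the first 4 characters; this is our index
c[n]++ - keep a count for that index
w[n] = ... - remember the words for that index, separated by newlines
for (n in c) if (c[n] > 2) print w[n] - for each index, if its count is greater than 2, print the stored words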

User

#3 · Edited 2024-05-23 22:21:51

Using GNU awk for arrays of arrays, and assuming you want a count of unique words:

$ cat tst.awk
{
    key = substr($0,1,4)
    words[key][$0]
}
END {
    for ( key in words ) {
        if ( length(words[key]) > 2 ) {
            for ( word in words[key] ) {
                print word
            }
        }
    }
}

$ awk -f tst.awk file
tested
tests
testing
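
For readers working in Python, the same unique-word idea can be sketched with a set per prefix. This is an illustrative equivalent, not a translation endorsed by the answer; the function name `unique_words_by_prefix` and its parameters are hypothetical. Unlike awk's `for (word in ...)`, whose order is unspecified, this sketch sorts each group so the output is deterministic.

```python
from collections import defaultdict

def unique_words_by_prefix(lines, prefix_len=4, min_unique=3):
    """Return unique words whose prefix is shared by at least min_unique distinct words."""
    words = defaultdict(set)  # prefix -> set of distinct words
    for line in lines:
        word = line.rstrip("\n")  # drop the trailing newline, as awk does for $0
        words[word[:prefix_len]].add(word)
    result = []
    for group in words.values():
        if len(group) >= min_unique:  # same condition as length(words[key]) > 2
            result.extend(sorted(group))
    return result
```

Because duplicates collapse inside the set, a word repeated three times counts only once toward the threshold.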
