Displaying bigrams of (tokenized) text with python3 causes a broken pipe

Posted 2024-04-24 16:58:02


My code:

import sys

def byFreq(pair):
    return pair[1]

def main():

    bigrams = {}

    for line in sys.stdin:

        line = line.lower()
        words = line.split()

        for i in range (len(words)-1):

            bigram = (words[i],words[i+1])
            bigrams[bigram] = bigrams.get(bigram,0) + 1

    bigrams = list(bigrams.items())
    bigrams.sort(key=byFreq, reverse=True)

    for i in range(len(bigrams)):
        bg, count = bigrams[i]
        print("{0:<15}{1:<15}{2:>5}" .format(bg[0], bg[1], count))


if __name__ == "__main__":
    main()

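I want to be able to use the script on the command line with python3, for example: cat myfile.txt | python3 bigrams.py | head -5

Running the file this way produces the following output (in the macOS Terminal):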

van            de                25
in             de                14
aan            de                10
in             het                9
de             regering           9
Traceback (most recent call last):
  File "bigram.py", line 37, in <module>
    main()
  File "bigram.py", line 33, in main
    print("{0:<15}{1:<15}{2:>5}" .format(bg[0], bg[1], count))
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe

It does print the 5 lines, but it also raises a broken pipe error. This can be worked around with:

import signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)

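However, this does not seem like a good way to get rid of the error. Is there a better way?

Also, is there a better way to get the bigrams as output?

I hope someone can help me.

Cheers, Thijmen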


1 answer
User
#1 · Posted 2024-04-24 16:58:02

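You can redirect stderr to stdout and then take the top lines with head. This hides the traceback rather than preventing the BrokenPipeError, because the error message is written into the same pipe that head has already closed: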

cat myfile.txt | python3 bigrams.py 2>&1 | head -5
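
An alternative that deals with the error inside the script, instead of hiding it, is to catch BrokenPipeError. The sketch below follows the pattern described in the "Note on SIGPIPE" in the Python signal module documentation, which also advises against setting SIGPIPE to SIG_DFL just to avoid this error. The helper name write_lines and the demo rows are assumptions made for illustration; in bigrams.py the rows would be the formatted bigram lines.

```python
import os
import sys


def write_lines(lines, out):
    """Write lines to `out`; return False if the reader closed the pipe early."""
    try:
        for line in lines:
            out.write(line + "\n")
        # Flush inside the try block so a closed pipe is detected here
        # rather than during interpreter shutdown.
        out.flush()
    except BrokenPipeError:
        return False
    return True


def main():
    # Hypothetical demo data standing in for the formatted bigram rows.
    rows = ("row {}".format(i) for i in range(10000))
    if not write_lines(rows, sys.stdout):
        # Python flushes stdout once more at exit; point the descriptor at
        # os.devnull so that final flush cannot raise a second BrokenPipeError.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)


if __name__ == "__main__":
    main()
```

With this in place, piping the script into head -5 should print the five lines and end without a traceback, although the exit status is still nonzero.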

I suggest passing the input filename and the number of lines to print to standard output as command-line arguments:

import sys

def byFreq(pair):
    return pair[1]

def main():
    bigrams = {}
    # pass the input filename as the first argument
    ifilename = sys.argv[1]
    # pass the number of lines to print as the second argument
    show_top_n = int(sys.argv[2])

    # read the file line by line; the with-block closes it afterwards
    with open(ifilename, "r") as infile:
        for line in infile:
            line = line.lower()
            words = line.split()

            for i in range(len(words) - 1):
                bigram = (words[i], words[i + 1])
                bigrams[bigram] = bigrams.get(bigram, 0) + 1

    bigrams = list(bigrams.items())
    bigrams.sort(key=byFreq, reverse=True)

    # slicing avoids an IndexError when there are fewer than show_top_n bigrams
    for bg, count in bigrams[:show_top_n]:
        print("{0:<15}{1:<15}{2:>5}".format(bg[0], bg[1], count))


if __name__ == "__main__":
    main()

You can run it like this:

python3 bigrams.py myfile.txt 5
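
On the question of a better way to get the bigrams, one option is to let collections.Counter do the counting and sorting. This is only a sketch of that idea, not the only way to do it; the function names count_bigrams and format_top and the sample sentences are hypothetical. Like the original code, it pairs words within a line only, so bigrams that span a line break are not counted.

```python
from collections import Counter


def count_bigrams(lines):
    """Count adjacent word pairs within each line, ignoring case."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        # zip(words, words[1:]) pairs each word with its successor.
        counts.update(zip(words, words[1:]))
    return counts


def format_top(counts, n):
    """Return the n most frequent bigrams as formatted rows."""
    return [
        "{0:<15}{1:<15}{2:>5}".format(first, second, count)
        for (first, second), count in counts.most_common(n)
    ]


if __name__ == "__main__":
    # Hypothetical sample text; replace with sys.stdin or an open file.
    sample = ["van de regering", "Van de stad", "in de stad"]
    for row in format_top(count_bigrams(sample), 2):
        print(row)
```

Because any iterable of lines works, count_bigrams(sys.stdin) or count_bigrams(open_file) can replace the sample list, and most_common(n) removes the need for the separate sort and the byFreq helper.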
