确保每一行都以标点符号结尾

2024-04-20 01:13:23 发布

您现在位置:Python中文网/ 问答频道 /正文

我从nltk中获取了文本语料库,现在想对其进行处理,以确保文件中的每一行都以标点符号结束。你知道吗

Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

应该变成:

Her mother had died too long ago for her to remember her caresses; 
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

我试着匹配sed,如果在行尾没有标点符号,但不知道如何向上移动下一行。如果有任何帮助,我将不胜感激!你知道吗


Tags: andtoforplaceagolongtooremember
3条回答

如果你这样用pastesed会怎么样?你知道吗

paste打印同一行中的所有文本。你知道吗

$ paste -s -d' ' file
Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

sed在每个.;之后添加新行。你知道吗

$ paste -s -d' ' file | sed -r 's/(\.|\;) /\1\n/g'
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

在Python中:

import string # for string.punctuation

with open("path/to/file") as f:
    output = ""
    for line in f:
        sanitized = line.strip()
        output += sanitized
        if sanitized[-1] in string.punctuation:
            output += "\n"

with块终止后,output将成为预期的文件。如果需要保持这种状态,可以用output覆盖该文件。你知道吗

使用NLTK的sent_tokenize()

>>> from nltk import sent_tokenize
>>> text = """Her mother
... had died too long ago for her to
... remember her caresses; and her place had been supplied
... by an excellent woman as governess, who had fallen little short
... of a mother in affection."""
>>> sent_tokenize(text.replace("\n", " "))
['Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.']

相关问题 更多 >