How to tokenize a line of text from a file

Asked 2025-04-18 10:24

Suppose the file shakespeare.txt contains only this one sentence, a line spoken by Juliet in Romeo and Juliet:

"O Romeo, Romeo! wherefore art thou Romeo?"

Running the command $ shakesort should then print the following:

art
o
romeo
thou
wherefore

The code I have so far is:

def main():
    s = Scanner("shakespeare.txt")
    tokens = ("O Romeo, Romeo! wherefore art thou Romeo?")
    str1 = s.readtoken()
    str2 = s.readtoken()
    str3 = s.readtoken()
    str4 = s.readtoken()
    str5 = s.readtoken()
    str6 = s.readtoken()
    str7 = s.readtoken()
    print(str1)
    print(str2)
    print(str3)
    print(str4)
    print(str5)
    print(str6)
    print(str7)
    s.close
    return 0;

main()

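The problem is that it returns the first 7 strings of the whole file rather than the words I specified. Is there a way to pick out those 7 words from a shakespeare.txt that contains millions of words, without creating a new file that lists only those words?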

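Tags: file handling, text processing, data extraction, natural language processing, sentence parsing, text analysis, tokenization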

1 Answer

Like this:

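    # Collect each distinct word once; a set handles de-duplication.
    uniqwords = set()
    with open('shakespeare.txt') as f:
        for ln in f:
            # Split on whitespace, then drop the punctuation seen in the sample line.
            for word in ln.split():
                word = word.replace('?', '').replace('!', '').replace(',', '').lower()
                if word:
                    uniqwords.add(word)

    # Print the words in alphabetical order, one per line.
    for word in sorted(uniqwords):
        print(word)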
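
As a variation on the loop above, here is a minimal sketch that uses a regular expression instead of chained replace() calls, so any punctuation is dropped rather than just the three characters listed. The helper name unique_sorted_words and the pattern WORD_RE are made up for illustration, and the pattern assumes that a "word" is a run of ASCII letters and apostrophes, which may not suit every text.

```python
import re

# Hypothetical pattern: a word is assumed to be a run of ASCII letters
# and apostrophes. Adjust it if the text needs different rules.
WORD_RE = re.compile(r"[A-Za-z']+")


def unique_sorted_words(lines):
    """Return the distinct lowercase words found in an iterable of lines."""
    seen = set()
    for line in lines:
        # findall skips punctuation and whitespace between words.
        seen.update(w.lower() for w in WORD_RE.findall(line))
    return sorted(seen)


# Demo on the sample sentence from the question.
sample = ["O Romeo, Romeo! wherefore art thou Romeo?"]
print("\n".join(unique_sorted_words(sample)))
```

Because the helper takes any iterable of lines, an open file object can be passed to it directly. To look only at particular lines of a large file, it should also be possible to pass a generator such as (ln for ln in f if "wherefore" in ln), which filters while reading and does not need a second file.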

Answered 2025-04-18 by Python大师
