Python - Best code to get the five words before and after a given point in a line

3 votes

5 answers

647 views

Asked 2025-04-16 14:20

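I'm trying to write code that finds the 5 words on either side of a particular phrase. That sounds simple, but I have to run it over a very large amount of data, so the code needs to be efficient!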

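for file in listing:
    # Close each corpus file as soon as it has been processed.
    with open('//home/user/Documents/Corpus/Files/' + file, 'r') as file2:
        for line in file2:
            linetrigrams = trigram_split(line)
            for trigram in linetrigrams:
                if trigram in trigrams:
                    # Cut the line at the trigram. This only behaves as intended when
                    # the trigram occurs once and '###' never appears in the data.
                    line2 = line.replace(trigram, '###').split('###')
                    window = line2[0].split()[-5:] + line2[1].split()[:5]
                    for item in window:
                        if item in mostfreq:
                            matrix[trigram][mostfreq[item]] += 1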

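Any suggestions for making this faster? It may be that I'm using entirely the wrong data structures. trigram_split() simply returns all the trigrams in a line, which are the units I need to build vectors for. trigrams is a list of roughly a million trigrams that I care about. window holds the 5 words before and the 5 words after a trigram (if that trigram is in my list); each of those words is then checked against mostfreq, a dictionary of 1000 words in which each word maps to an integer [0-100]. The result is used to update matrix, a dictionary whose values are lists of 1000 elements, each initialized to 0. The corresponding entry in this pseudo-matrix is incremented in this way.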
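For readers trying to picture the structures described above, here is a minimal sketch with toy data. The names `trigrams`, `mostfreq`, and `matrix` follow the question; the sample values, the helpers `context_window` and `update_matrix`, and the choice to store `trigrams` as a set of word tuples are assumptions made for illustration, not the asker's actual code.

```python
from collections import defaultdict

# Toy stand-ins for the structures in the question (assumed values).
trigrams = {("quick", "brown", "fox")}             # a set gives O(1) membership tests
mostfreq = {"the": 0, "jumps": 1, "over": 2}       # word -> column index
matrix = defaultdict(lambda: [0] * len(mostfreq))  # trigram -> count vector


def context_window(words, start, length=3, size=5):
    """Return up to `size` words before and after words[start:start + length]."""
    before = words[max(0, start - size):start]
    after = words[start + length:start + length + size]
    return before + after


def update_matrix(line):
    """Count frequent words around every known trigram in one line."""
    words = line.split()
    for i in range(len(words) - 2):
        trigram = tuple(words[i:i + 3])
        if trigram in trigrams:
            for word in context_window(words, i):
                column = mostfreq.get(word)
                if column is not None:
                    matrix[trigram][column] += 1


update_matrix("the quick brown fox jumps over the lazy dog")
print(dict(matrix))
```

Working on word indices rather than on `replace`/`split` of the raw line means repeated trigrams and a literal `###` in the data need no special handling, and membership tests against a set stay fast even with around a million entries.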


5 Answers

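Large amounts of data usually live in files, and loading those files fully into memory before searching can be slow or impractical, so it is often better to search the files on disk. Bear in mind, though, that finding the 5 words before a phrase requires random access.

For the search itself, the Boyer–Moore string search algorithm is usually faster than the naive approach. If you work with random-access files, it also lets you avoid reading the whole file into memory.

If your phrases change often but the data does not, consider a full-text search engine, for example one of the full-text search engines available for Python. That may well be overkill for this kind of assignment, though.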
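As a rough illustration of searching a file without turning it into one large Python string, the sketch below uses the standard `mmap` module and its `find` method, then decodes only a bounded number of bytes around each hit to collect the neighboring words. It is not a hand-written Boyer–Moore implementation; the helper name `find_context`, the `span` byte limit, and the sample file are assumptions for illustration.

```python
import mmap
import os
import tempfile


def find_context(path, phrase, size=5, span=200):
    """Yield (before, after) word lists around each occurrence of `phrase`.

    Only `span` bytes on either side of a hit are decoded. A word cut off at
    the edge of that span may come back truncated if the window is too small.
    """
    needle = phrase.encode("utf-8")
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as data:
            pos = data.find(needle)
            while pos != -1:
                end = pos + len(needle)
                before = data[max(0, pos - span):pos].decode("utf-8", "ignore")
                after = data[end:end + span].decode("utf-8", "ignore")
                yield before.split()[-size:], after.split()[:size]
                pos = data.find(needle, end)


# Hypothetical sample file, created only to make the sketch runnable.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one two three four five six TARGET seven eight nine ten eleven twelve")
    sample_path = tmp.name

results = list(find_context(sample_path, "TARGET"))
os.remove(sample_path)
print(results)
```

The memory map gives the random access mentioned above: stepping backwards from a hit is just a slice, and the operating system pages in only the parts of the file that are touched.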

Answered 2025-04-16 by Python大师


You can use the re module.

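import re

# Read the whole file; fine as long as the file fits in memory.
with open('filepath', 'r') as f:
    txt = f.read()

# 'hey' is the search phrase; re.escape guards against regex metacharacters.
phrase = 'hey'
# Captures exactly five words before and five words after the phrase.
matches = re.findall(r'(\w+\s+\w+\s+\w+\s+\w+\s+\w+)\s+%s\s+(\w+\s+\w+\s+\w+\s+\w+\s+\w+)' % re.escape(phrase), txt)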

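This finds the matches in a single file, provided the phrase has at least five words on each side. You will still need os.walk to collect all the files, though.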
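A minimal sketch of combining `os.walk` with the regular expression from this answer might look like the following. The helper names `iter_files` and `find_in_tree` and the temporary corpus directory are made up for illustration, and the files are assumed to be UTF-8 text.

```python
import os
import re
import tempfile


def iter_files(root):
    """Yield the full path of every file below `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def find_in_tree(root, phrase):
    """Return {path: matches} for every file with at least one match."""
    five = r"\s+".join([r"\w+"] * 5)  # five words separated by whitespace
    pattern = re.compile(r"(%s)\s+%s\s+(%s)" % (five, re.escape(phrase), five))
    hits = {}
    for path in iter_files(root):
        with open(path, "r", encoding="utf-8", errors="ignore") as handle:
            matches = pattern.findall(handle.read())
        if matches:
            hits[path] = matches
    return hits


# Hypothetical corpus directory, built only so the sketch can run.
with tempfile.TemporaryDirectory() as corpus:
    os.makedirs(os.path.join(corpus, "sub"))
    with open(os.path.join(corpus, "sub", "a.txt"), "w", encoding="utf-8") as out:
        out.write("a b c d e hey f g h i j")
    found = find_in_tree(corpus, "hey")
    all_matches = list(found.values())

print(all_matches)
```

Compiling the pattern once outside the loop avoids rebuilding it for every file.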

Answered 2025-04-16 by Python大师


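There are several important factors to weigh when comparing approaches:

Multi-line versus single-line matching
The length of each line
The length of the search pattern
The search match rate
What to do when there are fewer than 5 words before or after
How to handle non-word, non-space characters (such as newlines and punctuation)
Should the search be case-insensitive?
What to do about overlapping matches? For example, if the text is We are the knights who say NI! NI NI NI NI NI NI NI NI and you search for NI, what should be returned? Will this ever happen in your data?
What if ### appears in your data?
Would you rather miss some matches or return some extra false ones? With messy real-world data there may be trade-offs.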
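To make a few of these questions concrete, here is a small sketch of one possible policy: split on whitespace, match whole tokens exactly, and return shorter windows near the edges. The function name `windows` and the shortened sample text are assumptions for illustration, and the sketch only handles single-word phrases.

```python
def windows(text, phrase, size=5):
    """Return (before, after) word lists for every token equal to `phrase`."""
    words = text.split()
    result = []
    for i, word in enumerate(words):
        if word == phrase:
            result.append((words[max(0, i - size):i], words[i + 1:i + 1 + size]))
    return result


text = "We are the knights who say NI! NI NI NI"
hits = windows(text, "NI")
print(len(hits))   # number of exact-token matches
print(hits[0])     # window around the first match
print(hits[-1])    # window around the last match
```

Under this policy "NI!" is not a match because of the punctuation, every later NI is reported separately with overlapping windows, and the last match simply gets an empty list of following words. Other policies are equally defensible; the point is that the choice has to be made explicitly.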

You could try a regular expression...

import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
# Up to five whitespace-delimited tokens on each side of the (escaped) phrase.
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % re.escape(searchvar), re.IGNORECASE | re.MULTILINE)
print(dutchre.findall(zen))
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]

Another approach, which I personally think gives worse results...

def splitAndFind(text, phrase):
    # Returns None when the phrase is absent; assumes '###' never occurs in text.
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return (text2[0].split()[-5:], text2[1].split()[:5])
print(splitAndFind(zen, 'Dutch'))
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])

In IPython, timing these is easy:

timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop
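Outside IPython, similar measurements can be taken with the standard `timeit` module. The sketch below is only an assumption about how one might set that up: the sample text, the `split_and_find` helper (a variant that splits on the phrase directly), and the repeat counts are illustrative, and the numbers it prints will differ from machine to machine.

```python
import timeit

# Illustrative sample text; any sufficiently long string would do.
text = "There should be one obvious way to do it unless you're Dutch. " * 50


def split_and_find(text, phrase):
    """Return the five words before and after the first occurrence of phrase."""
    parts = text.split(phrase)
    if len(parts) > 1:
        return parts[0].split()[-5:], parts[1].split()[:5]
    return None


candidates = {
    "in": lambda: "Dutch" in text,
    "find": lambda: text.find("Dutch"),
    "split": lambda: split_and_find(text, "Dutch"),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, 1000 calls each, stored as seconds per call.
    timings[label] = min(timeit.repeat(func, repeat=3, number=1000)) / 1000

for label, seconds in timings.items():
    print("%-5s %.3f us per call" % (label, seconds * 1e6))
```

Taking the minimum over several repeats, as IPython's "best of 3" does, reduces the influence of other activity on the machine.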

Answered 2025-04-16 by Python大师

