如何在文本文件中使用Python进行二分查找关键词？

11 投票

6 回答

9386 浏览

数据工程师

提问于 2025-04-16 13:09

这个文本文件里有两列数据——第一列是索引号（占5个空格），第二列是字符（占30个空格）。这些数据是按照字母顺序排列的。我想用二分查找的方法来搜索一个关键词。

数据结构文本处理搜索算法二分查找

6 个回答

如果你需要在一个文件中找到一个特定的关键词：

line_with_keyword = next((line for line in open('file') if keyword in line),None)
if line_with_keyword is not None: 
   print line_with_keyword # found

如果想要找到多个关键词，你可以使用 set()，就像@kriegar 提到的那样：

def extract_keyword(line):
    return line[5:35] # assuming keyword starts on 6 position and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation
    if keyword in keywords: # O(1) search
       print(keyword)

你也可以用 dict() 来代替 set()，这样可以保留 index 信息。

下面是如何在文本文件中进行二分查找的方法：

import bisect

lines = open('file').readlines() # O(n) list creation
keywords = map(extract_keyword, lines) 
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
   print(lines[i]) # found

与使用 set() 的方法相比，这种方法没有什么优势。

注意：除了第一种方法，其他所有方法都会把整个文件加载到内存中。FileSearcher() 是 @Mahmoud Abdelkader 提出的，使用这个方法不需要把整个文件加载到内存中。

回答于 2025-04-16 由 Python大师

分享举报

你需要进行二分查找吗？如果不需要的话，可以试着把你的平面文件转换成一个cdb（常量数据库）。这样可以让你快速查找某个词的索引，速度非常快：

import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()

在另一个脚本中，可以对它进行查询：

import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345

回答于 2025-04-16 由 Python大师

分享举报

这里有一个有趣的方法，可以使用Python自带的bisect模块来实现。

import bisect
import os


class Query(object):

    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]


class FileSearcher(object):

    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)


if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)

        searchable_file = FileSearcher(file_to_search)
        print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)

回答于 2025-04-16 由 Python大师

分享举报

如何在文本文件中使用Python进行二分查找关键词？

6 个回答

撰写回答