Python: the in operator vs. regular expressions

foo = "some long and boring medical paper [...] that I'm searching"
bar = [["array of medical terms matched with an unique code", 1],
       ["also they are sorted by length", 2]]

array_to_be_inserted_in_database = []
for term in bar:
    if term[0] in foo:
        repetitions = foo.count(term[0])
        # list.append takes a single argument, so store the pair as a tuple
        array_to_be_inserted_in_database.append((term[1], repetitions))

import timeit

# regex version: remove every match and count the substitutions in one call
t = timeit.Timer(
    're.subn(regex, "", frase)',
    setup='import re; frase = "el gato gordo de la abuela"; palabra = "gordo"; regex = re.compile(palabra)',
)

# "in" version: membership test, then count, then replace
ordenes = """\
if palabra in frase:
    numero = frase.count(palabra)
    frase.replace(palabra, "")
"""
y = timeit.Timer(stmt=ordenes, setup='frase = "el gato gordo de la abuela"; palabra = "gordo"')

print(t.timeit(number=1000))
print(y.timeit(number=1000))

2 answers

User

#1 · edited 2024-06-17 11:56:07

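You could build an index over all the papers that records which words appear in which papers, and then search only the papers that contain every word of a term, instead of searching every paper for every term. That way you scan each paper once to build the index, and afterwards you only need a full-text search on the papers that you already know contain all the relevant words.

Very simple pseudocode: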

from collections import defaultdict
from functools import reduce

# `terms` (iterable of strings) and `papers` (hashable objects with a .text
# attribute) are assumed to be defined elsewhere

# get interesting words
interesting_words = set(word for term in terms for word in term.split())

# build index, mapping interesting words to papers they appear in
index = defaultdict(set)
for paper in papers:
    for word in paper.text.split():
        if word in interesting_words:
            index[word].add(paper)

# find full terms in papers that have all the words, according to the index
for term in terms:
    candidates = reduce(set.intersection, (index[word] for word in term.split()))
    for paper in candidates:
        if term in paper.text:
            print("full term found:", term)

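(Note: I'm not an expert in indexing or information retrieval. There may be better ways to build such an index, and there are probably libraries that already do this.)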
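A self-contained sketch of the same idea, tied back to the (code, repetitions) pairs the question wants to store. The `Paper` class, the `find_terms` helper and the sample texts are all made up for illustration. Words are split on whitespace only, so punctuation attached to a word would keep it out of the index.

```python
from collections import defaultdict
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)  # frozen makes instances hashable, so they can live in sets
class Paper:  # hypothetical container, not from the original question
    title: str
    text: str


def find_terms(papers, terms):
    """Return {paper title: [(code, repetitions), ...]} for (term, code) pairs."""
    interesting_words = {word for term, _ in terms for word in term.split()}

    # One pass over every paper to build the word -> papers index.
    index = defaultdict(set)
    for paper in papers:
        for word in set(paper.text.split()):
            if word in interesting_words:
                index[word].add(paper)

    results = defaultdict(list)
    for term, code in terms:
        # Only papers containing every word of the term can contain the term.
        candidates = reduce(
            set.intersection, (index.get(w, set()) for w in term.split())
        )
        for paper in candidates:
            repetitions = paper.text.count(term)
            if repetitions:
                results[paper.title].append((code, repetitions))
    return dict(results)


papers = [
    Paper("a", "acute renal failure after surgery and renal failure again"),
    Paper("b", "failure of the renal artery"),
]
terms = [("renal failure", 1), ("surgery", 2)]
found = find_terms(papers, terms)
print(found)
```

Paper "b" contains both words of "renal failure" and so survives the index lookup, but the full-text check then rejects it. That second step is why the index only narrows the candidates rather than replacing the search.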

User

#2 · edited 2024-06-17 11:56:07

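If you are only dealing with literal strings (not patterns), and you don't mind a term like gut also matching inside a longer word like gutter, then in will probably be faster.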
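To make the gut/gutter caveat concrete, here is a minimal sketch with a made-up sentence. `in` and `str.count` match substrings anywhere, while a regex wrapped in `\b` word boundaries only matches whole words.

```python
import re

text = "the gutter was blocked, but the gut was fine"  # made-up sample
term = "gut"

substring_hit = term in text        # also satisfied by the "gut" inside "gutter"
substring_count = text.count(term)  # counts both "gutter" and "gut"

# \b anchors the match at word boundaries, so "gutter" is skipped.
whole_word = re.compile(r"\b" + re.escape(term) + r"\b")
whole_word_count = len(whole_word.findall(text))

print(substring_hit, substring_count, whole_word_count)
```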

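On the other hand, you can use re.findall() to find all the matches in one go and take the length of the resulting list, so you don't have to go over the string twice (once to find, once to count). Neutralizing special characters is easy: just call re.escape() on the string, which ensures the text is matched literally.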
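A minimal sketch of that approach, using a list of (term, code) pairs like the one in the question. The `count_terms` helper and the sample text are illustrative assumptions. Each term still needs its own scan of the text, but only one scan instead of two.

```python
import re


def count_terms(text, terms):
    """Return [(code, repetitions), ...] for every (term, code) found in text."""
    results = []
    for term, code in terms:
        # re.escape makes characters such as "." or "(" match literally.
        matches = re.findall(re.escape(term), text)
        if matches:
            results.append((code, len(matches)))
    return results


text = "IL-2 (interleukin 2) levels rose; IL-2 was measured twice."
terms = [("IL-2", 1), ("(interleukin 2)", 2), ("IL.2", 3)]
counts = count_terms(text, terms)
print(counts)
```

Without `re.escape()`, the third term `IL.2` would be read as a pattern in which `.` matches any character, so it would match `IL-2`. The parentheses in the second term would be read as a group rather than as literal characters.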

In the end, the only way to know for sure is to test both solutions with real-world data.
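One way such a test could be set up, as a sketch: both approaches are wrapped in functions, checked to give the same count, and then timed with `timeit`. The text here is synthetic (the question's sample sentence repeated) and the function names are made up. Real papers and real term lists should be substituted, and no particular timing result is implied.

```python
import re
import timeit

frase = "el gato gordo de la abuela " * 1000  # synthetic stand-in for a real paper
palabra = "gordo"
regex = re.compile(re.escape(palabra))


def count_with_in(text, term):
    # Membership test first, then count, as in the question.
    return text.count(term) if term in text else 0


def count_with_regex(text, pattern):
    # Single scan: collect all matches and count them.
    return len(pattern.findall(text))


# Make sure both approaches agree before comparing their speed.
same_result = count_with_in(frase, palabra) == count_with_regex(frase, regex)

t_in = timeit.timeit(lambda: count_with_in(frase, palabra), number=100)
t_re = timeit.timeit(lambda: count_with_regex(frase, regex), number=100)
print(f"in/count: {t_in:.4f}s  regex: {t_re:.4f}s  same result: {same_result}")
```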
