Matching multiple words against text using regular expressions

import re

# Uncomment when matching 4-gram words
# findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*(?:\s[A-Z]\w*(?:\s[A-Z]\w*)?)?)?)')
# Uncomment when matching tri-gram words
# findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*(?:\s[A-Z]\w*)?)?)')
# Uncomment when matching bi-gram words
findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')


def is_name_in_text(text, matching_list):
    for possible_name in set(findnames.findall(text)):
        if possible_name in matching_list:
            print(possible_name)
            return possible_name

1 answer

User

#1 · Posted 2024-05-26 21:53:44

You want to match word-level n-grams, specifically word-level bigrams.

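However, the regex you provided, ([A-Z]\w*(?:\s[A-Z]\w*)?), matches any string of word characters that starts with a character in the range A to Z, optionally followed by one whitespace character and another such string.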

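This regex will never match c# developer, because the phrase does not start with a character in A to Z and it contains #. It will not match .net engineer either, because that starts with a dot. Also note that you are trying to match .net engineer, while the text contains .Net engineer.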
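A quick check makes this concrete. The snippet below is a minimal sketch that runs the bigram pattern from the question against the tail of the sample text; the variable names sample and found are only for illustration.

```python
import re

# Bigram pattern from the question: a capitalised word, optionally followed
# by one whitespace character and a second capitalised word.
findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')

sample = '.Net engineer c# developer'
found = findnames.findall(sample)

# Only the capitalised fragment 'Net' is picked up. The leading '.', the
# lowercase words and 'c#' can never be part of a match.
print(found)
```

Neither '.Net engineer' nor 'c# developer' shows up in the result, so a lookup in matching_list cannot succeed for them.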

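Furthermore, because you use the regex with findall, it consumes the string in pairs of capitalised words, which prevents any word from being reused. After matching Corporate Account, it will never match Account Manager, because the Account part has already been consumed. You are using non-capturing groups, but the regex still consumes that part of the string.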
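The consumption behaviour is easy to reproduce. The sketch below first runs the question's pattern, then, as an aside that is not part of the original answer, shows a lookahead variant that reports overlapping pairs. The lookahead pattern is my own assumption for illustration; it is still case-sensitive and still cannot see '#' or a leading dot.

```python
import re

findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
sample = 'Corporate Account Manager'

# findall returns non-overlapping matches: 'Account' is consumed by the
# first match, so 'Account Manager' is never reported.
consumed = findnames.findall(sample)
print(consumed)

# Assumption (not in the original answer): wrapping the pair in a lookahead
# makes each match zero-width, so the scan can restart at every word boundary.
overlapping = re.findall(r'(?=\b([A-Z]\w*\s[A-Z]\w*))', sample)
print(overlapping)
```

The lookahead fixes only the overlap problem, which is why plain word splitting remains the simpler route for the full requirement.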

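Assuming you really want case-insensitive matching of word-level n-grams, including special characters such as #, I don't think a single regex will do what you need, but some fairly basic Python code will get you there.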

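Consider that filtering out every n-gram that does not consist entirely of word characters or your chosen special characters is probably not efficient. Why not simply split the string on whitespace and then look for the n-grams you want?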

import re

text = 'I am a Corporate Account Manager with experience as Data Scientist' \
       ' Associate Research Scientist Post Doctoral Research Fellow Research' \
       ' Scientist Assistant Professor .Net engineer c# developer'

matching_list = [
    'Data Scientist',
    'Associate Research Scientist',
    'Post Doctoral Research Fellow',
    'Research Scientist',
    'Assistant Professor',
    'c# developer',
    '.net engineer'
]


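def get_ngrams(words, n):
    # zip the word list against itself shifted by 0..n-1 positions;
    # zip stops at the shortest slice, giving len(words) - n + 1 tuples
    return zip(*[words[m:] for m in range(n)])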


def main():
    # split the text on whitespace; words = text.lower().split() would do the same
    regex = re.compile(r'[^\s]+')
    words = regex.findall(text.lower())
    # turn the list of words into ngrams of the needed length
    ngrams = list(get_ngrams(words, 2))
    # also create ngrams for the phrases in matching_list 
    # then link them to the phrases in a dict for easy reference
    matching_ngrams = {
        k: v for k, v in zip(
            [tuple(x.lower().split()) for x in matching_list], matching_list 
        )
    }

    # find all the matching ones and print the matching phrase when found
    for find_this in ngrams:
        if find_this in matching_ngrams:
            print(matching_ngrams[find_this])


main()

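Note that this still produces duplicate results, whereas you indicated that you expect each result only once. You can get that by flipping the loop and the comparison: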

    for find_this in matching_ngrams:
        if find_this in ngrams:
            print(matching_ngrams[find_this])

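This walks through the longer list more often and therefore takes more time, but it prints each phrase only once if it appears in the text. Alternatively, you can write a function that returns all matches and put them in a set.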
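If the repeated scan of the list is a concern, one option is to turn the n-grams into a set before flipping the loop. This is a minimal sketch under that assumption, using a shortened text and phrase list of my own rather than the full sample:

```python
def get_ngrams(words, n):
    # Slide a window of n words over the list.
    return zip(*[words[m:] for m in range(n)])


text = 'Research Scientist and Research Scientist again'
matching_list = ['Research Scientist', 'Data Scientist']

# A set drops repeated n-grams and gives constant-time membership tests.
seen_ngrams = set(get_ngrams(text.lower().split(), 2))

# Iterating over the phrases means each one is reported at most once.
found = [phrase for phrase in matching_list
         if tuple(phrase.lower().split()) in seen_ngrams]
print(found)
```

Here 'Research Scientist' occurs twice in the text but is reported once, and 'Data Scientist' is not reported at all.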

To avoid the lists, the inefficient lookups and the unnecessary re, I would prefer this:

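def get_ngrams(words, n):
    # zip the word list against itself shifted by 0..n-1 positions;
    # zip stops at the shortest slice, giving len(words) - n + 1 tuples
    return zip(*[words[m:] for m in range(n)])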


def find_matching_ngrams(text, phrases, n):
    ngrams_phrases = {
        k: v for k, v in zip(
            [tuple(x.lower().split()) for x in phrases], phrases
        )
    }

    for ngram in get_ngrams(text.lower().split(), n):
        if ngram in ngrams_phrases:
            yield ngrams_phrases[ngram]


def main():
    text = 'I am a Corporate Account Manager with experience as Data Scientist' \
           ' Associate Research Scientist Post Doctoral Research Fellow Research' \
           ' Scientist Assistant Professor .Net engineer c# developer'

    matching_list = [
        'Data Scientist',
        'Associate Research Scientist',
        'Post Doctoral Research Fellow',
        'Research Scientist',
        'Assistant Professor',
        'c# developer',
        '.net engineer'
    ]

    print(set(find_matching_ngrams(text, matching_list, 2)))


main()

This version of get_ngrams may be slightly more efficient:

def get_ngrams(words, n):
    for m in range(len(words)-(n-1)):
        yield tuple(words[m:m+n])
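One thing the code above leaves open: matching_list also contains three-word and four-word phrases, and a pass with n=2 can only ever find the bigrams. The following is a rough sketch of one way to cover every phrase length in a single call, assuming that is what is wanted. The function name find_phrases and the grouping by word count are my own additions, not part of the original answer.

```python
from collections import defaultdict


def get_ngrams(words, n):
    # Yield every run of n consecutive words as a tuple.
    for m in range(len(words) - (n - 1)):
        yield tuple(words[m:m + n])


def find_phrases(text, phrases):
    # Hypothetical helper: group the lowercased phrases by word count so
    # each n-gram length is generated only when some phrase needs it.
    by_length = defaultdict(dict)
    for phrase in phrases:
        key = tuple(phrase.lower().split())
        by_length[len(key)][key] = phrase

    words = text.lower().split()
    found = {}  # dict keys drop duplicates and keep first-seen order
    for n, lookup in by_length.items():
        for ngram in get_ngrams(words, n):
            if ngram in lookup:
                found[lookup[ngram]] = None
    return list(found)


text = ('I am a Corporate Account Manager with experience as Data Scientist '
        'Associate Research Scientist Post Doctoral Research Fellow')
phrases = ['Data Scientist', 'Associate Research Scientist',
           'Post Doctoral Research Fellow', 'Research Scientist']
result = find_phrases(text, phrases)
print(result)
```

Using a dict instead of a set for the results is a small design choice: it removes duplicates just the same, but the phrases come back in a predictable order.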
