Regular expression to find valid Sphinx fields

数据工程师

Asked 2025-04-15 21:50

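I'm trying to validate that the fields passed to Sphinx are valid, but I'm having some difficulty.

Suppose the valid fields are cat, mouse, dog, and puppy.

Valid searches would then include: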

@cat search terms
@(cat) search terms
@(cat, dog) search term
@cat searchterm1 @dog searchterm2
@(cat, dog) searchterm1 @mouse searchterm2

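So I want to use a regular expression to find the terms cat, dog, and mouse in the examples above and check them against a list of valid terms.

Thus, a query like: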

@(山羊)

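would produce an error, because goat is not a valid term.

I've been able to find simple queries such as @cat with this regex: (?:@)([^( ]*)

But I can't figure out how to find the rest.

I'm using Python and Django, if that helps.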


6 Answers

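This pyparsing solution follows logic similar to the answer you posted. It matches every tag, checks the terms against a list of known valid tags, and removes the valid ones from the results. Only matches that still have terms left after the valid ones are removed get reported.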

from pyparsing import *

# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"@()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )

# define tags we consider to be valid
valid = set("cat mouse dog".split())

# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not(tokens):
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)


##### Test out the parser #####

test = """@cat search terms @ house
    @(cat) search terms 
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2 
    @caterpillar"""

# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" "*(col(s,test)-1)+"^")
    print()

This gives these nice results:

Terms:['goat'] Line: 3 Col: 29
    @(cat, dog) search term @(goat)
                            ^

Terms:['doggerel'] Line: 4 Col: 39
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
                                      ^

Terms:['caterpillar'] Line: 6 Col: 5
    @caterpillar
    ^

This last snippet does all the scanning for you and just gives you the list of invalid tags it found:

# print out all of the found invalid terms
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))

Output:

['caterpillar', 'goat', 'doggerel']

Answered 2025-04-15 by Python大师


To match all allowed fields, the following rather complicated-looking regex works:

@((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))

For the sample searches in the question, it returns these matches, in order: @cat, @(cat), @(cat, dog), @cat, @dog, @(cat, dog), @mouse.

The regex breaks down as follows:

@                               # the literal character "@"
(                               # match group 1
  (?:cat|mouse|dog|puppy)       #  one of your valid search terms (not captured)
  \b                            #  a word boundary
  |                             #  or...
  \(                            #  a literal opening paren
  (?:                           #  non-capturing group
    (?:cat|mouse|dog|puppy)     #   one of your valid search terms (not captured)
    (?:                         #   non-capturing group
      , *                       #    a comma "," plus any number of spaces
      |                         #    or...
      (?=\))                    #    a position followed by a closing paren
    )                           #   end non-capture group
  )+                            #  end non-capture group, repeat
  \)                            #  a literal closing paren
)                               # end match group one.
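As a quick check of the pattern above, here is a minimal sketch that runs it over the sample searches from the question using Python's `re` module. The names `VALID_FIELD` and `sample` are made up for this illustration.

```python
import re

# The pattern from this answer, unchanged; VALID_FIELD is an illustrative name.
VALID_FIELD = re.compile(
    r"@((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))"
)

# Sample searches taken from the question, one per line.
sample = "\n".join([
    "@cat search terms",
    "@(cat) search terms",
    "@(cat, dog) search term",
    "@cat searchterm1 @dog searchterm2",
    "@(cat, dog) searchterm1 @mouse searchterm2",
])

# group(0) is the whole match, including the leading "@".
matches = [m.group(0) for m in VALID_FIELD.finditer(sample)]
print(matches)
```

Anything the pattern does not recognize, such as @(goat) or @caterpillar, simply produces no match here.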

Now, if you want to find any invalid searches, wrap the above in a negative lookahead:

@(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^

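This finds any @ that is followed by an invalid search term (or combination of terms). Modifying it so that it also matches the invalid attempts, rather than just pointing at them, is not too hard.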
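One possible way to do that, sketched under my own assumptions: keep the negative lookahead and append a capturing group that consumes whatever follows the @, either a bare word or a parenthesized list. The trailing group `(\w+|\([^)]*\))` and the name `INVALID_FIELD` are additions for illustration and are not part of the pattern above.

```python
import re

# Negative lookahead from this answer, followed by an assumed capturing group
# that grabs the rejected field spec (a bare word or a parenthesized list).
INVALID_FIELD = re.compile(
    r"@(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))"
    r"(\w+|\([^)]*\))"
)

query = "@(cat, dog) search term @(goat) @caterpillar @cat x @(cat, doggerel)"

# findall returns the contents of the single capturing group for each match.
rejected = INVALID_FIELD.findall(query)
print(rejected)
```

Note that a parenthesized list is reported as a whole, so `(cat, doggerel)` still has to be split to learn which name was the bad one.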

You would have to generate the (?:cat|mouse|dog|puppy) part dynamically and plug it into the static remainder of the regex. That shouldn't be too hard either.
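A minimal sketch of that dynamic step, assuming the valid names are available as a Python list. `build_valid_pattern` is a hypothetical helper name, and `re.escape` is used so that a name containing regex metacharacters cannot break the pattern.

```python
import re

def build_valid_pattern(valid_fields):
    """Compile the valid-field regex from this answer for any list of names."""
    # Escape each name, then join them into the (?:a|b|c) alternation.
    names = "(?:" + "|".join(re.escape(name) for name in valid_fields) + ")"
    # The static parts are copied from the regex shown earlier in this answer.
    return re.compile(r"@(" + names + r"\b|\((?:" + names + r"(?:, *|(?=\))))+\))")

pattern = build_valid_pattern(["cat", "mouse", "dog", "puppy"])
found = [m.group(0) for m in pattern.finditer("@(cat, dog) one @mouse two @(goat)")]
print(found)
```

In a Django project the list could come from wherever the field names are already defined, which keeps the regex and the index configuration from drifting apart.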

Answered 2025-04-15 by Python大师


I ended up doing this a different way, since none of the above worked for me. First I found the fields that look like @cat, with this:

attributes = re.findall(r'(?:@)([^\( ]*)', query)

Next, I found the more complicated ones, with this:

regex0 = re.compile(r'''
    @               # at sign
    (?:             # start non-capturing group
        \w+             # word characters, one or more
        \b              # a word boundary (i.e. no more \w)
        |               # OR
        (               # capturing group
            \(              # left paren
            [^@(),]+        # anything except @ ( ) ,
            (?:                 # another non-capturing group
                ,[ ]*           # a comma, then optional spaces (VERBOSE ignores bare spaces)
                [^@(),]+        # anything except @ ( ) ,
            )*              # zero or more of this non-capturing group
            \)              # a right paren
        )               # end of capturing group
    )           # end of non-capturing group
    ''', re.VERBOSE)

# and this puts them into the attributes list.
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    # item is "" for bare fields like @cat, which the first findall already caught
    attributes.extend(part.strip() for part in item.strip("()").split(","))

Then I checked whether the attributes I found were valid, and added the invalid ones (uniquely) to an array:

# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')

# if they aren't add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) is None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)
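For comparison, the three steps above can be folded into one helper. This is only a sketch under assumptions of mine: the names `FIELD_SPEC` and `find_invalid_fields` are hypothetical, the valid names are passed in as a set instead of being baked into a regex, and the field list is the one from the question.

```python
import re

# One pass over the query: a bare @word lands in group 1,
# the inside of an @( ... ) list lands in group 2.
FIELD_SPEC = re.compile(r"@(?:(\w+)|\(([^@()]*)\))")

def find_invalid_fields(query, valid_fields):
    """Return the field names in query that are not valid, in order, without duplicates."""
    valid = {name.lower() for name in valid_fields}
    bad = []
    for single, group in FIELD_SPEC.findall(query):
        # Split a parenthesized list on commas; a bare field is a one-item list.
        names = [single] if single else [part.strip() for part in group.split(",")]
        for name in names:
            # Skip empty names (e.g. from "@()"), valid names, and repeats.
            if name and name.lower() not in valid and name not in bad:
                bad.append(name)
    return bad

query = "@(cat, dog) search term @(goat) @caterpillar @cat x @(cat, doggerel)"
print(find_invalid_fields(query, {"cat", "mouse", "dog", "puppy"}))
```

Using a set lookup instead of `validRegex` means adding a field is a data change and needs no edit to the pattern.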

Thanks to everyone for the help, though. I'm glad I had it!

Answered 2025-04-15 by Python大师
