<p>In your last regular expression<sup>[1]</sup></p>
<pre><code>re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
</code></pre>
<p>you got no match, because you made several mistakes:</p>
<ul>
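<li><code>\w+</code> means one or more word characters, whereas <code>\W+</code> means one or more non-word characters.</li>
<li>In some places the <code>\b</code> boundary anchor is in the wrong position (namely, between the initial and the rest of the word).</li>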
</ul>
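<p>A minimal sketch of both points, run against a shortened sample sentence (the sentence and the names <code>broken</code> and <code>fixed</code> are only assumptions for illustration):</p>

```python
import re

# Sample sentence, assumed here only for illustration.
text = 'They posted out the United States Navy Seals (USNS) to the area.'

# \W+? demands NON-word characters right after the initial "U",
# but "U" is followed by the letters "nited", so nothing matches.
broken = re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)

# \w+ consumes the rest of each word, \s the single space between two words.
fixed = re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text)

print(broken)          # None
print(fixed.group(0))  # United States Navy Seals
```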
<pre><code>re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text)
</code></pre>
<p>should match.</p>
<p>Likewise,</p>
<pre><code>print(re.search(r'\bu\w+?g\w+\sf\w+', text))
</code></pre>
<p>certainly matches <code>underground facility</code>, but in a long text there will be many more matches that are irrelevant.</p>
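<p>To illustrate what such an irrelevant match can look like, here is a small sketch with a made-up two-sentence text (the text is purely hypothetical). Any two consecutive words that happen to contain the letters u, g and f in the right positions will match:</p>

```python
import re

# Hypothetical text: only the second sentence contains the wanted term.
text = ('We kept urging forward for hours. '
        'Entrance was through an underground facility (UGF).')

pattern = r'\bu\w+?g\w+\sf\w+'

# findall returns every non-overlapping match, wanted or not.
matches = re.findall(pattern, text)
print(matches)  # ['urging forward', 'underground facility']
```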
<h3>Generalizing the approach</h3>
<p>Finally, I built a small "machine" that dynamically creates regular expressions from known abbreviations:</p>
<pre class="lang-py prettyprint-override"><code>import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
abbrs = ['USNS', 'UGF', 'NFZ', 'AR']
for abbr in abbrs:
    # one "[Xx][a-z]+[ a-z-]" chunk per initial of the abbreviation
    pattern = ''.join(map(lambda i: '['+i.upper()+i.lower()+'][a-z]+[ a-z-]', abbr))
    print(pattern)
    print(re.search(pattern, text, flags=re.IGNORECASE))
</code></pre>
<p>The output of the script above is:</p>
<pre class="lang-none prettyprint-override"><code>[Uu][a-z]+[ a-z-][Ss][a-z]+[ a-z-][Nn][a-z]+[ a-z-][Ss][a-z]+[ a-z-]
<re.Match object; span=(20, 45), match='United States Navy Seals '>
[Uu][a-z]+[ a-z-][Gg][a-z]+[ a-z-][Ff][a-z]+[ a-z-]
<re.Match object; span=(89, 110), match='underground facility '>
[Nn][a-z]+[ a-z-][Ff][a-z]+[ a-z-][Zz][a-z]+[ a-z-]
<re.Match object; span=(140, 152), match='no-fly-zone '>
[Aa][a-z]+[ a-z-][Rr][a-z]+[ a-z-]
<re.Match object; span=(170, 184), match='assault-rifle '>
</code></pre>
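<h3>Generalizing further</h3>
<p>If we assume that each abbreviation in the text is introduced directly after the first occurrence of its long form, and further assume that the long form always starts and ends at a word boundary (with no assumptions about capitalization or the use of hyphens), we can try to extract a glossary automatically, like this:</p>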
<pre class="lang-py prettyprint-override"><code>import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
# build a regex for an initial
def init_re(i):
    return f'[{i.upper()+i.lower()}][a-z]+[ -]??'

# build a regex for an abbreviation
def abbr_re(abbr):
    return r'\b'+''.join([init_re(i) for i in abbr])+r'\b'

# build an inverse glossary from a text
def inverse_glossary(text):
    # raw string, so that \( and \) reach the regex engine unchanged
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    igloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]  # strip the parentheses
        pattern = '('+abbr_re(abbr)+') '+r'\('+abbr+r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            igloss[longform] = abbr
    return igloss

igloss = inverse_glossary(text)
for long in igloss:
    print('{} -> {}'.format(long, igloss[long]))
</code></pre>
<p>The output is</p>
<pre class="lang-none prettyprint-override"><code>no-fly-zone -> NFZ
United States Navy Seals -> USNS
assault-rifle -> AR
underground facility -> UGF
</code></pre>
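<p>With the inverse glossary you can easily replace all long forms with their corresponding abbreviations. Doing this for all but the first occurrence is a bit harder. There is plenty of room for refinement, for example handling line breaks inside long forms correctly (and also using <a href="https://docs.python.org/3/library/re.html#re.compile" rel="nofollow noreferrer">re.compile</a>).</p>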
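<p>One possible sketch of such a replacement follows. It is not a finished solution. It assumes that the occurrence to keep is the one directly followed by its own <code>(ABBR)</code> introduction, and it allows any whitespace, including a line break, between the words of a long form. The function name <code>abbreviate</code> and the sample text are hypothetical, and the inverse glossary is hard-coded so that the snippet runs on its own:</p>

```python
import re

# Inverse glossary (long form -> abbreviation), hard-coded for this sketch.
igloss = {
    'United States Navy Seals': 'USNS',
    'underground facility': 'UGF',
    'no-fly-zone': 'NFZ',
    'assault-rifle': 'AR',
}

def abbreviate(text, igloss):
    """Replace long forms by their abbreviations, except where the long
    form is directly followed by its own '(ABBR)' introduction."""
    for longform, abbr in igloss.items():
        # Any run of whitespace (including a newline) may separate the words.
        words = r'\s+'.join(re.escape(w) for w in longform.split())
        # Negative lookahead: skip the occurrence that introduces the abbreviation.
        pattern = re.compile(
            r'\b' + words + r'\b(?!\s*\(' + re.escape(abbr) + r'\))',
            flags=re.IGNORECASE)
        text = pattern.sub(abbr, text)
    return text

sample = ('An underground facility (UGF) was found. '
          'The underground\nfacility was empty.')
print(abbreviate(sample, igloss))
# An underground facility (UGF) was found. The UGF was empty.
```

<p>Compiling each pattern with <code>re.compile</code> mainly pays off when the same glossary is applied to many texts; for a single pass it is a matter of taste.</p>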
<p>To replace abbreviations with their long forms, you have to build a <strong>normal</strong> glossary instead of an inverse one:</p>
<pre class="lang-py prettyprint-override"><code># build a glossary from a text
# (reuses abbr_re() and text from the previous script)
def glossary(text):
    # raw string, so that \( and \) reach the regex engine unchanged
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    gloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]  # strip the parentheses
        pattern = '('+abbr_re(abbr)+') '+r'\('+abbr+r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            gloss[abbr] = longform
    return gloss

gloss = glossary(text)
for abbr in gloss:
    print('{}: {}'.format(abbr, gloss[abbr]))
</code></pre>
<p>The output here is</p>
<pre class="lang-none prettyprint-override"><code>AR: assault-rifle
NFZ: no-fly-zone
UGF: underground facility
USNS: United States Navy Seals
</code></pre>
<p>The replacement itself is left to the reader.</p>
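<p>As a possible starting point, here is one way it might look. This is only a sketch under stated assumptions: the glossary is hard-coded so that the snippet is self-contained, the function name <code>expand</code> and the sample text are hypothetical, and an abbreviation directly preceded by an opening parenthesis is treated as the introduction and left alone:</p>

```python
import re

# Glossary (abbreviation -> long form), hard-coded for this sketch.
gloss = {
    'AR': 'assault-rifle',
    'NFZ': 'no-fly-zone',
    'UGF': 'underground facility',
    'USNS': 'United States Navy Seals',
}

def expand(text, gloss):
    """Replace each stand-alone abbreviation by its long form."""
    alternatives = '|'.join(re.escape(a) for a in gloss)
    # (?<!\() leaves the introducing "(ABBR)" untouched;
    # the \b anchors prevent hits inside longer words.
    pattern = re.compile(r'(?<!\()\b(' + alternatives + r')\b')
    return pattern.sub(lambda m: gloss[m.group(1)], text)

sample = ('The underground facility (UGF) was dark. '
          'The USNS left the UGF via the NFZ.')
print(expand(sample, gloss))
# The underground facility (UGF) was dark. The United States Navy Seals left the underground facility via the no-fly-zone.
```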
<hr/>
<p><sup>[1]</sup>
Let's take another, closer look at your first regular expression:</p>
<pre><code>re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
</code></pre>
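<p>The boundary anchors (<code>\b</code>) after <code>States</code> and <code>Navy</code> are redundant. They can be removed without changing anything in the result, because the <code>\W+?</code> that follows already requires at least one non-word character after the last character of <code>States</code> and <code>Navy</code>. They do not cause any problems here, but I suspect they led to the confusion once you started modifying the expression to get a more general version.</p>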
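<p>A quick check of this for the anchors between the words, using a shortened sample sentence (the sentence and the variable names are assumptions for illustration):</p>

```python
import re

text = 'They posted out the United States Navy Seals (USNS) to the area.'

# Original pattern, with \b after "States" and "Navy" ...
with_anchors = re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
# ... and the same pattern with those two anchors removed.
without_inner = re.search(r'\bUnited\W+?States\W+?Navy\W+?Seals\b', text)

# Both searches find exactly the same span.
print(with_anchors.span(), without_inner.span())  # (20, 44) (20, 44)
```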