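<p>I am trying to extract tokens, or parts of tokens, from a text: the numeric/alphanumeric runs that are longer than 8 characters.</p>
<p>For example:</p>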
<pre><code>text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'
</code></pre>
<p>The expected output would be:</p>
<pre><code>59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7
</code></pre>
<p>I tried the regex <code>((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)</code>, based on the answer to <a href="https://stackoverflow.com/questions/22997088/python-alphanumeric-regex">Python Alphanumeric Regex</a>. I got the following result:</p>
<pre><code>[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)]
Output :
[('59800512', '59800512', ''),
('510557', '510557', ''),
('XXXXXX2302', '', 'XXXXXX2'),
('1601371803', '1601371803', ''),
('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
('69', '69', ''),
('i57j0i22i30l8j0i390', '', 'i5'),
('4672', '4672', ''),
('j0j7', '', 'j0'),
('8', '8', '')]
</code></pre>
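<p>For every matched token I get a tuple of the match groups, and some tokens (such as <code>510557XXXXXX2302</code>) are split into several matches.</p>
<p>These tuples could be filtered again, but I am trying to keep the code as efficient and Pythonic as possible.</p>
<p>Can anyone suggest a solution? It does not need to be regex-based.</p>
<p>Thanks in advance.</p>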
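<p>To illustrate where the tuples come from: <code>re.findall</code> returns one tuple per match when the pattern contains more than one capturing group, and plain strings when the groups are non-capturing. The short sketch below uses a made-up <code>sample</code> string (not the full text above) and a non-capturing variant of the same pattern. It only shows the return shape and does not fix the splitting of tokens.</p>

```python
import re

# Hypothetical shortened sample, chosen only to keep the output small.
sample = "510557XXXXXX2302 j0j7"

# Capturing groups: findall returns one tuple per match, one entry per group.
with_groups = re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)", sample)

# Non-capturing groups (?:...): findall returns the matched strings directly.
without_groups = re.findall(r"(?:\d+|[A-Za-z]+\d[\dA-Za-z]*)", sample)

print(with_groups)
print(without_groups)
```

<p>In both cases <code>510557XXXXXX2302</code> still comes back as two pieces, because the first alternative <code>\d+</code> stops at the first letter.</p>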
<p><strong>Edit</strong>:
I want the alphanumeric values whose length is 8 or more.</p>
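<p>For reference, a minimal sketch of one way to read the requirement. It assumes that a token is a run of ASCII letters and digits of length 8 or more, and that it must contain at least one digit. The digit rule is inferred from the expected output, where <code>stackoverflow</code>, <code>questions</code> and <code>sourceid</code> are absent. The function name <code>extract_tokens</code> and its <code>min_len</code> parameter are hypothetical.</p>

```python
import re

text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'


def extract_tokens(text, min_len=8):
    """Return alphanumeric runs of at least min_len characters that contain a digit."""
    # No capturing groups, so findall yields plain strings rather than tuples.
    runs = re.findall(r"[A-Za-z0-9]{%d,}" % min_len, text)
    # Assumption: letters-only runs such as 'stackoverflow' are not wanted.
    return [run for run in runs if any(ch.isdigit() for ch in run)]


tokens = extract_tokens(text)
print(" ".join(tokens))
```

<p>The pattern is a single character class with no groups, which sidesteps the tuple output altogether. The digit check is a separate step here for readability and could be folded into the regex with a lookahead.</p>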