Get two or more consecutive capitalized words from HTML text using regex

import re
import unittest
from bs4 import BeautifulSoup

html_page = """
<html>
<body>
<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
<tr><td>Vestibulum Auctor Dapibus neque</td></tr>
</table>
</body>
</html>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_page, "html.parser")
text = soup.get_text()

def get_sequences(page):
    # Raw string keeps the \s escapes intact
    ex = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')
    sequences = re.findall(ex, page)
    return sequences

print(get_sequences(text))

3 answers

User

#1 · edited 2024-04-26 11:35:59

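The approach is correct, but it is not specific enough. You are looking for two or more consecutive capitalized words within a single line, and \s also matches the newline between lines, so a match can run from the end of one line into the next. The regex should therefore be run on each line of the text separately. This does the trick: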

def get_sequences(page):
    # Raw string keeps the \s escapes intact
    ex = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')
    sequences = []
    # Match line by line so a sequence cannot span a line break
    for x in page.splitlines():
        sequences.extend(ex.findall(x))
    return sequences
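
To see why the per-line loop matters, here is a minimal sketch that runs the same regex over the whole text and then line by line. The `text` string is a hand-written stand-in for what `soup.get_text()` roughly returns for the question's table (an assumption; the real output also has extra blank lines), and the names `PATTERN`, `whole_text` and `per_line` are only illustrative.

```python
import re

# Same pattern as in the question, written as a raw string
PATTERN = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')

# Hypothetical stand-in for soup.get_text(): one table cell per line
text = (
    "Lorem Ipsum dolor Sit amet\n"
    "Consectetuer adipiscing elit\n"
    "Aliquam Tincidunt mauris eu Risus\n"
    "Vestibulum Auctor Dapibus neque\n"
)

# Whole text: \s also matches "\n", so "Risus" is glued to the next line
whole_text = PATTERN.findall(text)

# Per line: each match is confined to a single line
per_line = [m for line in text.splitlines() for m in PATTERN.findall(line)]

print(whole_text)  # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Risus\nVestibulum Auctor Dapibus']
print(per_line)    # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Vestibulum Auctor Dapibus']
```

Another option, if splitting into lines is not convenient, would be to replace `\s` with a literal space or `[ \t]` so the pattern itself cannot cross a newline.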

User
#2 · edited 2024-04-26 11:35:59

Python code:
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"[A-Z][a-z]+\s+[A-Z][a-z]+"

test_str = ("<html>\n"
    "<body>\n"
    "<table>\n"
    "<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>\n"
    "<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>\n"
    "<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>\n"
    "<tr><td>Vestibulum Auctor Dapibus neque</td></tr>\n"
    "</table>\n"
    "</body>\n"
    "</html>\n"
    "\"\"\"")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match in matches:
    print(match.group())

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Result:
Lorem Ipsum
Aliquam Tincidunt
Vestibulum Auctor
See: http://ideone.com/iQev8D
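
Note that this pattern matches exactly two words, which is why the last result is cut off at "Vestibulum Auctor" even though "Dapibus" is also capitalized. As a rough sketch, wrapping the second word in a repeated group should pick up longer runs as well. `TWO_OR_MORE` is a hypothetical variant, not part of the answer above.

```python
import re

# Pattern from the answer: one capitalized word, whitespace, one more
PAIR_ONLY = re.compile(r"[A-Z][a-z]+\s+[A-Z][a-z]+")

# Hypothetical variant: the trailing group may repeat, so runs of 3+ words match
TWO_OR_MORE = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+")

row = "<tr><td>Vestibulum Auctor Dapibus neque</td></tr>"

pair = PAIR_ONLY.findall(row)
full = TWO_OR_MORE.findall(row)

print(pair)  # ['Vestibulum Auctor']
print(full)  # ['Vestibulum Auctor Dapibus']
```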

User
#3 · edited 2024-04-26 11:35:59

You can use the following:

((?:[A-Z][a-z]+\s*){2,})

Example: https://regex101.com/r/EeS7F5/1
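
A quick sketch of how this pattern behaves in Python, assuming the table text has already been split into lines (the `lines` list below is hand-written for illustration). Two things are worth knowing: the `\s*` after each word means the match keeps trailing whitespace, so stripping the result is usually wanted, and because the whitespace is optional, a run-together string such as "LoremIpsum" also counts as two words.

```python
import re

# Pattern from the answer above; the outer group is what findall returns
PATTERN = re.compile(r'((?:[A-Z][a-z]+\s*){2,})')

# Hypothetical per-line text of the question's table cells
lines = [
    "Lorem Ipsum dolor Sit amet",
    "Consectetuer adipiscing elit",
    "Aliquam Tincidunt mauris eu Risus",
    "Vestibulum Auctor Dapibus neque",
]

# The raw match carries the space that followed the last capitalized word
raw = PATTERN.findall("Lorem Ipsum dolor")

# Strip each match to drop that trailing whitespace
found = [m.strip() for line in lines for m in PATTERN.findall(line)]

# Whitespace is optional (\s*), so CamelCase text matches too
camel = PATTERN.findall("LoremIpsum")

print(raw)    # ['Lorem Ipsum ']
print(found)  # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Vestibulum Auctor Dapibus']
print(camel)  # ['LoremIpsum']
```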

You can also modify your current regex and remove the lookahead.

See https://regex101.com/r/vViHXm/1
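
The linked pattern is not reproduced here, so the lookahead-free form below is an assumption about what it looks like. The reasoning is that the group following the lookahead already requires whitespace plus a capitalized word at the same position, which makes the lookahead redundant. A small check on hand-written sample lines:

```python
import re

# Original pattern from the question
WITH_LOOKAHEAD = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')

# Assumed simplified form: same pattern with the lookahead removed
WITHOUT_LOOKAHEAD = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)+')

# Hypothetical per-line text of the question's table cells
lines = [
    "Lorem Ipsum dolor Sit amet",
    "Consectetuer adipiscing elit",
    "Aliquam Tincidunt mauris eu Risus",
    "Vestibulum Auctor Dapibus neque",
]

# Both patterns should return identical matches on every line
same = all(
    WITH_LOOKAHEAD.findall(line) == WITHOUT_LOOKAHEAD.findall(line)
    for line in lines
)

simplified = [m for line in lines for m in WITHOUT_LOOKAHEAD.findall(line)]

print(same)        # True
print(simplified)  # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Vestibulum Auctor Dapibus']
```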
