A hyphen character causes problems when using regular expressions with BeautifulSoup

    import re

    # Get the first cell of row 19 from the table and extract its text
    # (data_collector is defined elsewhere in the asker's code)
    test = data_collector[19].find_all('td')[0]
    text = test.get_text()

    # Create and test the pattern
    pattern = re.compile(r'\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
    re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')

1 Answer

User

#1 · Posted on 2024-05-20 00:05:06

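I suggest extending the pattern to search for the most common hyphen-like characters, -, – and —, and changing the present part from a character class to a literal character sequence (a class such as [ Ppresent]* matches any run of those letters, so it would also accept sent):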

re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

Note that the re.I flag makes the regex match case-insensitively, so both present and Present are accepted.
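As a quick check, the sketch below runs the suggested pattern next to the original one. The sample sentence is adapted from the question; the en dash and em dash in it are assumptions made for illustration, since the actual scraped page text is not shown in the thread.

```python
import re

# Original pattern from the question: only the ASCII hyphen-minus is accepted,
# and [ Ppresent]* is a character class, not the word "present".
old_pattern = re.compile(
    r'\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')

# Suggested pattern: a 4-digit year, optionally followed by hyphen-like
# separators and either another year or the word "present".
new_pattern = re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

# Assumed sample: an en dash (–) and an em dash (—) replace the plain hyphens.
sample = ('This is Agent (1857), the years were (1987–1868), '
          'which lasted from (1678— Present)')

old_matches = old_pattern.findall(sample)
new_matches = new_pattern.findall(sample)
print(old_matches)  # only the single-year form is found
print(new_matches)  # all three parenthesised ranges are found

# The character class in the old pattern also accepts unintended words.
print(old_pattern.findall('(1678-sent)'))
print(new_pattern.findall('(1678-sent)'))
```

Because the new pattern contains only non-capturing groups, findall returns the full matched substrings rather than group contents.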

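Details

- \( : a ( character
- \d{4} : four digits ({4} is a limiting quantifier that repeats the pattern it modifies four times)
- (?:[\s–—-]+(?:\d{4}|present))? : an optional (because of the trailing ?) non-capturing (because of ?:) group matching
  - [\s–—-]+ : one or more whitespace characters, -, — or –
  - (?:\d{4}|present) : four digits or present
- \) : a ) character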
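The breakdown above can also be written directly into the pattern with the re.X (verbose) flag, which ignores unescaped whitespace outside character classes and allows # comments. This is only an alternative spelling of the same regex, sketched here for readability; the name verbose_pattern is made up for the example.

```python
import re

# Same regex as above, laid out with re.X so each part carries its own comment.
verbose_pattern = re.compile(r'''
    \(                      # a literal "("
    \d{4}                   # four digits
    (?:                     # optional non-capturing group:
        [\s–—-]+            #   one or more whitespace chars, -, – or —
        (?:\d{4}|present)   #   four digits or the word "present"
    )?
    \)                      # a literal ")"
''', re.I | re.X)

compact_pattern = re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

sample = '(1857) (1987-1868) (1678 – present)'
verbose_matches = verbose_pattern.findall(sample)
print(verbose_matches)
```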

If you want to match any kind of hyphen or dash, use [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\s]+ instead of [\s–—-]+.
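A minimal sketch of that substitution follows. The character class is copied from the text above; the test string, with a figure dash (U+2012) and a fullwidth hyphen-minus (U+FF0D), is a made-up example. Python's re module understands \uXXXX escapes inside str patterns, so the class can stay in a raw string.

```python
import re

# Broad class of Unicode dash-like characters plus whitespace, as listed above.
DASHES = (r'[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A'
          r'\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63'
          r'\uFF0D\s]+')

any_dash_pattern = re.compile(
    r'\(\d{4}(?:' + DASHES + r'(?:\d{4}|present))?\)', re.I)

# Hypothetical input: U+2012 in the first range, U+FF0D in the second.
sample = '(1987\u20121868) and (1678\uFF0Dpresent)'
any_dash_matches = any_dash_pattern.findall(sample)
print(any_dash_matches)
```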

Alternatively, to match one or more of any non-word characters at that position, except ( and ), use [^\w()]+:

re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I)
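The looser variant can be tried the same way. The separators in the sample below (~ and /) are invented for illustration; they show that this version accepts any punctuation between the years, which may or may not be desirable for a given page.

```python
import re

# Any run of characters that are neither word characters nor parentheses
# is accepted as the separator.
loose_pattern = re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I)

# Made-up sample: unusual separators still match, unrelated words do not.
sample = '(1857), (1987~1868), (1678 / Present), (1678-sent)'
loose_matches = loose_pattern.findall(sample)
print(loose_matches)
```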
