Cannot extract a string in Python using several special characters or patterns at once

'Region'
196    Boston (Boston University, Boston College, Bos...
197    Bridgewater (Bridgewater State College)[2]
198    Cambridge (Harvard University, Massachusetts I...
199    Chestnut Hill (Boston College)
200    The Colleges of Worcester Consortium:
201    Dudley (Nichols College)
240    Faribault, South Central College
241    Mankato (Minnesota State University, Mankato),...
242    Marshall (Southwest Minnesota State University...
243    Moorhead (Minnesota State University, Moorhead...
244    Morris (University of Minnesota Morris)[2]
245    Northfield (Carleton College, St. Olaf College...
246    North Mankato, South Central College
247    St. Cloud (St. Cloud State University, The Col...
248    St. Joseph (College of Saint Benedict)[2]
249    St. Peter (Gustavus Adolphus College)[2]

'RegionName'
196    Boston
197    Bridgewater
198    Cambridge
199    Chestnut Hill
200    The Colleges of Worcester Consortium
201    Dudley
240    Faribault
241    Mankato
242    Marshall
243    Moorhead
244    Morris
245    Northfield
246    North Mankato
247    St. Cloud
248    St. Joseph
249    St. Peter

196    Boston (Boston University, Boston College, Bos...
197    Bridgewater
198    Cambridge (Harvard University, Massachusetts I...
199    Chestnut Hill
200    The Colleges of Worcester Consortium
201    Dudley
240    Faribault
241    Mankato (Minnesota State University, Mankato)
242    Marshall
243    Moorhead (Minnesota State University, Moorhead
244    Morris
245    Northfield (Carleton College
246    North Mankato
247    St. Cloud (St. Cloud State University
248    St. Joseph
249    St. Peter

3 Answers

User

#1 · Edited 2024-06-17 10:50:49

Use this regular expression:

([\w\s.]+)(?<!\s)

If you do not care about trailing whitespace, you can drop the negative lookbehind (?<!\s) at the end.
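A minimal sketch of how this pattern might be applied with the standard `re` module. The sample strings come from the question; the helper name `leading_name` is made up for illustration. It uses `re.search`, so the first run of word characters, whitespace, and dots in each string is what gets returned.

```python
import re

# Pattern from the answer: a run of word characters, whitespace, or dots,
# followed by a negative lookbehind so the match cannot end in whitespace.
PATTERN = re.compile(r'([\w\s.]+)(?<!\s)')

def leading_name(text):
    """Return the first match of PATTERN in text, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = [
    'Boston (Boston University, Boston College, Bos...',
    'The Colleges of Worcester Consortium:',
    'Faribault, South Central College',
    'St. Cloud (St. Cloud State University, The Col...',
]
names = [leading_name(s) for s in samples]
print(names)
```

Because the character class is a whitelist, any name containing a character outside `\w`, whitespace, and `.` (a hyphen, for instance) would be cut short at that character.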

User

#2 · Edited 2024-06-17 10:50:49

You can extract only the leading run of zero or more characters other than a colon, a comma, or an opening parenthesis:

df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)

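If you are using Python 2.x, put (?u) at the start of the pattern so that the word boundary \b also matches at the correct positions in Unicode strings.

Details

^ - start of the string
([^:(,]*) - Group 1: zero or more (*) consecutive characters other than :, ( and , (the [^...] construct is a negated character class).
\b - a word boundary.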
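To see what the trailing `\b` contributes, the pattern can be run with and without it on one of the question's strings. This is only an illustrative sketch using the standard `re` module; the variable names are arbitrary.

```python
import re

text = 'St. Cloud (St. Cloud State University, The Col...'

# Without \b, the negated class also consumes the space before "(".
without_boundary = re.match(r'^([^:(,]*)', text).group(1)

# With \b, the engine backtracks to the last word boundary,
# which drops that trailing space.
with_boundary = re.match(r'^([^:(,]*)\b', text).group(1)

print(repr(without_boundary))
print(repr(with_boundary))
```

One consequence of this design: a name that ends in a non-word character, such as a trailing period, would lose that character too, because the match must stop right after a word character.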

See the regex demo and the Python 3 demo below:

>>> from pandas import DataFrame
>>> import pandas as pd
>>> item_list = ['Boston (Boston University, Boston College, Bos...','Bridgewater (Bridgewater State College)[2]','Cambridge (Harvard University, Massachusetts I...','Chestnut Hill (Boston College)','The Colleges of Worcester Consortium:','Dudley (Nichols College)','Faribault, South Central College','Mankato (Minnesota State University, Mankato),...','Marshall (Southwest Minnesota State University...','Moorhead (Minnesota State University, Moorhead...','Morris (University of Minnesota Morris)[2]','Northfield (Carleton College, St. Olaf College...','North Mankato, South Central College','St. Cloud (St. Cloud State University, The Col...','St. Joseph (College of Saint Benedict)[2]','St. Peter (Gustavus Adolphus College)[2]']
>>> df = pd.DataFrame(item_list, columns=['Region'])
>>> df['RegionName'] = df['Region'].str.extract(r'^([^:(,]*)\b', expand=False)
>>> df[['RegionName']]
                              RegionName  
0                                 Boston  
1                            Bridgewater  
2                              Cambridge  
3                          Chestnut Hill  
4   The Colleges of Worcester Consortium  
5                                 Dudley  
6                              Faribault  
7                                Mankato  
8                               Marshall  
9                               Moorhead  
10                                Morris  
11                            Northfield  
12                         North Mankato  
13                             St. Cloud  
14                            St. Joseph  
15                             St. Peter  
>>>

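User

#3 · Edited 2024-06-17 10:50:49

Since there are only three possible delimiters, you can chain split() calls: when the delimiter is not found, split returns a one-element list holding the unmodified string, so taking [0] is always safe.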

>>> s = """196    Boston (Boston University, Boston College, Bos...
... 197           Bridgewater (Bridgewater State College)[2]
... 198    Cambridge (Harvard University, Massachusetts I...
... 199                       Chestnut Hill (Boston College)
... 200                The Colleges of Worcester Consortium:
... 201                             Dudley (Nichols College)
... 240                     Faribault, South Central College
... 241    Mankato (Minnesota State University, Mankato),...
... 242    Marshall (Southwest Minnesota State University...
... 243    Moorhead (Minnesota State University, Moorhead...
... 244           Morris (University of Minnesota Morris)[2]
... 245    Northfield (Carleton College, St. Olaf College...
... 246                 North Mankato, South Central College
... 247    St. Cloud (St. Cloud State University, The Col...
... 248            St. Joseph (College of Saint Benedict)[2]
... 249             St. Peter (Gustavus Adolphus College)[2]"""
>>> for i in s.split('\n'):
...    number, text = i.split('(')[0].split(',')[0].split(':')[0].split(' ',1)
...    print('{} {}'.format(number, text.strip()))
...
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter

The same transformation can also be applied to the strings of a pandas column.
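One way to package the chained split for reuse is as a small function. The name `region_name` is hypothetical, and the sample strings are taken from the question; this is a sketch, not necessarily what the answer's author had in mind.

```python
def region_name(text):
    """Cut text at the first '(', ',' or ':' and trim whitespace (hypothetical helper)."""
    # split() returns a one-element list when the delimiter is absent,
    # so taking [0] is safe at every step of the chain.
    return text.split('(')[0].split(',')[0].split(':')[0].strip()

regions = [
    'Bridgewater (Bridgewater State College)[2]',
    'The Colleges of Worcester Consortium:',
    'North Mankato, South Central College',
    'Mankato (Minnesota State University, Mankato),...',
]
result = [region_name(r) for r in regions]
print(result)
```

Assuming the data sits in a `Region` column as in the question, something like `df['RegionName'] = df['Region'].apply(region_name)` should produce the same column as the regex answer.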
