How do I extract a specific part of a DataFrame column and fill it into another column?
I have a DataFrame that looks like this:
section_id section_name
1.Test Summary9
1.1.Synopsis9
1.2.Schema12
1.3.1.Test Period I - Screening13
1.3.2.Period II - obes-Treatment 15
Synopsis
Test Period I - Screening
As you can see, it also contains blank rows. I want to fill in the section_id column like this:
section_id section_name
1 1.Test Summary9
1.1 1.1.Synopsis9
1.2 1.2.Schema12
1.3.1 1.3.1.Test Period I - Screening13
1.3.2 1.3.2.Period II - obes-Treatment 15
1.1 Synopsis
1.3.1 Test Period I - Screening
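Basically, if section_name starts with a section id, extract it and fill it in. If there is a similar section_name (for example, 1.1.Synopsis9 and Synopsis), fill in the same section id. Blank values should be left untouched.
I tried the following, but it does not work in some cases. Please suggest a better approach: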
import pandas as pd

data = {
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period I - Screening'
    ]
}
df = pd.DataFrame(data)

def extract_section_id(section_name, current_section_id):
    if section_name.startswith(current_section_id):
        return current_section_id
    else:
        return section_name.split('.')[0]

current_section_id = ''
section_ids = []
for index, row in df.iterrows():
    section_name = row['section_name'].strip()
    if section_name != '':
        section_id = extract_section_id(section_name, current_section_id)
        current_section_id = section_id
    else:
        section_id = ''
    section_ids.append(section_id)
df['section_id'] = section_ids
print(df)
1 Answer
You can do it like this:
import re

import numpy as np
import pandas as pd

data = {
    'section_id': [np.nan] * 9,
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period I - Screening',
        None,
        ''
    ]
}
df1 = pd.DataFrame(data)

def extract_section_id(row, prev_ids):
    name = row['section_name']
    # Missing or blank names get no id (pd.isna covers both None and NaN).
    if pd.isna(name) or name.strip() == '':
        return np.nan
    # Group 1: leading dotted number; group 2: the rest of the heading.
    match = re.match(r'(\d+(?:\.\d+)*)(.*)', name)
    if match:
        sec_id, sec_name = match.groups()
        # Remember the heading text so later unnumbered rows can find it.
        prev_ids[sec_name.strip()] = sec_id
        return sec_id
    # No leading number: reuse the id of an earlier heading containing this name.
    for seen_name, seen_id in prev_ids.items():
        if name.strip() in seen_name:
            return seen_id
    return np.nan

prev_ids = {}
df1['section_id'] = df1.apply(lambda row: extract_section_id(row, prev_ids), axis=1)
print(df1)
This gives:
section_id section_name
0 1 1.Test Summary9
1 1.1 1.1.Synopsis9
2 1.2 1.2.Schema12
3 1.3.1 1.3.1.Test Period I - Screening13
4 1.3.2 1.3.2.Period II - obes-Treatment 15
5 1.1 Synopsis
6 1.3.1 Test Period I - Screening
7 NaN None
8 NaN
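To see why the lookup for unnumbered rows works, it may help to inspect what the regular expression captures. The following is a small sketch that is not part of the original answer; `split_heading` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as in the answer: group 1 is the dotted section number,
# group 2 is everything that follows it.
PATTERN = re.compile(r'(\d+(?:\.\d+)*)(.*)')

def split_heading(text):
    """Return (section_id, remainder), or None if text has no leading number."""
    match = PATTERN.match(text)
    return match.groups() if match else None

print(split_heading('1.3.1.Test Period I - Screening13'))
# ('1.3.1', '.Test Period I - Screening13')
print(split_heading('Synopsis'))
# None

# The remainder keeps the leading dot and the trailing page number,
# so an equality test would fail and a substring test is used instead.
print('Synopsis' in '.Synopsis9')   # True

# Caveat: a substring test can also hit headings that were not intended.
print('Test' in '.Test Summary9')   # True
```

Because the match is a substring test, a short name can pick up the id of the first stored heading that happens to contain it. If that matters for your data, consider normalizing the names and comparing them exactly.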
If you really want a blank result instead of NaN:
import re

import numpy as np
import pandas as pd

data = {
    'section_id': [np.nan] * 9,
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period I - Screening',
        None,
        ''
    ]
}
df1 = pd.DataFrame(data)

def extract_section_id(row, prev_ids):
    name = row['section_name']
    # Missing or blank names get an empty string instead of NaN.
    if pd.isna(name) or name.strip() == '':
        return ''
    # Group 1: leading dotted number; group 2: the rest of the heading.
    match = re.match(r'(\d+(?:\.\d+)*)(.*)', name)
    if match:
        sec_id, sec_name = match.groups()
        # Remember the heading text so later unnumbered rows can find it.
        prev_ids[sec_name.strip()] = sec_id
        return sec_id
    # No leading number: reuse the id of an earlier heading containing this name.
    for seen_name, seen_id in prev_ids.items():
        if name.strip() in seen_name:
            return seen_id
    return np.nan

prev_ids = {}
df1['section_id'] = df1.apply(lambda row: extract_section_id(row, prev_ids), axis=1)
print(df1)
This gives:
section_id section_name
0 1 1.Test Summary9
1 1.1 1.1.Synopsis9
2 1.2 1.2.Schema12
3 1.3.1 1.3.1.Test Period I - Screening13
4 1.3.2 1.3.2.Period II - obes-Treatment 15
5 1.1 Synopsis
6 1.3.1 Test Period I - Screening
7 None
8
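As a possible alternative to the row-by-row `apply`, the same idea can be sketched with pandas string methods. This is only a sketch under two assumptions that the question does not confirm: the trailing digits in a heading are page numbers that can be dropped, and an unnumbered name matches a numbered heading exactly once those digits are removed. The names `parts`, `numbered`, `lookup`, `sec_id` and `title` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'section_name': [
    '1.Test Summary9',
    '1.1.Synopsis9',
    '1.2.Schema12',
    '1.3.1.Test Period I - Screening13',
    '1.3.2.Period II - obes-Treatment 15',
    'Synopsis',
    'Test Period I - Screening',
    None,
    '',
]})

# Split every heading into an optional leading dotted number and a title.
# Assumption: digits at the very end are page numbers, so they are dropped.
parts = df['section_name'].str.extract(
    r'^\s*(?:(?P<sec_id>\d+(?:\.\d+)*)\.)?\s*(?P<title>.*?)\s*\d*\s*$'
)

# Build a title -> id lookup from the rows that carry a number.
numbered = parts.dropna(subset=['sec_id'])
lookup = dict(zip(numbered['title'], numbered['sec_id']))

# Use the extracted id where present, otherwise look the title up.
# Missing and blank names find no entry and stay NaN.
df['section_id'] = parts['sec_id'].fillna(parts['title'].map(lookup))
print(df[['section_id', 'section_name']])
```

Exact title matching avoids the false positives that a substring test can produce, at the cost of missing names that differ slightly from the numbered heading. A title that legitimately ends in a digit would also lose that digit under the page-number assumption.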