How do I extract a specific part of a DataFrame column and fill it into another column?

Asked 2025-04-14 16:13

I have a DataFrame that looks like this:

section_id  section_name
            1.Test Summary9
            1.1.Synopsis9
            1.2.Schema12
            1.3.1.Test Period  I - Screening13
            1.3.2.Period II - obes-Treatment 15
            Synopsis

            Test Period  I - Screening

As you can see, it also contains blank rows. I want to fill the section_id column so that it looks like this:

section_id  section_name
1           1.Test Summary9
1.1         1.1.Synopsis9
1.2         1.2.Schema12
1.3.1       1.3.1.Test Period  I - Screening13
1.3.2       1.3.2.Period II - obes-Treatment 15
1.1         Synopsis
1.3.1       Test Period  I - Screening

Basically, if section_name starts with a section ID, extract that ID and fill it in. If a similar section_name exists (for example, 1.1.Synopsis9 and Synopsis), fill in the same section ID. Blank values should be left untouched.

I tried the following, but it does not work in some cases. Please suggest a better approach:

import pandas as pd

data = {
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period  I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period  I - Screening'
    ]
}

df = pd.DataFrame(data)

def extract_section_id(section_name, current_section_id):
    if section_name.startswith(current_section_id):
        return current_section_id
    else:
        return section_name.split('.')[0]

current_section_id = ''
section_ids = []

for index, row in df.iterrows():
    section_name = row['section_name'].strip()
    if section_name != '':
        section_id = extract_section_id(section_name, current_section_id)
        current_section_id = section_id
    else:
        section_id = ''
    section_ids.append(section_id)

df['section_id'] = section_ids

print(df)

1 Answer

You can do it like this:

import pandas as pd
import numpy as np
import re

data = {
    'section_id': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period  I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period  I - Screening',
        None,
        ''
    ]
}
df1 = pd.DataFrame(data)

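def extract_section_id(row, prev_ids):
    name = row['section_name']
    # Skip missing or blank names (pd.isna covers both None and NaN).
    if pd.isna(name) or name.strip() == '':
        return np.nan
    name = name.strip()

    # A leading dotted number such as "1.3.1" is taken as the section ID.
    match = re.match(r'(\d+(?:\.\d+)*)(.*)', name)
    if match:
        sec_id, sec_name = match.groups()
        # Remember the remainder so later unnumbered rows can be matched against it.
        prev_ids[sec_name.strip()] = sec_id
        return sec_id

    # No numeric prefix: reuse the ID of the first stored name that contains this one.
    for stored_name, id_ in prev_ids.items():
        if name in stored_name:
            return id_
    return np.nan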

prev_ids = {}

df1['section_id'] = df1.apply(lambda row: extract_section_id(row, prev_ids), axis=1)

print(df1)

This gives:

  section_id                         section_name
0          1                      1.Test Summary9
1        1.1                        1.1.Synopsis9
2        1.2                         1.2.Schema12
3      1.3.1   1.3.1.Test Period  I - Screening13
4      1.3.2  1.3.2.Period II - obes-Treatment 15
5        1.1                             Synopsis
6      1.3.1           Test Period  I - Screening
7        NaN                                 None
8        NaN                                     
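
To see why this works, it may help to look at the two building blocks in isolation. The sketch below uses the same regular expression as above; the helper name split_section is only for illustration and is not part of the code above. It also shows what most likely goes wrong in the loop from the question: current_section_id starts out as an empty string, and every string starts with the empty string, so the first branch is always taken and the ID never changes from ''.

```python
import re

# Same pattern as in the answer: a dotted number prefix, then everything else.
SECTION_RE = re.compile(r'(\d+(?:\.\d+)*)(.*)')

def split_section(name):
    """Return (section_id, remainder), or None if there is no numeric prefix."""
    match = SECTION_RE.match(name)
    return match.groups() if match else None

print(split_section('1.1.Synopsis9'))   # ('1.1', '.Synopsis9')
print(split_section('Synopsis'))        # None

# The unnumbered name is a substring of the stored remainder,
# which is what the fallback loop relies on.
print('Synopsis' in split_section('1.1.Synopsis9')[1])   # True

# The check used in the question is always True on the first row,
# because current_section_id is '' at that point.
print('1.Test Summary9'.startswith(''))   # True
```

Note that the fallback is a substring test, so a short name such as "Test" would match the first stored entry that contains it, which may not be the one you intend.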

If you really want blank values instead of NaN, return an empty string:

import pandas as pd
import numpy as np
import re

data = {
    'section_id': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period  I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period  I - Screening',
        None,
        ''
    ]
}
df1 = pd.DataFrame(data)

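def extract_section_id(row, prev_ids):
    name = row['section_name']
    # Missing or blank names get an empty string instead of NaN.
    if pd.isna(name) or name.strip() == '':
        return ''
    name = name.strip()

    # A leading dotted number such as "1.3.1" is taken as the section ID.
    match = re.match(r'(\d+(?:\.\d+)*)(.*)', name)
    if match:
        sec_id, sec_name = match.groups()
        # Remember the remainder so later unnumbered rows can be matched against it.
        prev_ids[sec_name.strip()] = sec_id
        return sec_id

    # No numeric prefix: reuse the ID of the first stored name that contains this one.
    for stored_name, id_ in prev_ids.items():
        if name in stored_name:
            return id_
    # Nothing matched: also return a blank so the column stays free of NaN.
    return ''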

prev_ids = {}

df1['section_id'] = df1.apply(lambda row: extract_section_id(row, prev_ids), axis=1)

print(df1)

This gives:

  section_id                         section_name
0          1                      1.Test Summary9
1        1.1                        1.1.Synopsis9
2        1.2                         1.2.Schema12
3      1.3.1   1.3.1.Test Period  I - Screening13
4      1.3.2  1.3.2.Period II - obes-Treatment 15
5        1.1                             Synopsis
6      1.3.1           Test Period  I - Screening
7                                            None
8                                                
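
If the row-by-row apply becomes slow on a larger frame, a vectorized variant is possible. The following is only a sketch under two assumptions that are not stated in the question: trailing digits in section_name are page numbers that can be dropped, and unnumbered rows should match a numbered title exactly rather than as a substring. The group names sec_id and title are my own.

```python
import pandas as pd

df = pd.DataFrame({'section_name': [
    '1.Test Summary9', '1.1.Synopsis9', '1.2.Schema12',
    '1.3.1.Test Period  I - Screening13',
    '1.3.2.Period II - obes-Treatment 15',
    'Synopsis', 'Test Period  I - Screening', None, '',
]})

# Normalise: missing values become '', surrounding whitespace is removed.
names = df['section_name'].fillna('').str.strip()

# Split "<id>.<title><page>" into id and title.
# Assumption: trailing digits are a page number, not part of the title.
parts = names.str.extract(
    r'^(?P<sec_id>\d+(?:\.\d+)*)\.?\s*(?P<title>.*?)\s*\d*$'
)

# Build a title -> id lookup from the numbered rows only.
numbered = parts.dropna(subset=['sec_id'])
numbered = numbered[numbered['title'] != '']
lookup = numbered.drop_duplicates('title').set_index('title')['sec_id']

# Numbered rows keep their own id; unnumbered rows look up their exact title.
df['section_id'] = parts['sec_id'].fillna(names.map(lookup))

print(df)
```

On the sample data this fills rows 0 to 6 with the same IDs as above and leaves the last two rows as NaN; append .fillna('') to the final assignment if you prefer blanks. A title that genuinely ends in a digit would lose that digit under the page-number assumption, so adjust the pattern if that can occur in your data.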
