How do I extract a specific part of a DataFrame column and fill it into another column?

Asked 2025-04-14 16:13

I have a DataFrame that looks like this:

section_id  section_name
            1.Test Summary9
            1.1.Synopsis9
            1.2.Schema12
            1.3.1.Test Period  I - Screening13
            1.3.2.Period II - obes-Treatment 15
            Synopsis

            Test Period  I - Screening

As you can see, it also contains blank rows. I want to fill the section_id column so that it looks like this:

section_id  section_name
1           1.Test Summary9
1.1         1.1.Synopsis9
1.2         1.2.Schema12
1.3.1       1.3.1.Test Period  I - Screening13
1.3.2       1.3.2.Period II - obes-Treatment 15
1.1         Synopsis
1.3.1       Test Period  I - Screening

Basically, if section_name starts with a section ID, extract it and fill it in. If a similar section_name exists (for example, 1.1.Synopsis9 and Synopsis), fill in the same section_id. Blank values should be left untouched.

I tried the following, but it does not work in some cases. Please suggest a better approach:

import pandas as pd

data = {
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period  I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period  I - Screening'
    ]
}

df = pd.DataFrame(data)

def extract_section_id(section_name, current_section_id):
    if section_name.startswith(current_section_id):
        return current_section_id
    else:
        return section_name.split('.')[0]

current_section_id = ''
section_ids = []

for index, row in df.iterrows():
    section_name = row['section_name'].strip()
    if section_name != '':
        section_id = extract_section_id(section_name, current_section_id)
        current_section_id = section_id
    else:
        section_id = ''
    section_ids.append(section_id)

df['section_id'] = section_ids

print(df)


1 Answer

You can do it like this:

import pandas as pd
import numpy as np
import re

data = {
    'section_id': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period  I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period  I - Screening',
        None,
        ''
    ]
}
df1 = pd.DataFrame(data)

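def extract_section_id(row, prev_ids):
    name = row['section_name']
    # Blank or missing names (None/NaN/empty string) get no ID.
    if pd.isna(name) or name.strip() == '':
        return np.nan

    # A leading dotted number such as "1.3.1" is the section ID.
    match = re.match(r'(\d+(?:\.\d+)*)(.*)', name)
    if match:
        sec_id, sec_name = match.groups()
        # Remember the rest of the name so unnumbered rows can be matched later.
        prev_ids[sec_name.strip()] = sec_id
        return sec_id
    # No leading number: reuse the ID of an earlier name that contains this one.
    for known_name, id_ in prev_ids.items():
        if name.strip() in known_name:
            return id_
    return np.nan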

prev_ids = {}

df1['section_id'] = df1.apply(lambda row: extract_section_id(row, prev_ids), axis=1)

print(df1)

This gives:

  section_id                         section_name
0          1                      1.Test Summary9
1        1.1                        1.1.Synopsis9
2        1.2                         1.2.Schema12
3      1.3.1   1.3.1.Test Period  I - Screening13
4      1.3.2  1.3.2.Period II - obes-Treatment 15
5        1.1                             Synopsis
6      1.3.1           Test Period  I - Screening
7        NaN                                 None
8        NaN
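
To see why the lookup for unnumbered rows works, it may help to look at what the regular expression actually captures. The following is a small standalone sketch; the helper name `split_section` is made up for illustration and is not part of the code above.

```python
import re

# Same pattern as in the answer: a dotted number, then the rest of the string.
PATTERN = re.compile(r'(\d+(?:\.\d+)*)(.*)')

def split_section(name):
    """Hypothetical helper: return (section_id, remainder), or None if there is no leading number."""
    match = PATTERN.match(name)
    return match.groups() if match else None

print(split_section('1.3.1.Test Period  I - Screening13'))
# ('1.3.1', '.Test Period  I - Screening13')
print(split_section('Synopsis'))
# None
```

The remainder keeps the leading dot and the trailing page number, which is why the answer uses a substring check (`in`) rather than equality. The trade-off is that a short name such as `Test` would match the first stored name that contains it, so this approach assumes the unnumbered names are distinctive enough.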

If you really want a blank result instead of NaN:

import pandas as pd
import numpy as np
import re

data = {
    'section_id': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.2.Schema12',
        '1.3.1.Test Period  I - Screening13',
        '1.3.2.Period II - obes-Treatment 15',
        'Synopsis',
        'Test Period  I - Screening',
        None,
        ''
    ]
}
df1 = pd.DataFrame(data)

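def extract_section_id(row, prev_ids):
    name = row['section_name']
    # Blank or missing names (None/NaN/empty string) get an empty ID.
    if pd.isna(name) or name.strip() == '':
        return ''

    # A leading dotted number such as "1.3.1" is the section ID.
    match = re.match(r'(\d+(?:\.\d+)*)(.*)', name)
    if match:
        sec_id, sec_name = match.groups()
        # Remember the rest of the name so unnumbered rows can be matched later.
        prev_ids[sec_name.strip()] = sec_id
        return sec_id
    # No leading number: reuse the ID of an earlier name that contains this one.
    for known_name, id_ in prev_ids.items():
        if name.strip() in known_name:
            return id_
    # Nothing matched: stay consistent with the blank case.
    return ''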

prev_ids = {}

df1['section_id'] = df1.apply(lambda row: extract_section_id(row, prev_ids), axis=1)

print(df1)

This gives:

  section_id                         section_name
0          1                      1.Test Summary9
1        1.1                        1.1.Synopsis9
2        1.2                         1.2.Schema12
3      1.3.1   1.3.1.Test Period  I - Screening13
4      1.3.2  1.3.2.Period II - obes-Treatment 15
5        1.1                             Synopsis
6      1.3.1           Test Period  I - Screening
7                                            None
8
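
If the row-by-row `apply` becomes slow on a large frame, one possible alternative is to do the extraction with pandas string methods. This is only a sketch under two assumptions that the question does not confirm: the trailing digits of a numbered name are page numbers that can be dropped, and an unnumbered name equals the cleaned title exactly. The names `parts`, `numbered`, `known`, and `fallback` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'section_name': [
        '1.Test Summary9',
        '1.1.Synopsis9',
        '1.3.1.Test Period  I - Screening13',
        'Synopsis',
        'Test Period  I - Screening',
        None,
        '',
    ]
})

# Split numbered rows into an ID and a title.
# Assumption: trailing digits are page numbers, so they are stripped from the title.
parts = df['section_name'].str.extract(
    r'^(?P<sec_id>\d+(?:\.\d+)*)\.?\s*(?P<title>.*?)\s*\d*$'
)

# Build a title -> ID lookup from the numbered rows (first occurrence wins).
numbered = parts[parts['sec_id'].notna() & (parts['title'] != '')]
known = numbered.drop_duplicates('title').set_index('title')['sec_id']

# Unnumbered rows: look up the stripped name as an exact title.
fallback = df['section_name'].str.strip().map(known)

# Prefer the extracted ID, otherwise the looked-up one; blanks stay NaN.
df['section_id'] = parts['sec_id'].fillna(fallback)
print(df)
```

Appending `.fillna('')` to the last assignment would give empty strings instead of NaN. Unlike the substring check above, exact matching will not pick up partial names, so which variant fits depends on how clean the real data is.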

Answered 2025-04-14 by Python大师
