Merging consecutive rows that share the same column value

14 votes

2 answers

13013 views

Asked 2025-04-18 16:08

I have some data that looks like this.

How do I get from this:

    0             d
0   The         DT
1   Skoll       ORGANIZATION
2   Foundation  ORGANIZATION
3   ,           ,
4   based       VBN
5   in          IN
6   Silicon     LOCATION
7   Valley      LOCATION

to this:

    0                       d
0   The                     DT
1   Skoll Foundation        ORGANIZATION
3   ,                       ,
4   based                   VBN
5   in                      IN
6   Silicon Valley          LOCATION

Tags: data processing, data cleaning, row merging

2 Answers

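I actually think @chrisb's groupby approach is the better one, although if non-consecutive repeated values can occur, you would need to build an extra groupby key variable to track them. For a small problem, though, the approach below is a quick fix.

In this case I think a plain iterator is simpler than trying to use pandas functions. I can imagine a groupby-based solution, but keeping the runs consecutive becomes difficult if a value in the second column repeats later on.

This could probably be simplified further; here is an example: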

from pandas import DataFrame

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',', 
                      'based', 'in', 'Silicon', 'Valley'], 
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 
                      'Location', 'Location']})

# Initialize result lists with the first row of df
result1 = [df['a'].iloc[0]]
result2 = [df['b'].iloc[0]]

# Use zip() to iterate over the two columns of df simultaneously,
# making sure to skip the first row which is already added
for a, b in zip(df['a'].iloc[1:], df['b'].iloc[1:]):
    if b == result2[-1]:        # If b matches the last value in result2,
        result1[-1] += " " + a  # add a to the last value of result1
    else:  # Otherwise add a new row with the values
        result1.append(a)
        result2.append(b)

# Create a new dataframe using these result lists
df = DataFrame({'a': result1, 'b': result2})
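As a possible simplification of the loop above, the standard library's `itertools.groupby` groups only adjacent items with equal keys, which is exactly the "consecutive rows" behaviour wanted here. This is a minimal sketch rather than the answerer's code; the helper name `merge_runs` is hypothetical, and it works on plain lists instead of a DataFrame.

```python
from itertools import groupby


def merge_runs(words, tags):
    """Join consecutive words that share the same tag (hypothetical helper)."""
    merged = []
    # itertools.groupby only groups *adjacent* items with equal keys,
    # so a tag that reappears later starts a new, separate run.
    for tag, run in groupby(zip(words, tags), key=lambda pair: pair[1]):
        merged.append((" ".join(word for word, _ in run), tag))
    return merged


words = ['The', 'Skoll', 'Foundation', ',', 'based', 'in', 'Silicon', 'Valley']
tags = ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 'Location', 'Location']
result = merge_runs(words, tags)
```

The resulting list of `(text, tag)` pairs can be passed to `DataFrame(result, columns=['a', 'b'])` if a DataFrame is needed.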

Answered 2025-04-18 by Python大师


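@rfan's answer certainly works; as an alternative, here is an approach using pandas' groupby.

.groupby() groups the data by column 'b', and sort=False keeps the groups in their original order of appearance. .apply() then applies a function to the 'a' values of each group; here, it joins the strings with a space.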

In [67]: df.groupby('b', sort=False)['a'].apply(' '.join)
Out[67]: 

b
DT                       The
Org         Skoll Foundation
,                          ,
VBN                    based
IN                        in
Location      Silicon Valley
Name: a, dtype: object
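One caveat is worth sketching: grouping on 'b' alone merges every row that shares a tag, whether or not the rows are adjacent. The example below uses made-up sample data with a repeated, non-consecutive 'Org' tag to show the effect; the variable names are illustrative only.

```python
import pandas as pd

# Sample data where 'Org' appears in two separate runs (illustrative only)
df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'A', 'Foundation'],
    'b': ['DT', 'Org', 'Org', ',', 'Org', 'Org'],
})

# Grouping by the tag alone pools both 'Org' runs into a single group
pooled = df.groupby('b', sort=False)['a'].apply(' '.join)
org_text = pooled.loc['Org']
```

Here `org_text` contains the words from both 'Org' runs joined together, which is not what the question asks for when the runs are separated by other rows.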

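Edit:

To handle the more general case (repeated, non-consecutive values), one approach is to first add a sentinel column that tracks which run of consecutive rows each row belongs to, like this: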

df['key'] = (df['b'] != df['b'].shift(1)).astype(int).cumsum()
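To see why this expression works, it can help to break it into steps on a tiny made-up column. The comparison with the shifted column is True wherever a new run starts, and the cumulative sum turns those run starts into increasing run numbers. The intermediate name `starts_new_run` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'b': ['DT', 'Org', 'Org', ',', 'Org']})

# shift(1) moves every value down one row, so each row is compared with
# the row above it. The first row is compared with a missing value,
# which counts as "different", so it always starts a run.
starts_new_run = df['b'] != df['b'].shift(1)

# Cumulative sum of the run-start flags gives one id per consecutive run
df['key'] = starts_new_run.astype(int).cumsum()
keys = df['key'].tolist()
```

The two adjacent 'Org' rows share one key, while the later 'Org' row gets a key of its own.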

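Then add this key to the groupby, so that it works even with repeated values. For example, with this sample data containing a repeated value (the key column has to be computed on this df after it is created):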

df = DataFrame({'a': ['The', 'Skoll', 'Foundation', ',', 
                      'based', 'in', 'Silicon', 'Valley', 'A', 'Foundation'], 
                'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN', 
                      'Location', 'Location', 'Org', 'Org']})

Applying the groupby:

In [897]: df.groupby(['key', 'b'])['a'].apply(' '.join)
Out[897]: 
key  b       
1    DT                       The
2    Org         Skoll Foundation
3    ,                          ,
4    VBN                    based
5    IN                        in
6    Location      Silicon Valley
7    Org             A Foundation
Name: a, dtype: object
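If a flat two-column DataFrame like the one in the question is wanted instead of a Series with a MultiIndex, one option is to group on the run key alone and aggregate both columns. This is a sketch under that assumption, not part of the original answer; `run_id` and `merged` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['The', 'Skoll', 'Foundation', ',', 'based', 'in',
          'Silicon', 'Valley', 'A', 'Foundation'],
    'b': ['DT', 'Org', 'Org', ',', 'VBN', 'IN',
          'Location', 'Location', 'Org', 'Org'],
})

# One id per run of consecutive equal tags
run_id = (df['b'] != df['b'].shift(1)).cumsum().rename('run_id')

# Join the words of each run and keep the run's (single) tag
merged = (
    df.groupby(run_id, sort=False)
      .agg(a=('a', ' '.join), b=('b', 'first'))
      .reset_index(drop=True)
)
```

Unlike the desired output in the question, this renumbers the rows from 0 rather than keeping the index of the first row in each run.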

Answered 2025-04-18 by Python大师

