Python: differences between string columns in Excel

Posted 2024-06-16 10:13:29

| Text1                          | Text2                   | Change         |
|:-------------------------------|:------------------------|:---------------|
| This is Mango. This is Banana  | This is Banana          | This is Mango. |
| This is Mango.                 | This is Mango, Banana   | , Banana       |

I want to derive the Change column from Text1 and Text2 as shown above. The table above is Excel data / a DataFrame.

The code below works well for plain text, but not for a DataFrame.

import difflib

# Define the original text, taken from https://en.wikipedia.org/wiki/Internet_Information_Services
original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]

# Define the edited text
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

# Create a Differ object
d = difflib.Differ()

# Compute the difference between the two texts
diff = d.compare(original, edited)

# Print the result
print('\n'.join(diff))

=> Output:

 python comparing-strings-difflib.py
  About the IIS
  
- IIS 8.5 has several improvements related
?  ^^^^^^

+ It has several improvements related
?  ^

- to performance in large-scale scenarios, such
?                                        ^^^^^^

+ to performance in large-scale scenarios.
?                       

1 answer
User
Answer #1 · posted 2024-06-16 10:13:29

difflib was created for comparing strings (especially source code).

You should use

df['Change'] = df.apply(function_name, axis=1) 

to run your own function on each row and compare the two texts in that row.
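As a minimal sketch of how `df.apply(..., axis=1)` hands each row to a function, the example below uses the two rows from the question. The function name `length_difference` and the column `LenDiff` are made up for illustration; they only show that the function receives one row at a time as a pandas Series.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Text1": ["This is Mango. This is Banana", "This is Mango."],
    "Text2": ["This is Banana", "This is Mango, Banana"],
})

def length_difference(row):
    # With axis=1, `row` is a Series holding one row; columns are read by name
    return len(row["Text2"]) - len(row["Text1"])

# Hypothetical helper column, only to show the row-wise call
df["LenDiff"] = df.apply(length_difference, axis=1)

print(df)
```

Any function that takes a row and returns a single value can be plugged in the same way, including one that compares the two texts with difflib.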

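However, Differ() is not useful for extracting the changes, because it returns its result as text.
Use SequenceMatcher instead; its get_opcodes() method returns each change as a tuple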

(operation, text1_start, text1_end, text2_start, text2_end) 
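To make the tuple format concrete, here is a small sketch that prints the opcodes for the second row of the question's table. The variable names are only illustrative.

```python
import difflib

text1 = "This is Mango."
text2 = "This is Mango, Banana"

sm = difflib.SequenceMatcher(a=text1, b=text2)
opcodes = sm.get_opcodes()

# Each opcode is (operation, text1_start, text1_end, text2_start, text2_end)
for tag, a1, a2, b1, b2 in opcodes:
    print(tag, repr(text1[a1:a2]), repr(text2[b1:b2]))
```

For this pair the matcher reports the shared prefix 'This is Mango' as equal, followed by one replace of '.' with ', Banana'. The possible operations are equal, replace, delete and insert.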

To build the DataFrame, I had to pad the Text2 column with the missing rows, using None to mark them.

diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [None]*diff_row_number
elif diff_row_number < 0:    
    original = original + [None]*(-diff_row_number)

If you pad with the empty string "" instead, the if text2 is None: branch is not needed.
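The padding step could also be written with itertools.zip_longest from the standard library. This is an alternative sketch, not the code used in this answer, and the short lists below are made-up sample data.

```python
from itertools import zip_longest

# Made-up sample data: the second list is shorter than the first
original = ["line one", "line two", "line three"]
edited = ["line one"]

# zip_longest pads the shorter list with fillvalue until both have equal length
pairs = list(zip_longest(original, edited, fillvalue=""))

text1_column = [pair[0] for pair in pairs]
text2_column = [pair[1] for pair in pairs]

print(text1_column)
print(text2_column)
```

Using fillvalue=None instead would mark the missing rows with None, as in the if/elif version.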


Minimal working code

import pandas as pd
import difflib

original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

#d = difflib.Differ()
#diff = d.compare(original, edited)

#for line in diff:
#    print(line)

# fill missing lines    
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [None]*diff_row_number
elif diff_row_number < 0:    
    original = original + [None]*(-diff_row_number)

df = pd.DataFrame({
   'Text1': original,
   'Text2': edited,
})

def compare_row(row):
    text1, text2 = row
    
    if text2 is None:
        return 'remove: ' + text1
    else:
        sm = difflib.SequenceMatcher(a=text1, b=text2)
        opcodes = sm.get_opcodes()
        
        changes = []
        
        for item in opcodes:
            if item[0] != 'equal':
                name, a1,a2, b1,b2 = item
                changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
                
        return '\n'.join(changes)

#  - main  -

df['Change'] = df.apply(compare_row, axis=1)

print(df['Change'])

Result:

0                                                     
1                                                     
2                                 replace : IS 8.5 : t
3                                 replace : , such : .
4    remove: as those used by commercial hosting pr...
5                         remove: own cloud offerings.
Name: Change, dtype: object

Edit:

The same code with the empty string instead of None

import pandas as pd
import difflib

original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

#d = difflib.Differ()
#diff = d.compare(original, edited)

#for line in diff:
#    print(line)

# fill missing lines    
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [""]*diff_row_number
elif diff_row_number < 0:    
    original = original + [""]*(-diff_row_number)    

df = pd.DataFrame({
   'Text1': original,
   'Text2': edited,
})

def compare_row(row):
    text1, text2 = row
    
    sm = difflib.SequenceMatcher(a=text1, b=text2)
    opcodes = sm.get_opcodes()
    
    changes = []
    
    for item in opcodes:
        if item[0] != 'equal':
            name, a1,a2, b1,b2 = item
            changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
            
    return '\n'.join(changes)

#  - main  -

df['Change'] = df.apply(compare_row, axis=1)

print(df['Change'])

Result (delete instead of remove):

0                                                     
1                                                     
2                                 replace : IS 8.5 : t
3                                 replace : , such : .
4    delete : as those used by commercial hosting p...
5                     delete : own cloud offerings. : 
Name: Change, dtype: object
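Applied to the two rows in the question, a sketch along the same lines might look like the following. The rule used here is my assumption about what the Change column should contain: for delete the removed piece of Text1 is kept, and for insert or replace the new piece of Text2 is kept. The function name changed_text is hypothetical.

```python
import difflib
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Text1": ["This is Mango. This is Banana", "This is Mango."],
    "Text2": ["This is Banana", "This is Mango, Banana"],
})

def changed_text(row):
    text1, text2 = row["Text1"], row["Text2"]
    sm = difflib.SequenceMatcher(a=text1, b=text2)

    pieces = []
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag == "delete":
            # Assumption: report text that exists only in Text1
            pieces.append(text1[a1:a2])
        elif tag in ("insert", "replace"):
            # Assumption: report the new text found in Text2
            pieces.append(text2[b1:b2])

    # Strip surrounding whitespace from each piece before joining
    return " ".join(piece.strip() for piece in pieces)

df["Change"] = df.apply(changed_text, axis=1)

print(df["Change"])
```

With this rule the two rows yield 'This is Mango.' and ', Banana', matching the table in the question. Other data may need a different rule, for example keeping both sides of a replace.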
