Python: differences between string columns in Excel

1 answer

User

#1 · posted 2024-06-16 10:13:29

difflib was created for comparing strings (source code in particular).

You should use

df['Change'] = df.apply(function_name, axis=1)

to run your own function on every row and compare the two texts in that row.
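As a minimal sketch of what `axis=1` does, the example below passes each row to a function as a Series. The column names and the helper `is_changed` are made up for illustration; they are not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({
    "Text1": ["same", "old text"],
    "Text2": ["same", "new text"],
})

def is_changed(row):
    # with axis=1 the function receives one row at a time as a Series,
    # so the two cells can be read by column name
    return row["Text1"] != row["Text2"]

df["Changed"] = df.apply(is_changed, axis=1)
print(df)
```

Any function that takes a row and returns a single value can be plugged in the same way, including one that builds a description of the differences.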

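But using Differ() to get the changes is not useful here, because it gives the result as text.
You should use SequenceMatcher, whose get_opcodes() gives each change as a tuple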

(operation, text1_start, text1_end, text2_start, text2_end)
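To make the difference concrete, here is a small sketch that runs both tools on one pair of strings taken from the example data. It only assumes the standard `difflib` module; the variable names are mine.

```python
import difflib

text1 = "IIS 8.5 has several improvements related"
text2 = "It has several improvements related"

# Differ yields plain strings, each starting with a two-character marker
# such as '- ', '+ ', '? ' or '  ', so the positions would have to be parsed back out
differ_lines = list(difflib.Differ().compare([text1], [text2]))
for line in differ_lines:
    print(repr(line))

# SequenceMatcher yields 5-tuples whose numbers can be used directly as slice bounds
opcodes = difflib.SequenceMatcher(a=text1, b=text2).get_opcodes()
for tag, a1, a2, b1, b2 in opcodes:
    print(tag, repr(text1[a1:a2]), repr(text2[b1:b2]))
```

The first element of each tuple is one of 'equal', 'replace', 'delete' or 'insert', which is why the code in this answer skips the 'equal' entries.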

To create the DataFrame I had to add the missing rows to column Text2, and I used None to mark the rows that are missing.

diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [None]*diff_row_number
elif diff_row_number < 0:    
    original = original + [None]*(-diff_row_number)

If you use the empty string "" instead, you don't need the `if text2 is None:` part.
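The same padding could probably also be done with `itertools.zip_longest`, which fills the shorter list for you. This is an alternative sketch, not what the answer's code does, and the short lists are placeholder data.

```python
from itertools import zip_longest

import pandas as pd

original = ["line 1", "line 2", "line 3"]
edited = ["line 1"]

# zip_longest pairs the items up and pads the shorter list with fillvalue,
# so it does not matter which of the two lists is longer
rows = list(zip_longest(original, edited, fillvalue=""))

df = pd.DataFrame(rows, columns=["Text1", "Text2"])
print(df)
```

With `fillvalue=None` you would get the None-padded variant instead.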

Minimal working code

import pandas as pd
import difflib

original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

#d = difflib.Differ()
#diff = d.compare(original, edited)

#for line in diff:
#    print(line)

# fill missing lines    
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [None]*diff_row_number
elif diff_row_number < 0:    
    original = original + [None]*(-diff_row_number)

df = pd.DataFrame({
   'Text1': original,
   'Text2': edited,
})

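def compare_row(row):
    # each row holds the two texts to compare
    text1, text2 = row

    if text2 is None:
        # line exists only in the original text
        return 'remove: ' + text1
    elif text1 is None:
        # line exists only in the edited text (original was padded with None)
        return 'insert: ' + text2
    else:
        sm = difflib.SequenceMatcher(a=text1, b=text2)
        opcodes = sm.get_opcodes()

        changes = []

        for item in opcodes:
            # skip the parts that are identical in both texts
            if item[0] != 'equal':
                name, a1, a2, b1, b2 = item
                changes.append(name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2])

        return '\n'.join(changes)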

#  - main  -

df['Change'] = df.apply(compare_row, axis=1)

print(df['Change'])

Result:

0                                                     
1                                                     
2                                 replace : IS 8.5 : t
3                                 replace : , such : .
4    remove: as those used by commercial hosting pr...
5                         remove: own cloud offerings.
Name: Change, dtype: object

EDIT:

The same with an empty string instead of None

import pandas as pd
import difflib

original = ["About the IIS", "", "IIS 8.5 has several improvements related", "to performance in large-scale scenarios, such", "as those used by commercial hosting providers and Microsoft's", "own cloud offerings."]
edited = ["About the IIS", "", "It has several improvements related", "to performance in large-scale scenarios."]

#d = difflib.Differ()
#diff = d.compare(original, edited)

#for line in diff:
#    print(line)

# fill missing lines    
diff_row_number = len(original) - len(edited)

if diff_row_number > 0:
    edited = edited + [""]*diff_row_number
elif diff_row_number < 0:    
    original = original + [""]*(-diff_row_number)    

df = pd.DataFrame({
   'Text1': original,
   'Text2': edited,
})

def compare_row(row):
    text1, text2 = row
    
    sm = difflib.SequenceMatcher(a=text1, b=text2)
    opcodes = sm.get_opcodes()
    
    changes = []
    
    for item in opcodes:
        if item[0] != 'equal':
            name, a1,a2, b1,b2 = item
            changes.append( name + ' : ' + text1[a1:a2] + ' : ' + text2[b1:b2] )
            
    return '\n'.join(changes)

#  - main  -

df['Change'] = df.apply(compare_row, axis=1)

print(df['Change'])

Result (delete instead of remove):

0                                                     
1                                                     
2                                 replace : IS 8.5 : t
3                                 replace : , such : .
4    delete : as those used by commercial hosting p...
5                     delete : own cloud offerings. : 
Name: Change, dtype: object
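If the two columns come from an Excel file, empty cells are read as NaN rather than as strings, and SequenceMatcher cannot work on them. Below is a rough sketch of how that might be handled. The helpers `describe_changes` and `add_change_column`, the file names, and the column names are all hypothetical.

```python
import difflib

import pandas as pd

def describe_changes(text1, text2):
    # same idea as compare_row: keep every opcode that is not 'equal'
    sm = difflib.SequenceMatcher(a=text1, b=text2)
    return "\n".join(
        f"{tag} : {text1[a1:a2]} : {text2[b1:b2]}"
        for tag, a1, a2, b1, b2 in sm.get_opcodes()
        if tag != "equal"
    )

def add_change_column(df, col1="Text1", col2="Text2"):
    # missing cells become "" so that both arguments are always strings
    clean = df[[col1, col2]].fillna("").astype(str)
    out = df.copy()
    out["Change"] = [
        describe_changes(a, b) for a, b in zip(clean[col1], clean[col2])
    ]
    return out

# Hypothetical usage with a real file (file names are placeholders):
# df = pd.read_excel("texts.xlsx")
# add_change_column(df).to_excel("texts_with_changes.xlsx", index=False)

demo = pd.DataFrame({"Text1": ["same", "gone"], "Text2": ["same", None]})
result = add_change_column(demo)
print(result["Change"])
```

Filling with "" matches the second version above, so a missing cell shows up as a delete (or insert) of the whole text.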
