Converting a Perl script to Python: de-duplicating two files based on hash keys

1 vote
3 answers
1551 views
Asked on 2025-04-15 16:17

I'm just starting to learn Python and would like to ask: could someone convert a fairly simple Perl script to Python for me?

The script reads two files and, by comparing hash keys, outputs only the lines that are unique to the second file. It also writes the duplicate lines to a separate file. I've found Perl to be very fast at this kind of de-duplication, so I'd like to see how Python compares.

#! /usr/bin/perl

## Compare file1 and file2 and output only the unique lines from file2.

## Opening file1.txt and store the data in a hash.
open my $file1, '<', "file1.txt" or die $!;
while ( <$file1> ) {
    my $name = $_;
    $file1hash{$name}=$_;
}
## Opening file2.txt and store the data in a hash.
open my $file2, '<', "file2.txt" or die $!;

while ( <$file2> ) {
    $name = $_;
    $file2hash{$name}=$_;
}

open my $dfh, '>', "duplicate.txt";

## Compare the keys and remove the duplicate one in the file2 hash
foreach ( keys %file1hash ) {
    if ( exists ( $file2hash{$_} )) {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}

open my $ofh, '>', "file2_clean.txt";
print $ofh values(%file2hash);

I tested both the Perl and Python scripts on two files with more than a million lines each, and the total run time was under 6 seconds. That performance is excellent for this business requirement!

I modified the script Kriss provided and am very happy with the result: 1) the script's performance, and 2) how simple it was to modify the script to make it more flexible:

#!/usr/bin/env python

import os

filename1 = raw_input("What is the first file name to compare? ")
filename2 = raw_input("What is the second file name to compare? ")

file1set = set([line for line in file(filename1)])
file2set = set([line for line in file(filename2)])

for name, results in [
    (os.path.abspath(os.getcwd()) + "/duplicate.txt", file1set.intersection(file2set)),
    (os.path.abspath(os.getcwd()) + "/" + filename2 + "_clean.txt", file2set.difference(file1set))]:
    with file(name, 'w') as fh:
        for line in results:
            fh.write(line)

3 Answers

3

Here is a slightly different solution that is more memory-efficient for very large files. It only builds a set for the original file, since there seems to be no need to hold all of file2 in memory at once:

with open("file1.txt", "r") as file1:
    file1set = set(line.rstrip() for line in file1)

with open("file2.txt", "r") as file2:
    with open("duplicate.txt", "w") as dfh:
        with open("file2_clean.txt", "w") as ofh:
            for line in file2:
                if line.rstrip() in file1set:
                    dfh.write(line)     # duplicate line
                else:
                    ofh.write(line)     # not duplicate

Note that if you want the comparison to include trailing whitespace and newlines, you can replace the second line.rstrip() with line and simplify the second line to:

    file1set = set(file1)

Also, as of Python 3.1 the with statement can handle multiple items, so these three with statements could be combined into one.
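As a rough sketch of that combined form (assuming Python 3.1 or newer, and the same file names as above):

with open("file1.txt", "r") as file1:
    file1set = set(line.rstrip() for line in file1)

# A single with statement manages all three remaining files (Python 3.1+).
with open("file2.txt", "r") as file2, \
     open("duplicate.txt", "w") as dfh, \
     open("file2_clean.txt", "w") as ofh:
    for line in file2:
        if line.rstrip() in file1set:
            dfh.write(line)     # duplicate line
        else:
            ofh.write(line)     # not duplicate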

4

Yet another variant (merely a syntactic variation on the other proposals; there really are many ways to do this in Python).

file1set = set([line for line in file("file1.txt")])
file2set = set([line for line in file("file2.txt")])

for name, results in [
    ("duplicate.txt", file1set.intersection(file2set)),
    ("file2_clean.txt", file2set.difference(file1set))]:
    with file(name, 'w') as fh:
        for line in results:
            fh.write(line)

By the way, we ought to contribute a Perl version as well; the one shown earlier doesn't look very Perl-like... Below is the Perl equivalent of my Python version. It doesn't look all that different from the original. The point I want to stress is that, in the proposed answers, the issue is mostly algorithmic and largely language-agnostic rather than a matter of Perl versus Python.

use strict;

open my $file1, '<', "file1.txt" or die $!;
my %file1hash = map { $_ => 1 } <$file1>;

open my $file2, '<', "file2.txt" or die $!;
my %file2hash = map { $_ => 1 } <$file2>;

for (["duplicate.txt", [grep $file1hash{$_}, keys(%file2hash)]],
     ["file2_clean.txt", [grep !$file1hash{$_}, keys(%file2hash)]]){
    my ($name, $results) = @$_;
    open my $fh, ">$name" or die $!;
    print $fh @$results;
}
7

If you don't care too much about order, you can use sets in Python:

file1=set(open("file1").readlines())
file2=set(open("file2").readlines())
intersection = file1 & file2 #common lines
non_intersection = file2 - file1  #uncommon lines (in file2 but not file1)
for items in intersection:
    print items
for nitems in non_intersection:
    print nitems

There are other approaches too, such as using the difflib and filecmp libraries.
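As a minimal sketch of what those two modules offer (they compare files as ordered sequences of lines rather than as sets, so they answer "are these files the same / what changed" rather than doing key-based de-duplication; file names as above):

import filecmp
import difflib
import sys

# filecmp: True if the two files have identical contents.
print filecmp.cmp("file1", "file2", shallow=False)

# difflib: a unified diff of file2 against file1, line by line.
data1 = open("file1").readlines()
data2 = open("file2").readlines()
for d in difflib.unified_diff(data1, data2, fromfile="file1", tofile="file2"):
    sys.stdout.write(d)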

Another approach is a plain list comparison.

# lines in file2 common with file1
data1=map(str.rstrip,open("file1").readlines())
for line in open("file2"):
    line=line.rstrip()
    if line in data1:
        print line

# lines in file2 not in file1, use "not"
data1=map(str.rstrip,open("file1").readlines())
for line in open("file2"):
    line=line.rstrip()
    if not line in data1:
        print line
