Converting a Perl script to Python: deduplicating two files by hash key
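I'm just starting to learn Python and was wondering whether someone could convert a fairly simple Perl script to Python.
The script reads two files and, by comparing hash keys, outputs only the lines that are unique to the second file. It also writes the duplicate lines to a separate file. I've found this way of deduplicating to be very fast in Perl, and I'd like to see how Python compares.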
#! /usr/bin/perl
## Compare file1 and file2 and output only the unique lines from file2.
## Opening file1.txt and store the data in a hash.
open my $file1, '<', "file1.txt" or die $!;
while ( <$file1> ) {
    my $name = $_;
    $file1hash{$name}=$_;
}
## Opening file2.txt and store the data in a hash.
open my $file2, '<', "file2.txt" or die $!;
while ( <$file2> ) {
    $name = $_;
    $file2hash{$name}=$_;
}
open my $dfh, '>', "duplicate.txt";
## Compare the keys and remove the duplicate one in the file2 hash
foreach ( keys %file1hash ) {
    if ( exists ( $file2hash{$_} ))
    {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}
open my $ofh, '>', "file2_clean.txt";
print $ofh values(%file2hash) ;
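I tested the Perl and Python scripts on two files of more than a million lines each, and the total run time was under 6 seconds. For this business need, that performance is excellent!
I modified the script Kriss provided and am very happy with the result on two counts: 1) the script's performance, and 2) how easy it was to modify the script to make it more flexible: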
#!/usr/bin/env python3
import os

filename1 = input("What is the first file name to compare? ")
filename2 = input("What is the second file name to compare? ")

# Load each file's lines into a set (repeated lines within a file collapse).
with open(filename1) as f1, open(filename2) as f2:
    file1set = set(f1)
    file2set = set(f2)

# Write the shared lines and the file2-only lines to the current directory.
for name, results in [
        (os.path.join(os.getcwd(), "duplicate.txt"), file1set.intersection(file2set)),
        (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)
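To check a timing figure like the one quoted above on your own machine, a rough harness along these lines could be used. This is only a sketch: the helper names dedupe and make_file, the file names, and the generated test data are all made up for illustration, and the numbers will vary with hardware and data.

```python
import os
import random
import tempfile
import time


def dedupe(path1, path2, dup_path, clean_path):
    """Split path2's lines into those shared with path1 and those unique to path2."""
    with open(path1) as f1, open(path2) as f2:
        set1, set2 = set(f1), set(f2)
    common = set1 & set2
    unique = set2 - set1
    with open(dup_path, "w") as dfh:
        dfh.writelines(common)
    with open(clean_path, "w") as ofh:
        ofh.writelines(unique)
    return len(common), len(unique)


def make_file(path, n_lines, seed):
    """Write n_lines pseudo-random numeric lines (hypothetical test data)."""
    rng = random.Random(seed)
    with open(path, "w") as fh:
        for _ in range(n_lines):
            fh.write("%d\n" % rng.randrange(n_lines * 2))


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    paths = [os.path.join(workdir, name) for name in
             ("file1.txt", "file2.txt", "duplicate.txt", "file2_clean.txt")]
    make_file(paths[0], 1000000, seed=1)
    make_file(paths[1], 1000000, seed=2)
    start = time.perf_counter()
    dups, uniques = dedupe(*paths)
    print("%d duplicate, %d unique lines in %.2f s"
          % (dups, uniques, time.perf_counter() - start))
```

Only the dedupe call is timed, so the cost of generating the test files is excluded.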
3 Answers
3
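Here is a slightly different solution that is more memory-efficient on very large files. It builds a set only for the original file (file1), since there seems to be no need to hold all of file2 in memory at once: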
with open("file1.txt", "r") as file1:
    file1set = set(line.rstrip() for line in file1)
with open("file2.txt", "r") as file2:
    with open("duplicate.txt", "w") as dfh:
        with open("file2_clean.txt", "w") as ofh:
            for line in file2:
                if line.rstrip() in file1set:
                    dfh.write(line)  # duplicate line
                else:
                    ofh.write(line)  # not duplicate
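Note that if you want trailing whitespace and newlines to count in the comparison, you can replace the second line.rstrip() with line and simplify the second line of the script to:
    file1set = set(file1)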
Also, starting with Python 3.1, the with statement can manage several items at once, so the three nested with statements can be combined into one.
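The combined form might look like the following minimal sketch, assuming Python 3.1 or later. The function name split_file2 and its parameters are made up for illustration so the logic can be reused with other file names.

```python
def split_file2(path1, path2, dup_path, clean_path):
    """Write lines of path2 that also occur in path1 to dup_path, the rest to clean_path."""
    with open(path1, "r") as file1:
        file1set = set(line.rstrip() for line in file1)

    # A single with statement manages all three files.
    with open(path2, "r") as file2, \
         open(dup_path, "w") as dfh, \
         open(clean_path, "w") as ofh:
        for line in file2:
            if line.rstrip() in file1set:
                dfh.write(line)  # duplicate line
            else:
                ofh.write(line)  # not duplicate
```

Calling split_file2("file1.txt", "file2.txt", "duplicate.txt", "file2_clean.txt") reproduces the file names used in the question. Because file2 is streamed line by line, the output keeps file2's original order and any repeated lines.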
4
Yet another variant (only a syntactic variation on the other proposals; there are many ways to do this in Python).
# Read each file's lines into a set.
with open("file1.txt") as f1, open("file2.txt") as f2:
    file1set = set(f1)
    file2set = set(f2)
for name, results in [
        ("duplicate.txt", file1set.intersection(file2set)),
        ("file2_clean.txt", file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)
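By the way, we should also contribute a Perl version, since the one given earlier doesn't look very Perlish. Below is the Perl equivalent of my Python version. It doesn't look much different from the original. My point is that, in the proposed answers, the issue is as much algorithmic and language-independent as it is a matter of Perl versus Python.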
use strict;
open my $file1, '<', "file1.txt" or die $!;
my %file1hash = map { $_ => 1 } <$file1>;
open my $file2, '<', "file2.txt" or die $!;
my %file2hash = map { $_ => 1 } <$file2>;
for (["duplicate.txt", [grep $file1hash{$_}, keys(%file2hash)]],
     ["file2_clean.txt", [grep !$file1hash{$_}, keys(%file2hash)]]){
    my ($name, $results) = @$_;
    open my $fh, ">$name" or die $!;
    print $fh @$results;
}
7
If you don't care much about order, you can use sets in Python:
with open("file1") as f1, open("file2") as f2:
    file1 = set(f1.readlines())
    file2 = set(f2.readlines())
intersection = file1 & file2      # common lines
non_intersection = file2 - file1  # uncommon lines (in file2 but not file1)
for items in intersection:
    print(items, end="")   # each line already ends with its own newline
for nitems in non_intersection:
    print(nitems, end="")
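As a small illustration of the two set operators used above, here is a sketch on in-memory data. The sample lines are made up. Note that a set also collapses repeated lines, so a line that occurs twice in file2 is reported only once.

```python
file1 = {"apple\n", "banana\n", "cherry\n"}
# The list stands in for file2's lines; "date\n" occurs twice on purpose.
file2 = set(["banana\n", "date\n", "date\n", "apple\n"])

common = file1 & file2          # lines present in both
only_in_file2 = file2 - file1   # lines in file2 but not in file1

# Sets have no defined order, so sort before displaying.
print(sorted(common))           # ['apple\n', 'banana\n']
print(sorted(only_in_file2))    # ['date\n']
```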
There are other approaches as well, such as the difflib and filecmp modules.
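As a rough sketch of what those two modules offer: filecmp.cmp reports whether two files are identical, and difflib.ndiff produces a line-by-line delta whose lines are prefixed with "- ", "+ ", "  " or "? ". Neither does the duplicate/unique split on its own, and a diff is position-sensitive, so this answers a slightly different question than the set approach. It is also likely to be much slower on million-line files. The helper name added_lines is made up for illustration.

```python
import difflib
import filecmp


def added_lines(path1, path2):
    """Return the lines that a diff marks as present in path2 but not in path1."""
    # shallow=False forces a content comparison rather than comparing os.stat() signatures.
    if filecmp.cmp(path1, path2, shallow=False):
        return []  # identical files: nothing was added
    with open(path1) as f1, open(path2) as f2:
        delta = difflib.ndiff(f1.readlines(), f2.readlines())
        # Keep only the "+ " entries and drop the two-character prefix.
        return [entry[2:] for entry in delta if entry.startswith("+ ")]
```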
Another approach is to use plain list comparison.
# lines in file2 common with file1
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]  # a list, so it can be searched repeatedly
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line in data1:
            print(line)

# lines in file2 not in file1, use "not"
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line not in data1:
            print(line)