Sorting a file by key using awk or perl, like join without pre-sorting

import pandas as pd
from random import shuffle

a = ['bar', 'qux', 'baz', 'foo', 'spam']
df = pd.DataFrame({'nam': a, 'asc': [1, 2, 3, 4, 5], 'desc': [5, 4, 3, 2, 1]})
shuffle(a)
print(a)
dex = pd.DataFrame({'dex': a})
df_b = pd.DataFrame({'VAL1': [0, 1, 2, 3, 4, 5, 6]})
pd.merge(dex, df, left_on='dex', right_on='nam')[['asc', 'desc', 'nam']]

3 answers

User

Answer 1 · edited 2024-04-25 09:51:30

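The important thing is not to split unnecessarily. If you have enough memory, loading the smaller file into a hash and then reading through the second file line by line should work.

Consider the following example (note that this script's run time includes the time needed to create the sample data):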

#!/usr/bin/env perl

use strict;
use warnings;

# This is a string containing 10 lines corresponding to your "file one"
# Second column has the record ID
# Normally, you'd be reading this from a file

my $big_file = join "\n",
    map join("\t", 'x', $_, ('x') x 3_000_000),
    1 .. 10
;

# This is a string containing 10 lines corresponding to your "file two"
# Second column has the record ID

my $small_file = join "\n",
    map join("\t", 'y', $_, ('y') x 10),
    1 .. 10
;

# You would normally pass file names as arguments

join_with_big_file(
    \$small_file,
    \$big_file,
);

sub join_with_big_file {
    my $small_records = load_small_file(shift);
    my $big_file = shift;

    open my $fh, '<', $big_file
        or die "Cannot open '$big_file': $!";

    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $id, $rest) = split /\t/, $line, 3;
        print join("\t", $first, $id, $rest, $small_records->{$id}), "\n";
    }

    return;
}

sub load_small_file {
    my $file = shift;
    my %records;

    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";

    while (my $line = <$fh>) {
        # strip the newline so it does not end up inside the joined record
        chomp $line;
        # limit the split
        my ($first, $id, $rest) = split /\t/, $line, 3;
        # I drop the id field here so it is not duplicated in the joined
        # file. If that is not a problem, $records{$id} = $line
        # would be better.
        $records{$id} = join("\t", $first, $rest);
    }

    return \%records;
}

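User

Answer 2 · edited 2024-04-25 09:51:30

If file1 is gigabytes in size and has 3 million columns of data, then it has only a small number of rows (no more than 200). While you cannot load all the rows themselves into memory, you can easily load the offsets of all of them.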

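use strict;
use warnings;
use feature qw( say );

use Fcntl qw( SEEK_SET );

# Assumption: the two file names are passed on the command line
# (the original snippet left $qfn1 and $qfn2 undefined).
my ($qfn1, $qfn2) = @ARGV;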

open(my $fh1, '<', $qfn1) or die("Can't open \"$qfn1\": $!\n");
open(my $fh2, '<', $qfn2) or die("Can't open \"$qfn2\": $!\n");

my %offsets;
while (1) {
   my $offset = tell($fh1);
   my $row1 = <$fh1>;
   last if !defined($row1);

   chomp($row1);
   my @fields1 = split(/\t/, $row1);
   my $key = $fields1[1];
   $offsets{$key} = $offset;
}

while (my $row2 = <$fh2>) {
   chomp($row2);
   my @fields2 = split(/\t/, $row2);
   my $key = $fields2[1];
   my $offset = $offsets{$key};
   if (!defined($offset)) {
      warn("Key $key not found.\n");
      next;
   }

   seek($fh1, $offset, SEEK_SET);
   my $row1 = <$fh1>;
   chomp($row1);
   my @fields1 = split(/\t/, $row1);

   say join "\t", @fields2, @fields1[6..$#fields1];
}

The same approach can be taken in Python.
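A minimal Python sketch of the same offset-index idea, for comparison with the Perl above. The function names, the `key_col=1` and `skip=6` defaults (mirroring `$fields1[1]` and `@fields1[6..$#fields1]`), and the use of binary file objects are assumptions made for illustration, not part of the original answer. Binary mode is used so that `tell()` and `seek()` deal in plain byte offsets.

```python
import io
import sys


def build_offset_index(fh1, key_col=1):
    """Map each key in file1 to the byte offset at which its row starts."""
    offsets = {}
    while True:
        offset = fh1.tell()
        row = fh1.readline()
        if not row:
            break
        # Split only as far as the key column; the rest of the row may be huge.
        key = row.rstrip(b"\n").split(b"\t", key_col + 1)[key_col]
        offsets[key] = offset
    return offsets


def join_in_file2_order(fh1, fh2, out, key_col=1, skip=6):
    """For each row of file2, seek to the matching row of file1 and join them."""
    offsets = build_offset_index(fh1, key_col)
    for row2 in fh2:
        fields2 = row2.rstrip(b"\n").split(b"\t")
        offset = offsets.get(fields2[key_col])
        if offset is None:
            print("Key %r not found." % fields2[key_col], file=sys.stderr)
            continue
        fh1.seek(offset)
        fields1 = fh1.readline().rstrip(b"\n").split(b"\t")
        # Append file1's columns from index `skip` onward, as the Perl does.
        out.write(b"\t".join(fields2 + fields1[skip:]) + b"\n")


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two files.
    f1 = io.BytesIO(b"a\tk1\tc\td\te\tf\tX1\tX2\nb\tk2\tc\td\te\tf\tY1\n")
    f2 = io.BytesIO(b"p\tk2\tq\nr\tk1\ts\n")
    result = io.BytesIO()
    join_in_file2_order(f1, f2, result)
    sys.stdout.write(result.getvalue().decode())
```

With real files, both would be opened with `open(path, "rb")`. Only one row of file1 is held in memory at a time, plus one integer per key.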

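Note: if the order is more flexible, a simpler solution exists (that is, if you are happy for the output to follow the order in which the records appear in file1). It assumes file2 easily fits in memory.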
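The simpler variant mentioned in the note could look roughly like the following Python sketch: file2 is loaded into a dictionary and file1 is streamed once, so the output follows file1's order. The function name and the `key_col=1` and `skip=6` defaults are hypothetical, chosen only to match the column layout used in the Perl code above.

```python
import io
import sys


def join_in_file1_order(fh1, fh2, out, key_col=1, skip=6):
    """Load file2 into memory, then stream file1 and emit joined rows."""
    # file2 is assumed to be small enough to keep entirely in a dict.
    records2 = {}
    for row2 in fh2:
        fields2 = row2.rstrip("\n").split("\t")
        records2[fields2[key_col]] = fields2

    # file1 is read one row at a time, so only a single wide row is in memory.
    for row1 in fh1:
        fields1 = row1.rstrip("\n").split("\t")
        fields2 = records2.get(fields1[key_col])
        if fields2 is None:
            continue  # no matching record in file2
        out.write("\t".join(fields2 + fields1[skip:]) + "\n")


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two files.
    f1 = io.StringIO("a\tk1\tc\td\te\tf\tX1\tX2\nb\tk2\tc\td\te\tf\tY1\n")
    f2 = io.StringIO("p\tk2\tq\nr\tk1\ts\n")
    join_in_file1_order(f1, f2, sys.stdout)
```

No seeking is needed here, which is why this version also works when file1 arrives on a pipe.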

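User

Answer 3 · edited 2024-04-25 09:51:30

Three million columns of data, eh? Sounds like you are doing NLP work.

Assuming that is true and the matrix is sparse, Python can handle it just fine (just not with pandas). Have a look at scipy.sparse. Example: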

from scipy.sparse import dok_matrix

A = dok_matrix((10,10))
A[1,1] = 1

B = dok_matrix((10,10))
B[2,2] = 2

print(A + B)

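DOK stands for "Dictionary Of Keys". It is typically used to build a sparse matrix incrementally, which is then converted to CSR or another format depending on how it will be used. See available sparse matrix types.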
