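Extracting alphanumeric text from a CSV file
I have a grid of locations (A through I and 1 through 9) that shows up in a flat file (*.csv) in various forms, sometimes with spaces and random capitalization, for example: 9-H, @ b 3, e-4, d4, c6, 5h, C2, i9, and so on. Each one is some combination of a letter a to i and a digit 1 to 9, possibly with spaces, ~, @, and - mixed in.
What would be a good way to extract these letter-number pairs? Ideally the output would go in another column before "Notes", or in a separate text file. I can read scripts and understand what they do, but I'm not yet comfortable writing my own.
Sample input file: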
Record Notes
46651 Adrian reported green-pylons are in central rack. (e-4)
46652 Jose enetered location of triangles in the uppur corner. (b/c6)
46207 [Location: 5h] Gabe located the long pipes in the near the far corner.
46205 Committee-reports are in boxes in holding area, @ b 3).
45164 Caller-nu,mbers @ 1A
45165 All carbon rod tackles 3 F and short (top rack)
45166 USB(3 Port) in C2
45167 Full tackle in b2.
45168 5b; USB(4 port)
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD.
45169 Persistent CORDS ~i9
45170 Deliverate handball moved to D-2 on instructions from Pete
45440 slides and overheads + contact-sheets to 9-H (top bin).
45441 d7-slides and negatives (black and white)
<eof>
Desired output (in alphanumeric form, in the same file or a new one):
Record Location Notes
46651 E4
46652 C6
46205 A1
...
46169 I9
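That is, always extract the latter characters (for example, C6 from b/c6).
OK folks, after running into the error "Use of uninitialized value $note in pattern match (m//)", I kept at it and got partway there.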
# starts with anything then space or punctuation then letter then number
if ($note =~ /.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/) {
    $note =~ s/.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x;
# starts line with letter then number
} elsif ($note =~ /^([a-iA-I])[\s\p{Punct}]*([0-9]).*/) {
    $note =~ s/^([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x;
# after punctuation then number
} elsif ($note =~ /.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/) {
    $note =~ s/.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x;
# beginning of line with number
} elsif ($note =~ /^([0-9])[\s\p{Punct}]*([a-iA-I]).*/) {
    $note =~ s/^([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x;
# empty line or no record of any grid location except "#7 asdfg" format
} elsif ($note=~ "") {
    $note = "##";
}
Where the script does not quite work is on records like 99994 and 99993.
99999 norecordofgridhere --
99998
99997 box #7 entered the array with out invoice.
99996 was down in h 7 and the coachela was in e 8 when I found off-field.
99994 cartons in office after 4 buckets
99993 6 boxes in office file cabinet top-shelf
The current output is:
99999 # # norecordofgridhere --
99998 # #
99997 E 7 box #7 entered the array with out invoice.
99996 E 8 was down in h 7 and the coachela was in e 8 when I found off-field.
99994 B 4 cartons in office after 4 buckets
99993 B 6 6 boxes in office file cabinet top-shelf
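Records 99994 and 99993 should have been marked with #. Where did I go wrong, and how can I fix it?
I suspect there is an easier way, such as using Text::CSV_XS, but I had trouble with it under Strawberry Perl: it would not work even though the module's tests passed and it installed properly. So I went back to ActiveState Perl.
2 Answers
Use Text::CSV_XS to parse the CSV file; it is fast and accurate.
Next, build up a regular expression to match the IDs.
Finally, normalize each ID.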
#!/usr/bin/perl
use v5.10;
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
# Build up the regular expression to look for IDs
my $Separator_Set = qr{ [- ] }x;
my $ID_Letters_Set = qr{ [a-i] }xi;
my $ID_Numbers_Set = qr{ [1-9] }x;
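# Group the two orderings so that both \b anchors apply to either one;
# without the group, each \b would bind to only one side of the alternation.
my $Location_Re = qr{
    \b
    (?:
        $ID_Letters_Set $Separator_Set? $ID_Numbers_Set |
        $ID_Numbers_Set $Separator_Set? $ID_Letters_Set
    )
    \b
}x;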
# Initialize Text::CSV_XS and tell it this is a tab separated CSV
my $csv = Text::CSV_XS->new({
    sep_char => "\t", # tab separated fields
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();
# Read in and discard the CSV header line.
my $headers = $csv->getline(*DATA);
# Output our own header line
say "Record\tLocation\tNotes";
# Read each CSV row, extract and normalize the ID, and output a new row.
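while( my $row = $csv->getline(*DATA) ) {
    my($record, $notes) = @$row;
    # A row without a notes field would otherwise trigger an
    # "uninitialized value" warning in the pattern match below.
    $notes //= '';
    # Extract and normalize the ID
    my($id) = $notes =~ /($Location_Re)/;
    $id = normalize_id($id);
    # Output a new row
    printf "%d\t%s\t%s\n", $record, $id, $notes;
}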
sub normalize_id {
    my $id = shift;
    # Return empty string if we were passed in a blank
    return '' if !defined $id or !length $id or $id !~ /\S/;
    my($letter) = $id =~ /($ID_Letters_Set)/;
    my($number) = $id =~ /($ID_Numbers_Set)/;
    return uc($letter).$number;
}
__END__
Record Notes
46651 Adrian reported green-pylons are in central rack. (e-4)
46652 Jose enetered location of triangles in the uppur corner. (b/c6)
46207 [Location: 5h] Gabe located the long pipes in the near the far corner.
46205 Committee-reports are in boxes in holding area, @ b 3).
45164 Caller-nu,mbers @ 1A
45165 All carbon rod tackles 3 F and short (top rack)
45166 USB(3 Port) in C2
45167 Full tackle in b2.
45168 5b; USB(4 port)
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD.
45169 Persistent CORDS ~i9
45170 Deliverate handball moved to D-2 on instructions from Pete
45440 slides and overheads + contact-sheets to 9-H (top bin).
45441 d7-slides and negatives (black and white)
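For readers who, like the asker, cannot get Text::CSV_XS installed, the same idea can be sketched with core Perl only. This is a rough sketch under stated assumptions, not the method of the script above: it assumes every line is a record number, a tab, and the notes, with no quoting and no tabs that need special handling. The helper name extract_location and the use of ## as the "nothing found" placeholder (borrowed from the question) are illustrative choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the normalized grid location found in a note
# (first match wins), or '##' when the note is missing or has no location.
sub extract_location {
    my ($notes) = @_;
    return '##' unless defined $notes;

    # The (?: ... ) group makes both \b anchors apply to either ordering,
    # so "4 buckets" and "6 boxes" are not mistaken for B4 and B6.
    if ($notes =~ /\b(?:([a-i])[- ]?([1-9])|([1-9])[- ]?([a-i]))\b/i) {
        return defined $1 ? uc($1) . $2 : uc($4) . $3;
    }
    return '##';
}

# Sample rows written with explicit \t so the field separator is unambiguous.
my @lines = (
    "46651\tAdrian reported green-pylons are in central rack. (e-4)",
    "45164\tCaller-nu,mbers \@ 1A",
    "99994\tcartons in office after 4 buckets",
    "99993\t6 boxes in office file cabinet top-shelf",
    "99998",
);

print "Record\tLocation\tNotes\n";
for my $line (@lines) {
    # A limit of 2 keeps any later tabs inside the notes field.
    my ($record, $notes) = split /\t/, $line, 2;
    printf "%s\t%s\t%s\n", $record, extract_location($notes), $notes // '';
}
```

With these sample rows, records 99994, 99993, and 99998 get ## in the Location column, while 46651 and 45164 get E4 and A1. For real CSV input with quoted fields, a proper parser remains the safer choice.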
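In programming, we sometimes run into problems because the code is not well written or because we have not fully understood some concept. For example, someone may find that a function behaves differently from what they expected. When that happens, the code needs a careful check to find where things went wrong.
Programming languages also have many different features and idioms, and some of them can confuse beginners. Some languages, for instance, allow the same thing to be done in several ways, and it takes practice to find the approach that suits you best.
In short, when you hit a problem, take your time: analyze it step by step, consult the documentation, or ask someone else.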
...
my $coord;
if ($note =~ /
    (?&DEL)
    ( (?&ROW) (?&SEP)?+ (?&COL)
    | (?&COL) (?&SEP)?+ (?&ROW)
    )
    (?&DEL)
    (?(DEFINE)
        (?<ROW> [a-iA-I] )      # grid rows run from A to I
        (?<COL> [1-9] )
        (?<SEP> [\s~\@\-]++ )
        (?<DEL> ^ | \W | \z )
    )
/x) {
    $coord = $1;
    # Strip everything but the letter, then everything but the digit
    ( my $row = uc($coord) ) =~ s/[^A-I]//g;
    ( my $col = uc($coord) ) =~ s/[^1-9]//g;
    $coord = "$row$col";
}
...
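To check a pattern like this against the records from the question, it can help to wrap it in a small function. The following is a minimal sketch, assuming Perl 5.10 or later (needed for (?(DEFINE)...) and possessive quantifiers). The function name grid_coord, the choice of sample notes, and the ## fallback are illustrative and not part of the answer; the row range A to I is taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same structure as the pattern above, with rows A to I as in the question.
my $coord_re = qr/
    (?&DEL)
    ( (?&ROW) (?&SEP)?+ (?&COL)
    | (?&COL) (?&SEP)?+ (?&ROW)
    )
    (?&DEL)
    (?(DEFINE)
        (?<ROW> [a-iA-I] )
        (?<COL> [1-9] )
        (?<SEP> [\s~\@\-]++ )
        (?<DEL> ^ | \W | \z )
    )
/x;

# Hypothetical helper: returns e.g. "E4", or undef when no coordinate is found.
sub grid_coord {
    my ($note) = @_;
    return undef unless defined $note && $note =~ $coord_re;
    my $coord = uc $1;
    (my $row = $coord) =~ s/[^A-I]//g;    # keep only the letter
    (my $col = $coord) =~ s/[^1-9]//g;    # keep only the digit
    return "$row$col";
}

# A few notes from the question, including the two that went wrong.
for my $note ('(e-4)', '@ b 3).', '~i9', 'after 4 buckets', '6 boxes in office') {
    printf "%-20s => %s\n", $note, grid_coord($note) // '##';
}
```

The DEL rule requires a non-word character (or the start or end of the string) on both sides of the coordinate, which is why "4 buckets" and "6 boxes" yield no match here.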