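Perl or Python: Convert date from dd/mm/yyyy to yyyy-mm-dd
I have lots of dates in a column of a CSV file, in day/month/year format, for example 17/01/2010. I want to convert them to year-month-day format, i.e. 2010-01-17.
How can I do this conversion in Perl or Python?
8 Answers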
Use Time::Piece (in the Perl core since 5.9.5). It is very similar to the Python solution, because it provides both strptime and strftime:
use Time::Piece;
my $dt_str = Time::Piece->strptime('13/10/1979', '%d/%m/%Y')->strftime('%Y-%m-%d');
or
$ perl -MTime::Piece
print Time::Piece->strptime('13/10/1979', '%d/%m/%Y')->strftime('%Y-%m-%d');
1979-10-13
$
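Since the question is about dates stored in a CSV file, the same strptime/strftime round trip can be applied row by row. The following is only a minimal Python sketch: the helper names `convert_date` and `convert_csv`, the column index, and the in-memory sample data are all assumptions made for illustration, not part of the original answer.

```python
import csv
import io
from datetime import datetime


def convert_date(value, src_fmt="%d/%m/%Y", dst_fmt="%Y-%m-%d"):
    """Reformat one date string; raises ValueError if it does not parse."""
    return datetime.strptime(value, src_fmt).strftime(dst_fmt)


def convert_csv(src, dst, column):
    """Copy CSV rows from src to dst, reformatting the given column index."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        try:
            row[column] = convert_date(row[column])
        except (ValueError, IndexError):
            pass  # leave header rows, short rows, and non-dates untouched
        writer.writerow(row)


# In-memory demo (hypothetical data); real files would be opened with newline="".
src = io.StringIO("name,date\nalice,17/01/2010\nbob,13/10/1979\n")
dst = io.StringIO()
convert_csv(src, dst, column=1)
print(dst.getvalue())
```

Because strptime validates the value, something like 31/02/2010 is left unchanged here rather than silently rewritten.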
If you are guaranteed to have well-formed data consisting of nothing but a single date in DD/MM/YYYY format, then this works:
# FIRST METHOD
my $ndate = join("-" => reverse split(m[/], $date));
That works on a $date holding "07/04/1776", but fails on "this 17/01/2010 and that 01/17/2010 there". To avoid that problem, use:
# SECOND METHOD
($ndate = $date) =~ s{
\b
( \d \d )
/ ( \d \d )
/ ( \d {4} )
\b
}{$3-$2-$1}gx;
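For readers on the Python side, the word-boundary substitution above might be sketched as follows. The names `DATE_RE` and `reformat_dates` are made up for this example; like the Perl version, it only rearranges the digits and does not check that the result is a real calendar date.

```python
import re

# Two digits, slash, two digits, slash, four digits, delimited by word boundaries.
DATE_RE = re.compile(r"\b(\d\d)/(\d\d)/(\d{4})\b")


def reformat_dates(text):
    """Rewrite every dd/mm/yyyy occurrence in text as yyyy-mm-dd."""
    return DATE_RE.sub(r"\3-\2-\1", text)


print(reformat_dates("this 17/01/2010 and that 17/01/2010 there"))
```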
If you want a more "grammatical" regex, one that is easier to maintain and update, you can use this instead:
# THIRD METHOD
($ndate = $date) =~ s{
(?&break)
(?<DAY> (?&day) )
(?&slash) (?<MONTH> (?&month) )
(?&slash) (?<YEAR> (?&year) )
(?&break)
(?(DEFINE)
(?<break> \b )
(?<slash> / )
(?<year> \d {4} )
(?<month> \d {2} )
(?<day> \d {2} )
)
}{
join "-" => @+{qw<YEAR MONTH DAY>}
}gxe;
Finally, if you have Unicode data, you may want to be a bit more careful.
# FOURTH METHOD
($ndate = $date) =~ s{
(?&break_before)
(?<DAY> (?&day) )
(?&slash) (?<MONTH> (?&month) )
(?&slash) (?<YEAR> (?&year) )
(?&break_after)
(?(DEFINE)
(?<slash> / )
(?<start> \A )
(?<finish> \z )
# don't really want to use \D or [^0-9] here:
(?<break_before>
(?<= [\pC\pP\pS\p{Space}] )
| (?<= \A )
)
(?<break_after>
(?= [\pC\pP\pS\p{Space}]
| \z
)
)
(?<digit> \d )
(?<year> (?&digit) {4} )
(?<month> (?&digit) {2} )
(?<day> (?&digit) {2} )
)
}{
join "-" => @+{qw<YEAR MONTH DAY>}
}gxe;
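The idea behind the fourth method (accept a date only when it is bounded by control, punctuation, symbol, or whitespace characters, or by the ends of the string) can be approximated in Python by checking the neighbouring characters with the standard unicodedata module. This is a rough sketch under assumptions: `is_break` and `reformat_dates` are hypothetical names, and `str.isspace()` is only an approximation of Perl's `\p{Space}`.

```python
import re
import unicodedata

DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")


def is_break(ch):
    """True for whitespace or any character in general category C*, P*, or S*."""
    return ch.isspace() or unicodedata.category(ch)[0] in "CPS"


def reformat_dates(text):
    """Rewrite dd/mm/yyyy as yyyy-mm-dd only where both neighbours are breaks."""
    def repl(m):
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        # An empty neighbour means start or end of string, which counts as a break.
        if (before and not is_break(before)) or (after and not is_break(after)):
            return m.group(0)  # leave this candidate untouched
        return "-".join(m.group(3, 2, 1))
    return DATE_RE.sub(repl, text)


sample = "17/01/2010"
print(reformat_dates("from \u201c" + sample + "\u201d through\xa0" + sample))
```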
You can see how each of these four approaches performs when confronted with these sample input strings:
my $sample = q(17/01/2010);
my @strings = (
$sample, # trivial case
# multiple case
"this $sample and that $sample there",
# multiple case with non-ASCII BMP code points
# U+201C and U+201D are LEFT and RIGHT DOUBLE QUOTATION MARK
"from \x{201c}$sample\x{201d} through\xA0$sample",
# multiple case with non-ASCII code points
# from both the BMP and the SMP
# code point U+02013 is EN DASH, props \pP \p{Pd}
# code point U+10179 is GREEK YEAR SIGN, props \pS \p{So}
# code point U+110BD is KAITHI NUMBER SIGN, props \pC \p{Cf}
"\x{10179}$sample\x{2013}\x{110BD}$sample",
);
Now, letting $date be a foreach iterator over that array, we get this output:
Original is: 17/01/2010
First method: 2010-01-17
Second method: 2010-01-17
Third method: 2010-01-17
Fourth method: 2010-01-17
Original is: this 17/01/2010 and that 17/01/2010 there
First method: 2010 there-01-2010 and that 17-01-this 17
Second method: this 2010-01-17 and that 2010-01-17 there
Third method: this 2010-01-17 and that 2010-01-17 there
Fourth method: this 2010-01-17 and that 2010-01-17 there
Original is: from “17/01/2010” through 17/01/2010
First method: 2010-01-2010” through 17-01-from “17
Second method: from “2010-01-17” through 2010-01-17
Third method: from “2010-01-17” through 2010-01-17
Fourth method: from “2010-01-17” through 2010-01-17
Original is: 17/01/2010–17/01/2010
First method: 2010-01-2010–17-01-17
Second method: 2010-01-17–2010-01-17
Third method: 2010-01-17–2010-01-17
Fourth method: 2010-01-17–2010-01-17
Now suppose that you really do want to match non-ASCII digits, such as:
U+660 ARABIC-INDIC DIGIT ZERO
U+661 ARABIC-INDIC DIGIT ONE
U+662 ARABIC-INDIC DIGIT TWO
U+663 ARABIC-INDIC DIGIT THREE
U+664 ARABIC-INDIC DIGIT FOUR
U+665 ARABIC-INDIC DIGIT FIVE
U+666 ARABIC-INDIC DIGIT SIX
U+667 ARABIC-INDIC DIGIT SEVEN
U+668 ARABIC-INDIC DIGIT EIGHT
U+669 ARABIC-INDIC DIGIT NINE
or even
U+1D7F6 MATHEMATICAL MONOSPACE DIGIT ZERO
U+1D7F7 MATHEMATICAL MONOSPACE DIGIT ONE
U+1D7F8 MATHEMATICAL MONOSPACE DIGIT TWO
U+1D7F9 MATHEMATICAL MONOSPACE DIGIT THREE
U+1D7FA MATHEMATICAL MONOSPACE DIGIT FOUR
U+1D7FB MATHEMATICAL MONOSPACE DIGIT FIVE
U+1D7FC MATHEMATICAL MONOSPACE DIGIT SIX
U+1D7FD MATHEMATICAL MONOSPACE DIGIT SEVEN
U+1D7FE MATHEMATICAL MONOSPACE DIGIT EIGHT
U+1D7FF MATHEMATICAL MONOSPACE DIGIT NINE
Imagine you have a date written in mathematical monospace digits, like this:
$date = "\x{1D7F7}\x{1D7FD}/\x{1D7F7}\x{1D7F6}/\x{1D7F8}\x{1D7F6}\x{1D7F7}\x{1D7F6}";
The Perl code works just fine on that:
Original is: 𝟷𝟽/𝟷𝟶/𝟸𝟶𝟷𝟶
First method: 𝟸𝟶𝟷𝟶-𝟷𝟶-𝟷𝟽
Second method: 𝟸𝟶𝟷𝟶-𝟷𝟶-𝟷𝟽
Third method: 𝟸𝟶𝟷𝟶-𝟷𝟶-𝟷𝟽
Fourth method: 𝟸𝟶𝟷𝟶-𝟷𝟶-𝟷𝟽
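For comparison, in Python 3 the `\d` class of the standard re module also matches Unicode decimal digits when the pattern is a str, so a plain substitution handles this input as well. The sketch below additionally folds the digits to ASCII with `unicodedata.decimal`, a step the Perl output above does not perform. The helper names `reformat_dates` and `to_ascii_digits` are assumptions for illustration.

```python
import re
import unicodedata

DATE_RE = re.compile(r"\b(\d\d)/(\d\d)/(\d{4})\b")


def reformat_dates(text):
    """Rewrite dd/mm/yyyy as yyyy-mm-dd, keeping whatever digit script was used."""
    return DATE_RE.sub(r"\3-\2-\1", text)


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch for ch in text
    )


# The same string as $date above, in MATHEMATICAL MONOSPACE digits.
date = ("\U0001D7F7\U0001D7FD/\U0001D7F7\U0001D7F6/"
        "\U0001D7F8\U0001D7F6\U0001D7F7\U0001D7F6")
converted = reformat_dates(date)
print(converted)
print(to_ascii_digits(converted))
```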
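I think you will find that Python has a rather poor Unicode model: its lack of support for abstract characters and strings makes code like this very hard to write.
It is also hard to write legible regular expressions in Python, because you cannot decouple the declaration of subexpressions from their execution: (?(DEFINE)...) blocks are not supported there. In fact, Python's standard re module does not even support Unicode properties such as \pP. For that reason, Python is just not suitable for Unicode regex work.
But if you think Python is bad compared with Perl (and it certainly is), just try any other language. I have yet to find one that isn't even worse for this kind of work.
As you can see, you run into real problems when you ask for regex solutions in multiple languages. First, the solutions are hard to compare because of the different regex flavors. Second, no other language can match Perl for power, expressiveness, and maintainability in its regular expressions. That gap becomes even more obvious once arbitrary Unicode enters the picture.
So if you only wanted a Python solution, you should have asked for only that. Otherwise it is a terribly unfair contest that Python will nearly always lose: it is simply too messy to get things like this correct in Python, let alone both correct and clean. That is asking more of Python than it can deliver.
In contrast, Perl's regexes excel at both.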
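To make the point about (?(DEFINE)...) concrete: with Python's standard re module, the closest one can usually get is to keep the named fragments in ordinary Python data and splice them into a verbose pattern before compiling. The following is only a sketch of that workaround, not an equivalent of Perl's grammar-style regexes; `PARTS`, `DATE_RE`, and `reformat_dates` are hypothetical names.

```python
import re

# Named fragments standing in for the (?(DEFINE)...) block.
PARTS = {
    "break": r"\b",
    "slash": r"/",
    "day": r"\d{2}",
    "month": r"\d{2}",
    "year": r"\d{4}",
}

# The fragments are substituted textually; re.VERBOSE allows the layout below.
DATE_RE = re.compile(
    r"""
    {break}
            (?P<DAY>   {day}   )
    {slash} (?P<MONTH> {month} )
    {slash} (?P<YEAR>  {year}  )
    {break}
    """.format(**PARTS),
    re.VERBOSE,
)


def reformat_dates(text):
    """Rewrite dd/mm/yyyy as yyyy-mm-dd using the named groups."""
    return DATE_RE.sub(r"\g<YEAR>-\g<MONTH>-\g<DAY>", text)


print(reformat_dates("this 17/01/2010 and that 17/01/2010 there"))
```

The fragments here are plain string substitution, so unlike (?&name) calls they cannot be recursive and are expanded once at compile time.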
>>> from datetime import datetime
>>> datetime.strptime('02/11/2010', '%d/%m/%Y').strftime('%Y-%m-%d')
'2010-11-02'
or, a more hackish way (which does not check the validity of the values):
>>> '-'.join('02/11/2010'.split('/')[::-1])
'2010-11-02'
>>> '-'.join(reversed('02/11/2010'.split('/')))
'2010-11-02'