How can I extract dates in dd/mm/yyyy format from unstructured strings?

Posted 2024-04-20 09:43:47


I have several strings like the following:

'Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;'

Expected result: October 2017

'January 7;30;39;24;46;1750;April 2017;April 30;February;'

Expected result: April 2017

'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;'

Expected result: mid-October

I know the strings are completely unstructured, but is there a way to extract the date with Python code?

This is part of an NER model that I am using to extract data entities.

I have tried a few approaches, but none of them came close to the expected results because the strings follow no consistent pattern.


1 answer
User
#1 · Posted 2024-04-20 09:43:47

You can use datefinder together with a regular expression that checks each datetime string it finds for a month name:

import re

import datefinder  # third-party package: pip install datefinder

strs = ['Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;',
        'January 7;30;39;24;46;1750;April 2017;April 30;February;',
        'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;']

# Matches full and abbreviated English month names (case-insensitive).
month_rx = re.compile(r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)', re.I)
for s in strs:
    # source=True makes find_dates yield (datetime, matched_text) pairs.
    raw_dates = list(datefinder.find_dates(s, source=True))
    # Keep only the matched substrings that contain a month name.
    print([text for _, text in raw_dates if month_rx.search(text)])

Output:

['October 2017', 'March 2018', 'Jan. 4', 'Dec. 21']
['January 7', 'April 2017', 'April 30']
[]

Note that mid-October cannot be converted to a valid datetime, so it is not extracted. For cases like this you need a more specific regular expression, such as re.search(r'\b(?:half|mid)-(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?)', text).
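As a minimal sketch of that fallback, the snippet below wraps the same pattern in a helper that uses only the standard library. The names MONTH, mid_month_rx and find_mid_month are hypothetical, chosen for illustration, and the trailing \b is an added assumption so that a partial word does not match.

```python
import re

# Month-name alternation copied from the answer; MONTH is a hypothetical name.
MONTH = (r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|'
         r'J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|'
         r'Oct(?:ober)?|Sep(?:tember)?)')

# "half-" or "mid-" followed by a month name. The trailing \b is an
# assumption added here to avoid matching inside a longer word.
mid_month_rx = re.compile(r'\b(?:half|mid)-' + MONTH + r'\b', re.I)


def find_mid_month(text):
    """Return the first 'mid-<month>' style expression in text, or None."""
    match = mid_month_rx.search(text)
    return match.group(0) if match else None


s = 'Thursday;a day;another six days;the day;Tuesday;three days;mid-October;Wednesday;'
print(find_mid_month(s))
```

One way to combine the two steps would be to call a helper like this only when the datefinder pass returns an empty list, as it does for the third string.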

The group (?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|Oct(?:ober)?|Sep(?:tember)?) matches both full and abbreviated English month names.
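The expected results in the question for the first two strings are the first "month plus four-digit year" field. The sketch below is one possible way to narrow the candidates down to that form with the same month pattern. It is not part of the original answer: it assumes the fields are always separated by semicolons, and MONTH, month_year_rx and first_month_year are hypothetical names.

```python
import re

# Month-name alternation copied from the answer; MONTH is a hypothetical name.
MONTH = (r'(?:A(?:pr(?:il)?|ug(?:ust)?)|Dec(?:ember)?|Feb(?:ruary)?|'
         r'J(?:an(?:uary)?|u(?:ly|ne|[ln]))|Ma(?:rch|[ry])|Nov(?:ember)?|'
         r'Oct(?:ober)?|Sep(?:tember)?)')

# A month name, an optional abbreviation dot, whitespace, then a 4-digit year.
month_year_rx = re.compile(MONTH + r'\.?\s+\d{4}', re.I)


def first_month_year(s):
    """Return the first ';'-separated field that is exactly 'Month YYYY'."""
    for field in s.split(';'):  # assumption: fields are ';'-delimited
        field = field.strip()
        if month_year_rx.fullmatch(field):
            return field
    return None


strs = ['Thursday;60 days;Monday, days;the last two years;the six months;October 2017;March 2018;three days;Jan. 4;Last year;Dec. 21;',
        'January 7;30;39;24;46;1750;April 2017;April 30;February;']
for s in strs:
    print(first_month_year(s))
```

Using fullmatch on each field rejects entries such as 'January 7' and '1750', which contain only part of the month-and-year form.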
