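Medical information extraction using Python
I am a nurse and I know Python, but I am not an expert; I only use it to process DNA sequences.
We have hospital records written in human language, and I need to put this data into a database or a CSV file. There are more than 5000 lines, so doing it by hand is a lot of work. All the data is written in a consistent format. Here is an example: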
11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later
I should get the following data:
Sex: Male
Symptoms: Nausea
Vomiting
Death: True
Death Time: 11/11/2010 - 01:00pm
Another example:
11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room
and I should get:
Sex: Female
Symptoms: Heart burn
Vomiting of blood
Death: True
Death Time: 11/11/2010 - 10:00am
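The order is not always the same. When I write "in", it is a keyword, and the text after it is a place, up to the next keyword.
At the start, "He" or "She" determines the sex. Then come the symptoms, separated by a delimiter that may be a comma, a hyphen, and so on, but is consistent within a single line.
For "died ... hours later", the number of hours also has to be recorded. Sometimes the patient is still alive, or has been discharged, etc.
In other words, we have a lot of conventions, and I think the job can be done by splitting the text on keywords and patterns. So if you know of a useful function, module, tutorial, or tool for this, preferably in Python (if not Python, a GUI tool would be nice too), please let me know.
Some additional information:
There are a lot of rules to express various medical data, but here are a few examples:
- Each record starts with the same date/time format, followed by a space, a colon, a space, He/She, a space, and then rules separated by "and"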
- Rules:
* got <symptoms>,<symptoms>,....
* investigations were done <investigation>,<investigation>,<investigation>,......
* received <drug or procedure>,<drug or procedure>,.....
* discharged <digit> (hour|hours) later
* kept under observation
* died <digit> (hour|hours) later
* died <digit> (hour|hours) later in <place>
other rules do exist but they follow the same idea
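The rules listed above could be turned into a small table of regular expressions, one per clause type. This is only a minimal sketch and has not been run against the real records. The names `RULES`, `RECORD`, and `parse_record` are made up for illustration, and it assumes day/month/year dates, clauses joined by " and ", and that no symptom or drug name itself contains " and ".

```python
import re
from datetime import datetime, timedelta

# Hypothetical rule table: one pattern per clause type from the list above.
RULES = [
    ("symptoms", re.compile(r"^got (?P<items>.+)$")),
    ("investigations", re.compile(r"^investigations were done (?P<items>.+)$")),
    ("treatments", re.compile(r"^received (?P<items>.+)$")),
    ("discharged", re.compile(r"^discharged (?P<hours>\d+) hours? later$")),
    ("observation", re.compile(r"^kept under observation$")),
    ("died", re.compile(r"^died (?P<hours>\d+) hours? later(?: in (?P<place>.+))?$")),
]

# "<date> - <time> : He|She <rules joined by ' and '>"
RECORD = re.compile(r"^(?P<when>\S+ - \S+) : (?P<who>He|She) (?P<rest>.+)$")


def parse_record(line):
    """Parse one record into a dict; raise ValueError on anything unexpected."""
    m = RECORD.match(line.strip())
    if m is None:
        raise ValueError("unrecognised record: %r" % line)
    # Assumption: dates are written day/month/year.
    when = datetime.strptime(m.group("when"), "%d/%m/%Y - %I:%M%p")
    result = {"record_time": when,
              "sex": "Male" if m.group("who") == "He" else "Female"}
    for clause in m.group("rest").split(" and "):
        clause = clause.strip()
        for name, pattern in RULES:
            cm = pattern.match(clause)
            if cm is None:
                continue
            groups = cm.groupdict()
            if "items" in groups:
                # Use a comma if the line has one, otherwise fall back to a hyphen.
                # A lone hyphenated item would be split by this fallback.
                sep = "," if "," in groups["items"] else "-"
                result[name] = [s.strip() for s in groups["items"].split(sep) if s.strip()]
            elif name in ("died", "discharged"):
                result[name] = True
                result[name + "_time"] = when + timedelta(hours=int(groups["hours"]))
                if groups.get("place"):
                    result["place"] = groups["place"]
            else:
                result[name] = True
            break
        else:
            # No rule matched: fail loudly so the record can be checked by hand.
            raise ValueError("no rule matches clause: %r" % clause)
    return result


print(parse_record("11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later"))
```

Raising on any clause that no rule matches is a deliberate choice here: with medical records it seems safer to stop and inspect an unknown line than to drop part of it silently. New rules would be added as extra rows in the table.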
4 Answers
Maybe this can help you too. It is not tested, though.
import collections
import datetime
import re

retrieved_data = []
Data = collections.namedtuple('Patient', 'Sex, Symptoms, Death, Death_Time')

with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Start every record from a fresh dictionary.
        dict_data = {'Death': False,
                     'Death_Time': '',
                     'Sex': '',
                     'Symptoms': ''}
        date, text = line.split(' : ', 1)
        if 'died' in text:
            dict_data['Death'] = True
            # Assumes day/month/year dates such as '11/11/2010 - 09:00am'.
            dict_data['Death_Time'] = datetime.datetime.strptime(
                date, '%d/%m/%Y - %I:%M%p')
            # Add the number of hours that follows "died".
            hours = re.findall(r'died\s+(\d+)', text)
            if hours:
                dict_data['Death_Time'] += datetime.timedelta(hours=int(hours[0]))
        # Every record starts with "He" or "She".
        if text.lower().startswith('she'):
            dict_data['Sex'] = 'Female'
        else:
            dict_data['Sex'] = 'Male'
        # Symptoms sit between "got" and the first " and ".
        start = text.index('got') + len('got')
        end = text.find(' and ', start)
        if end == -1:
            end = len(text)
        symptoms = [s.strip() for s in text[start:end].split(',')]
        dict_data['Symptoms'] = '\n'.join(symptoms)
        retrieved_data.append(Data(**dict_data))
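Here are some possible ways to solve this:
- Regular expressions - Define them according to the patterns in your text. Match the expressions, extract what you need, and repeat for every record. This requires a good understanding of the data format and, of course, some knowledge of regular expressions :)
- String manipulation - This approach is relatively simpler. Again, you need a good understanding of the data format. This is what I have done below.
- Machine learning - You could define all your rules and train a model on them. The model then tries to extract the data using the rules you provided. This is more generic than the first two options, but it is also the hardest to implement.
See if these approaches work for you; they may need some adjustments.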
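import datetime
import re

with open('your_csv_file') as records, open('parsed_file', 'w') as new_file:
    for rec in records:
        rec = rec.strip()
        if not rec:
            continue  # skip blank lines
        date, reason = rec.split(' : ', 1)
        if reason.startswith('He '):
            sex = 'Male'
        else:
            sex = 'Female'
        # Symptoms sit between "got " and the first " and ".
        symptoms = reason.split(' and ')[0].split(' got ')[1]
        symptoms = [i.strip() for i in symptoms.split(',')]
        symptoms = '\n'.join(symptoms)
        died = 'False'
        death_time = ''
        if 'died' in reason:
            died = 'True'
            match = re.search(r'died (\d+) hours? later', reason)
            if match:
                # Death time = record time + stated hours (day/month/year assumed).
                fmt = '%d/%m/%Y - %I:%M%p'
                moment = (datetime.datetime.strptime(date, fmt)
                          + datetime.timedelta(hours=int(match.group(1))))
                death_time = moment.strftime(fmt).lower()
        new_file.write("Sex: %s\nSymptoms: %s\nDeath: %s\nDeath Time: %s\n\n"
                       % (sex, symptoms, died, death_time))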
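This assumes that each input record is separated by a newline (\n). Since you did not say how one patient's record should be separated from the next in the output, each one is written followed by two newlines (\n\n).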
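Since the question asks for a CSV file or a database rather than a text report, the same fields could also be written with the standard `csv` module. A minimal sketch follows. The `records` list is a stand-in for whatever the parsing step produces (the values are taken from the two examples in the question), and `FIELDS` and `write_csv` are illustrative names, not part of the answer above.

```python
import csv
import io

# Stand-in for parsed records; values come from the question's two examples.
records = [
    {"Sex": "Male", "Symptoms": ["Nausea", "Vomiting"],
     "Death": True, "Death Time": "11/11/2010 - 01:00pm"},
    {"Sex": "Female", "Symptoms": ["Heart burn", "Vomiting of blood"],
     "Death": True, "Death Time": "11/11/2010 - 10:00am"},
]

FIELDS = ["Sex", "Symptoms", "Death", "Death Time"]


def write_csv(rows, out):
    """Write one CSV row per patient record to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        # Put all symptoms in one cell; the "; " separator is an arbitrary choice.
        flat["Symptoms"] = "; ".join(row["Symptoms"])
        writer.writerow(flat)


# StringIO stands in for open("parsed.csv", "w", newline="") in this sketch.
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

Keeping one row per record makes the file easy to load into a spreadsheet or a database table later. If the symptoms need to be queried individually, a separate table or file with one row per symptom would probably be a better fit.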
Later: @Nurse what did you end up doing? Just curious.
This one uses dateutil to parse the date (e.g. '11/11/2010 - 09:00am'), and parsedatetime to parse the relative time (e.g. '4 hours later'):
import datetime
import pprint
import re
import sys
import time

import dateutil.parser as dparser
import parsedatetime as pdt

# Current parsedatetime releases expose Calendar at the package top level.
pdt_parser = pdt.Calendar()
record_time_pat = re.compile(r'^(.+)\s+:')
sex_pat = re.compile(r'\b(he|she)\b', re.IGNORECASE)
death_time_pat = re.compile(r'died\s+(.+hours? later).*$', re.IGNORECASE)
symptom_pat = re.compile(r'[,-]')

def parse_record(astr):
    # Record time: everything before the " :" separator.
    match = record_time_pat.match(astr)
    if match:
        record_time = dparser.parse(match.group(1))
        astr, _ = record_time_pat.subn('', astr, 1)
    else:
        sys.exit('Cannot find record time')
    # Sex: the first standalone "he" or "she".
    match = sex_pat.search(astr)
    if match:
        sex = match.group(1)
        sex = 'Female' if sex.lower().startswith('s') else 'Male'
        astr, _ = sex_pat.subn('', astr, 1)
    else:
        sys.exit('Cannot find sex')
    # Death time: a relative phrase resolved against the record time.
    match = death_time_pat.search(astr)
    if match:
        death_time, date_type = pdt_parser.parse(match.group(1),
                                                 record_time.timetuple())
        if date_type == 2:
            death_time = datetime.datetime.fromtimestamp(
                time.mktime(death_time))
        astr, _ = death_time_pat.subn('', astr, 1)
        is_dead = True
    else:
        death_time = None
        is_dead = False
    # Remove the word "and" only, not "and" inside longer words.
    astr = re.sub(r'\band\b', '', astr)
    symptoms = [s.strip() for s in symptom_pat.split(astr)]
    return {'Record Time': record_time,
            'Sex': sex,
            'Death Time': death_time,
            'Symptoms': symptoms,
            'Death': is_dead}

if __name__ == '__main__':
    tests = [('11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later',
              {'Sex': 'Male',
               'Symptoms': ['got nausea', 'vomiting'],
               'Death': True,
               'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
               'Record Time': datetime.datetime(2010, 11, 11, 9, 0)}),
             ('11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room',
              {'Sex': 'Female',
               'Symptoms': ['got heart burn', 'vomiting of blood'],
               'Death': True,
               'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
               'Record Time': datetime.datetime(2010, 11, 11, 9, 0)})
             ]
    for record, answer in tests:
        result = parse_record(record)
        pprint.pprint(result)
        assert result == answer
        print()
This yields:
{'Death': True,
'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
'Sex': 'Male',
'Symptoms': ['got nausea', 'vomiting']}
{'Death': True,
'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
'Sex': 'Female',
'Symptoms': ['got heart burn', 'vomiting of blood']}
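Note: Be careful when parsing dates. Does '8/9/2010' mean August 9th or September 8th? Do all the record keepers use the same convention? If you choose to use dateutil (which I think is the best option when the date string is not rigidly structured), be sure to read the section on "Format precedence" in the dateutil documentation so you can (hopefully) resolve '8/9/2010' properly. If you cannot guarantee that all the record keepers use the same date convention, the results of this script will have to be checked manually. That might be wise in any case.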
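To make the ambiguity concrete, here is a small sketch using only the standard library's `datetime.strptime` with an explicit format, rather than dateutil's guessing. The helper name `parse_day_first` is made up for illustration, and it assumes the records are known to be day/month/year. With dateutil itself, `parser.parse` accepts a `dayfirst` keyword for the same purpose.

```python
from datetime import datetime

ambiguous = "8/9/2010"

# The same string gives two different dates depending on the assumed order.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y")    # 8 September 2010
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")  # 9 August 2010


def parse_day_first(text):
    """Parse 'dd/mm/yyyy - hh:mmam' timestamps; raise ValueError otherwise."""
    return datetime.strptime(text, "%d/%m/%Y - %I:%M%p")


print(day_first.date(), month_first.date())
print(parse_day_first("11/11/2010 - 09:00am"))
```

An explicit format fails with `ValueError` on anything that does not match, which at least flags records written in a different layout. It cannot detect a record keeper who silently swapped day and month, so the manual check suggested above still applies.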