Python: getting data from a text file and putting it into a DataFrame

All Donor Types Deceased Donor Living Donor
All Donor States of Residence To Date 360,673 205,858 154,815
2018 7,107 4,394 2,713
2017 16,478 10,286 6,192
2016 15,944 9,971 5,973
2015 15,071 9,079 5,992
2014 14,415 8,596 5,819
Data subject to change based on future data submission or correction.
Donor : Donor Type by Donor State of Residence, Donation Year Page 2 of 70 Donors Recovered : January 1, 1988 - May 31, 2018 For Format = Landscape Based on OPTN data as of July 4, 2018
All Donor Types Deceased Donor Living Donor
1993 7,766 4,861 2,905
1992 7,091 4,520 2,571
Alabama To Date 5,926 3,471 2,455
2018 95 65 30
2017 259 172 87
2016 249 175 74
Alaska To Date 935 565 370
2018 14 9 5
2017 42 32 10
2016 30 22 8
Data subject to change based on future data submission or correction.
Donor : Donor Type by Donor State of Residence, Donation Year Page 70 of 70 Donors Recovered : January 1, 1988 - May 31, 2018 For Format = Landscape Based on OPTN data as of July 4, 2018
All Donor Types Deceased Donor Living Donor
1989 16 12 4
1988 16 11 5

state year all deceased living
Alabama 2018 95 65 30
Alabama 2017 259 172 87
Alabama 2016 249 175 74
Alaska 2018 14 9 5
Alaska 2017 42 32 10
Alaska 2016 30 22 8
Alaska 1989 16 12 4
Alaska 1988 16 11 5

import pandas as pd

fname = "optn.txt"
fh = open(fname)
count = 0
state = ['Alabama', 'Alaska', 'Arizona', 'Arkansas',
         'California', 'Colorado', 'Connecticut', 'Delaware',
         'District of Columbia', 'Florida']
year = ['2018', '2017', '2016', '2015', '2014', '2013', '2012',
        '2011', '2010', '2009', '2008', '2007', '2006', '2005', '2004',
        '2003', '2002', '2001', '2000', '1999', '1998', '1997', '1996',
        '1995', '1994', '1993', '1992', '1991', '1990', '1989', '1988']
optny = list()
for line in fh:
    line = line.strip()
    #print(line)
    # keep only the lines that start with a year
    if not line.startswith(tuple(year)): continue
    optny.append(line)
    #break
print(optny)

1 Answer

User

#1 · Posted 2024-05-17 00:42:53

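This looks like a task where regular expressions are useful.

Note that my current solution requires the text to have the same format as in the example.

First, identify the unnecessary strings: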

import re

clean_pattern = re.compile(
    r"(^[A-Z].+)|All Donor Types\s+Deceased Donor\s+Living Donor",
    re.MULTILINE
)

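This pattern matches lines that start with an uppercase letter, such as "Data subject to change…", but ignores lines that start with whitespace followed by other characters. The second alternative also matches the "All Donor Types…" line.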
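To see what the cleaning step keeps, here is a minimal sketch run on a short excerpt. The leading spaces on the table lines are an assumption about the original file layout (the pattern relies on them), and the version below uses \s+ between the column titles so that the exact spacing does not matter.

```python
import re

# Cleaning pattern: whole lines starting with an uppercase letter, or the
# column-title line. \s+ between the titles makes the spacing irrelevant.
clean_pattern = re.compile(
    r"(^[A-Z].+)|All Donor Types\s+Deceased Donor\s+Living Donor",
    re.MULTILINE,
)

# Excerpt built from the question's data. ASSUMPTION: table lines are
# indented in the real file, footer lines are not.
sample = (
    "Data subject to change based on future data submission or correction.\n"
    "  All Donor Types  Deceased Donor  Living Donor\n"
    "  Alabama  To Date  5,926  3,471  2,455\n"
    "  2018  95  65  30\n"
)

cleaned = clean_pattern.sub("", sample)

# Drop the lines that are now empty, only to make the result easy to read.
kept_lines = [line.strip() for line in cleaned.splitlines() if line.strip()]
print(kept_lines)
```

Only the state line and the year row survive. If the state lines in the real file are not indented, the first alternative would remove them as well, so the indentation is worth checking first.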

Next, find the states and the table contents with the following pattern:

state_pattern = re.compile(
    r"^\s+(?P<state>[a-zA-Z]+)\s+To Date[0-9, ]+\n(?P<content>[0-9, \n]+)$", 
    re.MULTILINE
)

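For now, I assume that a state name consists of a single word, that it is the first word on its line, and that it is followed by "To Date". Also, because the text has been cleaned beforehand, the content after a state line should contain nothing but digits, commas and whitespace. The next word then starts a different state/content entry.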
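The single-word assumption matters for names such as "District of Columbia", which appears in the question's state list. The sketch below compares the pattern from the answer with a hypothetical variant, multiword_state_pattern, that also accepts several space-separated words. The variant is an illustration, not part of the original answer, and the sample layout (indented lines) is assumed.

```python
import re

# Pattern from the answer: a single-word state name followed by "To Date".
state_pattern = re.compile(
    r"^\s+(?P<state>[a-zA-Z]+)\s+To Date[0-9, ]+\n(?P<content>[0-9, \n]+)$",
    re.MULTILINE,
)

# HYPOTHETICAL variant: one or more words separated by spaces. The lazy
# quantifier stops the name before "To Date" instead of swallowing "To".
multiword_state_pattern = re.compile(
    r"^\s+(?P<state>[a-zA-Z]+(?: +[a-zA-Z]+)*?)\s+To Date[0-9, ]+\n"
    r"(?P<content>[0-9, \n]+)$",
    re.MULTILINE,
)

# Assumed layout of already-cleaned text. The Alabama rows come from the
# question; the District of Columbia numbers are placeholders, not real data.
cleaned_sample = (
    "  Alabama  To Date  5,926  3,471  2,455\n"
    "  2018  95  65  30\n"
    "  2017  259  172  87\n"
    "  District of Columbia  To Date  100  60  40\n"
    "  2018  10  6  4\n"
)

single = [m.group("state") for m in state_pattern.finditer(cleaned_sample)]
multi = [m.group("state") for m in multiword_state_pattern.finditer(cleaned_sample)]
print(single)
print(multi)
```

With the original pattern the multi-word entry is skipped silently, so its rows would be missing from the final table rather than raising an error.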

Finally, define the data pattern:

data_pattern = re.compile(
    r"(?P<year>[0-9]{4})\s+(?P<all>[0-9,]+)\s+(?P<deceased>[0-9,]+)\s+(?P<living>[0-9,]+)"
)
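As a quick check of the data pattern on its own, the sketch below applies it to two rows taken from the question's sample. The indentation and spacing of the rows are assumed; the named groups come back as strings, including the thousands separators.

```python
import re

data_pattern = re.compile(
    r"(?P<year>[0-9]{4})\s+(?P<all>[0-9,]+)\s+(?P<deceased>[0-9,]+)\s+(?P<living>[0-9,]+)"
)

# Two rows from the question's sample; the spacing is an assumption.
content = "  2015  15,071  9,079  5,992\n  2014  14,415  8,596  5,819\n"

# Each match yields one dict keyed by the named groups.
rows = [m.groupdict() for m in data_pattern.finditer(content)]
for row in rows:
    print(row)
```

The pattern is not anchored to line starts, so it depends on the content block holding only year rows, which is what the state pattern guarantees.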

With the patterns defined, the data can now be extracted (assuming the full text is stored in the variable text):

data = []
# remove the unwanted lines
cleaned_text = clean_pattern.sub('', text)
# iterate over state / content matches
for state_match in state_pattern.finditer(cleaned_text):
    info_dict = state_match.groupdict()
    # iterate over data matches
    for match in data_pattern.finditer(info_dict['content']):
        data_dict = match.groupdict()
        # add the state information to the data
        data_dict['state'] = info_dict['state']
        data.append(data_dict)

pd.DataFrame(data, columns=['state', 'year', 'all', 'deceased', 'living'])
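The extracted values are strings, and larger counts still carry thousands separators. If numeric columns are wanted, one option is to convert the rows before building the DataFrame. This is a minimal sketch, not part of the original answer: to_int and numeric_columns are hypothetical names, and the second sample row reuses the national 2015 totals from the question under a placeholder state name only to exercise the comma handling.

```python
def to_int(value):
    """Convert a string such as '15,071' to the integer 15071."""
    return int(value.replace(",", ""))


# Rows shaped like the dicts collected by the extraction loop.
data = [
    {"year": "2018", "all": "95", "deceased": "65", "living": "30",
     "state": "Alabama"},
    # Placeholder state name; numbers reused from the question's totals.
    {"year": "2015", "all": "15,071", "deceased": "9,079", "living": "5,992",
     "state": "Example"},
]

numeric_columns = ("year", "all", "deceased", "living")

# Convert only the numeric fields and leave the state name untouched.
numeric_data = [
    {key: to_int(val) if key in numeric_columns else val
     for key, val in row.items()}
    for row in data
]
print(numeric_data)
```

The converted list can be passed to pd.DataFrame with the same columns argument, which then gives integer columns instead of strings.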
