Merging and concatenating

   name  something else1  something else2
0   nm1              sm1             lol1
1   nm2              sm2             lol2
2   nm3              sm3             lol3
3   nm4              sm4             lol4
4   nm5              sm5             lol5
5   nm6              sm6             lol6

    name  something else1  something else2
 0   nm1              sm1              lol
 1   nm2              sm2             lol2
 2   nm3              sm3             lol3
 3   nm4              sm4             lol4
 4   nm5              sm5             lol5
 5   nm6              sm6             lol6
 7   nm7              bla              bla
 8   nm8              bla              bla
 9   nm9              bla              bla
10  nm10              bla              bla
11  nm11              bla              bla
12  nm12              bla              bla

   name  year  something else1  something else2
0   nm1  2014              sm1             lol1
1   nm2  2014              sm2             lol2
2   nm3  2014              sm3             lol3
3   nm4  2014              sm4             lol4
4   nm5  2014              sm5             lol5
5   nm6  2014              sm6             lol6

# spatial_paths = list of all names of files
# (first element is spatial_search_intensity//2004_spatial_diabetic ketoacidosis.csv)
df5 = pd.read_csv("directory in google drive" + str(spatial_paths[0]))
df5 = df5.set_index("Name")
df5
for s_path in spatial_paths:
    variable_name = re.findall(r"\d{4}_spatial_(.+).csv", s_path)
    year = re.findall(r"(\d{4})_spatial_.+\.csv", s_path)
    df_new = pd.read_csv("directory in google drive" + str(s_path))
    df_new = df_new.set_index("Name")
    df5 = pd.merge(df5, df_new, left_index=True, right_index=True)
df5

3 answers

User

#1 · edited 2024-06-08 03:18:51

I would use a combination of pathlib and pandas.

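import re
from pathlib import Path

import pandas as pd

p = Path('files')

# Map each year (the first four-digit run in the file name) to its DataFrame,
# tagging every row with that year in a `src` column.
dfs = {}
for f in p.glob('*.csv'):
    year = int(re.search(r'\d{4}', f.stem)[0])
    dfs[year] = pd.read_csv(f).assign(src=year)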

Now you can access each DataFrame by its year:

dfs[2000]

    A   src
0   1   2000
1   2   2000
2   3   2000
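
If a single table is wanted rather than one frame per year, the dictionary can be stacked with pd.concat. This is only a minimal sketch, assuming all frames share the same columns; the small in-memory frames below are hypothetical stand-ins for the CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the per-year frames read from the CSV files.
dfs = {
    2000: pd.DataFrame({"A": [1, 2, 3]}).assign(src=2000),
    2001: pd.DataFrame({"A": [4, 5, 6]}).assign(src=2001),
}

# Stack the frames vertically in year order; ignore_index=True
# replaces the repeated 0..2 indexes with one fresh 0..n-1 index.
combined = pd.concat([dfs[year] for year in sorted(dfs)], ignore_index=True)
print(combined)
```

The src column keeps track of which year each row came from after the frames are stacked.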

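User

#2 · edited 2024-06-08 03:18:51

I think I know what you want, but if you have to adjust it, maybe this will give you some ideas. I created 6 files, 2 per year, with similar column names but different data. Comments are inline.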

import glob
import re

import pandas as pd

# first get a list of all files that will need to be read
all_files = glob.glob("*_file*.csv")

# find all years for those files (in my case three years)
years = sorted(set(re.findall(r"(\d{4})_", ', '.join(all_files))))

# iterate over years, concatenating each file to the right (instead of merging,
# assuming the rows and names are equivalent), then dedup column names and add year
dfyearslist = []
for year in years:
    # get the year's files in question
    yearly_files = glob.glob(year + "_file*.csv")
    # print(yearly_files)
    dflist = []
    for f in yearly_files:
        dft = pd.read_csv(f, sep=',')
        dflist.append(dft)
    df = pd.concat(dflist, axis=1)  # axis=1, horizontal
    df = df.loc[:, ~df.columns.duplicated()]
    df['year'] = year
    dfyearslist.append(df)
df_final = pd.concat(dfyearslist)  # defaults to axis=0, vertical
print(df_final)

Output

My columns are named cola and colb in each of the three years.

   name cola  colb  year
0   nm1  sm1  lol1  2014
1   nm2  sm2  lol2  2014
2   nm3  sm3  lol3  2014
3   nm4  sm4  lol4  2014
4   nm5  sm5  lol5  2014
5   nm6  sm6  lol6  2014
0  nm10  sm1  lol1  2015
1  nm11  sm2  lol2  2015
2  nm12  sm3  lol3  2015
3  nm13  sm4  lol4  2015
4  nm14  sm5  lol5  2015
5  nm15  sm6  lol6  2015
0  nm20  sm1  lol1  2016
1  nm21  sm2  lol2  2016
2  nm22  sm3  lol3  2016
3  nm23  sm4  lol4  2016
4  nm24  sm5  lol5  2016
5  nm25  sm6  lol6  2016
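
Note that pd.concat with axis=1 aligns rows by index, so the approach above relies on every file for a given year listing the names in the same order. If that is not guaranteed, merging on the key column is one alternative. This is a sketch only, assuming the key column is called name as in the output above; the two frames and the helper combine_year are made up for illustration.

```python
from functools import reduce

import pandas as pd


def combine_year(frames, year, key="name"):
    """Merge one year's frames on `key` and tag the result with the year."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        frames,
    )
    merged["year"] = year
    return merged


# Hypothetical data: two files for 2014 whose rows are in different orders.
a = pd.DataFrame({"name": ["nm1", "nm2", "nm3"], "cola": ["sm1", "sm2", "sm3"]})
b = pd.DataFrame({"name": ["nm3", "nm1", "nm2"], "colb": ["lol3", "lol1", "lol2"]})

df_2014 = combine_year([a, b], 2014)
print(df_2014)
```

The outer join keeps names that appear in only one of the files, leaving NaN in the columns they are missing from; how="inner" would drop them instead.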

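User

#3 · edited 2024-06-08 03:18:51

This sounds like a Power Query task: load all the CSVs and choose "Transform". You can add the year to each of them (add a custom column), then combine all the files in a new query and apply your logic.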
