合并两个没有公共列的数据帧

2024-05-14 21:18:00 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在将列“state”添加到与其他数据帧不共享公共列的现有数据帧中。因此,我需要将zipcodes转换成状态(例如,00704将是PR)以加载到具有新列状态的dataframe中。在

reviewers = pd.read_csv('reviewers.txt', 
                        sep='|',
                        header=None,
                        names=['user id','age','gender','occupation','zipcode'])
reviewers['state'] = ""

  user id  age gender       occupation    zipcode    state
0          1   24      M     technician   85711      
1          2   53      F          other   94043      


zipcodes = pd.read_csv('zipcodes.txt',
                  usecols = [1,4],
                  converters={'Zipcode':str})
      Zipcode State
0       00704    PR
1       00704    PR
2       00704    PR
3       00704    PR
4       00704    PR


zipcodes1 = zipcodes.set_index('Zipcode') ###Setting the index to zipcode
dfzip = zipcodes1
print(dfzip)


        State
Zipcode      
00704      PR
00704      PR
00704      PR



zips = (pd.Series(dfzip.values.tolist(), index = zipcodes1['State'].index))

states = []
for zipcode in reviewers['Zipcode']:
    if re.search('[a-zA-Z]+', zipcode):
        append.states['canada']
    elif zipcode in zips.index:
        append.states(zips['zipcode'])
    else:
        append.states('unkown')

我也不确定我的循环是否正确。我必须按美国邮政编码(数字)、加拿大邮政编码(字母顺序)和我们定义的其他邮政编码(未知)对邮政编码进行排序。如果你需要数据文件,请告诉我。在


Tags: indexprpdstate邮政编码zipcodestateszips
2条回答

你的循环需要修复:

states = []
for zipcode in reviewers['Zipcode']:
    if re.match(r'\w+', zipcode):
        states.extend('Canada')
    elif zipcode in zips.index:
        states.extend(zips[zipcode])
    else:
        states.extend('Unknown')

另外,假设您希望将states列表重新插入到dataframe中。在这种情况下,你不需要for循环。您可以在dataframe上使用pandas apply来获取新列:

^{pr2}$

使用:

#remove duplicates and create Series for mapping
zips = zipcodes.drop_duplicates().set_index('Zipcode')['State']

#get mask for canada zip codes
#if possible small letters change to [a-zA-Z]+
mask = reviewers['zipcode'].str.match('[A-Z]+') 
#new column by mask
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
#NaNs are replaced 
reviewers['state'] = reviewers['state'].fillna('unknown')

具有apply的循环版本:

^{pr2}$

计时

In [56]: %%timeit
    ...: mask = reviewers['zipcode'].str.match('[A-Z]+') 
    ...: reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
    ...: reviewers['state'] = reviewers['state'].fillna('unknown')
    ...: 
100 loops, best of 3: 2.08 ms per loop

In [57]: %%timeit
    ...: reviewers['State1'] = reviewers['zipcode'].apply(f)
    ...: 
100 loops, best of 3: 17 ms per loop

相关问题 更多 >

    热门问题