Pandas: joining two DataFrames over different time ranges

import pandas as pd

companies = pd.DataFrame({
    'CompanyName': ['A', 'B', 'C'],
    'EarningsDate': ['2013/01/15', '2015/03/25', '2017/05/03']
})
companies['EarningsDate'] = pd.to_datetime(companies.EarningsDate)

news = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NewsDate': ['2012/02/01', '2013/01/10', '2015/05/13',
                 '2012/05/23', '2013/01/03', '2017/05/01']
})
news['NewsDate'] = pd.to_datetime(news.NewsDate)

company_count = []
other_count = []

for _, company in companies.iterrows():
    end_date = company.EarningsDate
    start_date = end_date - pd.DateOffset(years=1)
    subset = news[(news.NewsDate > start_date) & (news.NewsDate < end_date)]
    mask = subset.CompanyName == company.CompanyName
    company_count.append(subset[mask].shape[0])
    other_count.append(subset[~mask].groupby('CompanyName').size().mean())

companies['12MonCompanyNewsCount'] = pd.Series(company_count)
companies['12MonOtherNewsCount'] = pd.Series(other_count).fillna(0)

  CompanyName EarningsDate  12MonCompanyNewsCount  12MonOtherNewsCount
0           A   2013-01-15                      2                    2
1           B   2015-03-25                      0                    0
2           C   2017-05-03                      1                    0

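2 Answers

User

Answer 1 · edited 2024-04-23 07:40:19

I couldn't find a way to avoid iterating over the rows of companies. However, you can add a start-date column to companies, iterate over its rows, and build boolean indexes on news for the date range and for the company name. Then just combine them with a boolean `and` and sum the resulting boolean array.

I promise it will make more sense once you see the code.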

# create the start date column and the 12 month columns,
# fill the 12 month columns with zeros for now
companies['startdate'] = companies.EarningsDate - pd.DateOffset(years=1)
companies['12MonCompanyNewsCount'] = 0
companies['12MonOtherNewsCount'] = 0

# iterate the rows of companies and hold the index
for i, row in companies.iterrows():
    # create a boolean index when the news date is after the start date
    # and when the news date is before the end date
    # and when the company names match
    ix_start = news.NewsDate >= row.startdate
    ix_end = news.NewsDate <= row.EarningsDate
    ix_samename = news.CompanyName == row.CompanyName
    # set the news count value for the current row of `companies` using
    # boolean `and` operations on the indices.  first when the names match
    # and again when the names don't match.
    companies.loc[i,'12MonCompanyNewsCount'] = (ix_start & ix_end & ix_samename).sum()
    companies.loc[i,'12MonOtherNewsCount'] = (ix_start & ix_end & ~ix_samename).sum()

companies
#returns:

  CompanyName EarningsDate  startdate  12MonCompanyNewsCount  \
0           A   2013-01-15 2012-01-15                      2
1           B   2015-03-25 2014-03-25                      0
2           C   2017-05-03 2016-05-03                      1

   12MonOtherNewsCount
0                    2
1                    0
2                    0
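For small inputs, the row loop can also be avoided by pairing every earnings row with every news row and summing boolean flags per company. The following is only a sketch, not the answerer's method. It assumes pandas 1.2 or later (for `how='cross'`) and exactly one earnings row per company; the names `pairs`, `flags`, and `result` are illustrative. Note that the cross join materializes `len(companies) * len(news)` rows, so it trades memory for the loop.

```python
import pandas as pd

companies = pd.DataFrame({
    'CompanyName': ['A', 'B', 'C'],
    'EarningsDate': pd.to_datetime(['2013/01/15', '2015/03/25', '2017/05/03']),
})
news = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NewsDate': pd.to_datetime(['2012/02/01', '2013/01/10', '2015/05/13',
                                '2012/05/23', '2013/01/03', '2017/05/01']),
})

# Pair every earnings row with every news row (assumes pandas >= 1.2).
pairs = companies.merge(news, how='cross', suffixes=('', '_news'))

# Same inclusive 12-month window as in the loop above.
start = pairs.EarningsDate - pd.DateOffset(years=1)
in_window = (pairs.NewsDate >= start) & (pairs.NewsDate <= pairs.EarningsDate)
same = pairs.CompanyName == pairs.CompanyName_news

# One 0/1 flag per pair; summing per company gives the counts.
# Grouping by CompanyName assumes one earnings row per company.
flags = pd.DataFrame({
    'CompanyName': pairs.CompanyName,
    '12MonCompanyNewsCount': (in_window & same).astype(int),
    '12MonOtherNewsCount': (in_window & ~same).astype(int),
})
result = companies.merge(
    flags.groupby('CompanyName', as_index=False).sum(),
    on='CompanyName',
)
print(result)
```

As in the loop, `12MonOtherNewsCount` here is the total number of in-window news items from other companies, not the per-company mean used in the question.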

User

Answer 2 · edited 2024-04-23 07:40:19

OK, here you go.

To get 12MonCompanyNewsCount, you can use merge_asof, which works really nicely here:

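# merge_asof requires both frames to be sorted on their date keys
matched = pd.merge_asof(
    news.sort_values('NewsDate'),
    companies.sort_values('EarningsDate'),
    by='CompanyName',
    left_on='NewsDate',
    right_on='EarningsDate',
    tolerance=pd.Timedelta('365D'),
    direction='forward'
)
# count only the news rows that found an earnings date within the tolerance,
# then align the per-company counts back onto companies by name
counts = matched.groupby('CompanyName').EarningsDate.count()
companies['12MonCompanyNewsCount'] = companies.CompanyName.map(counts).fillna(0).astype(int)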

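It runs about twice as fast as the current implementation (and should scale better).

For 12MonOtherNewsCount, I really couldn't come up with a way to do it without a loop. I think this version is a bit more concise, though: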

[code snippet not preserved]

It does seem to be a bit faster.
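The loop in the question defines 12MonOtherNewsCount as the mean number of in-window news items per other company, counting only companies that have at least one item in the window. One possible loop-free sketch of that definition is shown here. It is not the answerer's code; it assumes pandas 1.2 or later (for `how='cross'`) and one earnings row per company, and the names `pairs`, `others`, `per_other`, and `other_count` are illustrative.

```python
import pandas as pd

companies = pd.DataFrame({
    'CompanyName': ['A', 'B', 'C'],
    'EarningsDate': pd.to_datetime(['2013/01/15', '2015/03/25', '2017/05/03']),
})
news = pd.DataFrame({
    'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NewsDate': pd.to_datetime(['2012/02/01', '2013/01/10', '2015/05/13',
                                '2012/05/23', '2013/01/03', '2017/05/01']),
})

# Pair every earnings row with every news row (assumes pandas >= 1.2).
pairs = companies.merge(news, how='cross', suffixes=('', '_news'))

# Strict bounds, matching the loop in the question.
start = pairs.EarningsDate - pd.DateOffset(years=1)
in_window = (pairs.NewsDate > start) & (pairs.NewsDate < pairs.EarningsDate)

# Keep only in-window news that belongs to a different company.
others = pairs[in_window & (pairs.CompanyName != pairs.CompanyName_news)]

# News count per (earnings company, other company),
# then the mean over the other companies for each earnings company.
per_other = others.groupby(['CompanyName', 'CompanyName_news']).size()
mean_other = per_other.groupby(level='CompanyName').mean()

# Companies with no other-company news in the window get 0.
other_count = companies.CompanyName.map(mean_other).fillna(0)
print(other_count.tolist())
```

On the sample data this gives 2.0 for A and 0.0 for B and C, which matches the expected table in the question.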
