How to get, for each record in df A, the total number of matching records in df B based on a condition

1 vote

1 answer

27 views

Asked 2025-04-14 18:22

I have two DataFrames that look like this:

The first one is called services_df

Service id	Domain
111	www.abc.com
222	xyz.com
333	www.opq.com
444	rst.com

The second one is called subscriptions_df

Subscription id	Domain	Status
11	abc.com	Active
22	abc.com	Active
33	www.xyz.com	Cancelled
44	rst.com	Suspended
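
For anyone who wants to reproduce the problem, here is a minimal sketch that builds the two sample tables as pandas DataFrames. The column names (`Service id`, `Subscription id`, `Domain`, `Status`) and the English status labels are assumptions inferred from the code and printed output elsewhere in this post.

```python
import pandas as pd

# Sample data only; the real tables have roughly 60,000 to 100,000 rows.
services_df = pd.DataFrame({
    "Service id": [111, 222, 333, 444],
    "Domain": ["www.abc.com", "xyz.com", "www.opq.com", "rst.com"],
})

subscriptions_df = pd.DataFrame({
    "Subscription id": [11, 22, 33, 44],
    "Domain": ["abc.com", "abc.com", "www.xyz.com", "rst.com"],
    "Status": ["Active", "Active", "Cancelled", "Suspended"],
})
```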

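I want to add a new "Total Active/Suspended Subs" column to the first table, showing how many active or suspended subscriptions the second table holds for the matching domain. Both tables are large (roughly 60,000 to 100,000 rows), so I would like to do this as efficiently as possible.

Service id	Domain	Total Active/Suspended Subs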
111	abc.com	2
222	xyz.com	0
333	opq.com	#N/A
444	rst.com	1

I came up with a function that does this, but it is not very efficient.

def numberOfActiveSubsTiedToDomainInServices(domain):
  # remove "www." and strip spaces
  domain = domain.replace('www.','').replace(' ','')
  # count the active or suspended subscriptions tied to this domain
  try:
    # .str.replace edits substrings; plain Series.replace only matches whole values
    sub_domains = subscriptions_df['Domain'].astype(str).str.replace('www.','',regex=False).str.replace(' ','',regex=False)
    return len(subscriptions_df.loc[(sub_domains == domain) & (subscriptions_df['Status'].isin(['Active','Suspended']))])
  except Exception:
    return '#N/A'

services_df['Total Active/Suspended Subs'] = services_df['Domain'].map(numberOfActiveSubsTiedToDomainInServices)

The problem is that this approach is very slow, and I also need to compute similar counts for other columns.

Is there a more efficient way to do this in Python?

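Tags: performance optimization, data processing, data analysis, statistics, data tables, data merging, subscription management, active status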

1 Answer

Try this:

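import pandas as pd

# first, make sure the domain names use the same format in both DataFrames
# (Series.str.removeprefix requires pandas 1.4 or later)
services_df["Domain"] = services_df["Domain"].str.removeprefix("www.")
subscriptions_df["Domain"] = subscriptions_df["Domain"].str.removeprefix("www.")

# make a crosstab from subscriptions_df: one row per domain, one column per status
# (status labels are case-sensitive and must match the data exactly)
tmp = pd.crosstab(subscriptions_df["Domain"], subscriptions_df["Status"])[
    ["Active", "Suspended"]
].sum(axis=1)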

# map the result from crosstab to df1
services_df["Total Active/Suspended Subs"] = services_df["Domain"].map(tmp)

print(services_df)

This prints:

   Service id   Domain  Total Active/Suspended Subs
0         111  abc.com                          2.0
1         222  xyz.com                          0.0
2         333  opq.com                          NaN
3         444  rst.com                          1.0
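
A variation on the same idea, sketched under a few assumptions: the counts should stay integers rather than floats, the original `Domain` columns should not be overwritten, and the same logic should be reusable for other status sets. `normalize_domain` and `count_subs` are hypothetical helper names introduced for illustration, not pandas functions, and the sample data mirrors the tables in the question.

```python
import pandas as pd

def normalize_domain(s: pd.Series) -> pd.Series:
    """Remove spaces and a leading 'www.' so both tables share one key format."""
    return (
        s.astype(str)
        .str.replace(" ", "", regex=False)
        .str.replace(r"^www\.", "", regex=True)
    )

def count_subs(services: pd.DataFrame, subscriptions: pd.DataFrame, statuses) -> pd.Series:
    """Count, per service row, the subscriptions whose Status is in `statuses`.

    Domains that appear in `subscriptions` without a wanted status get 0;
    domains that do not appear there at all get <NA>.
    """
    sub_domains = normalize_domain(subscriptions["Domain"])
    # One pass over the subscriptions table: keep wanted statuses, count per domain.
    counts = sub_domains[subscriptions["Status"].isin(statuses)].value_counts()
    # Re-add known domains that had no wanted status, with a count of 0.
    counts = counts.reindex(sub_domains.unique(), fill_value=0)
    # Look up each service's domain; the nullable Int64 dtype keeps counts integral.
    return normalize_domain(services["Domain"]).map(counts).astype("Int64")

# Sample data assumed from the question.
services_df = pd.DataFrame({
    "Service id": [111, 222, 333, 444],
    "Domain": ["www.abc.com", "xyz.com", "www.opq.com", "rst.com"],
})
subscriptions_df = pd.DataFrame({
    "Subscription id": [11, 22, 33, 44],
    "Domain": ["abc.com", "abc.com", "www.xyz.com", "rst.com"],
    "Status": ["Active", "Active", "Cancelled", "Suspended"],
})

services_df["Total Active/Suspended Subs"] = count_subs(
    services_df, subscriptions_df, ["Active", "Suspended"]
)
print(services_df)
```

Calling `count_subs` again with a different status list (for example `["Cancelled"]`) would give the other counts the question mentions, without rescanning the subscriptions table once per service row.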

Answered 2025-04-14 by Python大师
