How can I efficiently add a new column to a data frame from another data frame, based on column values in each data frame?

Posted 2024-06-17 15:22:07


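Thanks for stopping by. I have two data frames: one called "news_test", which stores 3 million news items, and another called "company_fuzzy_name", which stores 280K company names (with fuzzy names). Here are some examples: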

  1. news_test

+=======+===========================================================================+
| index |                                  content                                  |
+=======+===========================================================================+
|   0   | Apple and Google are two of the strongest companies in the world.         |
+-------+---------------------------------------------------------------------------+
|   1   | Working in Facebook and Google is my dream, however, it is still a dream. |
+-------+---------------------------------------------------------------------------+

  2. company_fuzzy_name

+=======+========+==============+=======================+
| index |   ID   | Company_Name | Company_FuzzyName_new |
+=======+========+==============+=======================+
|   0   | 123456 | Apple Inc.   | Apple Inc.|Apple      |
+-------+--------+--------------+-----------------------+
|   1   | 789111 | Google LLC   | Google LLC|Google     |
+-------+--------+--------------+-----------------------+
|   2   | 333333 | Facebook     | Facebook|FB           |
+-------+--------+--------------+-----------------------+

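Now, if any of the names in "Company_FuzzyName_new" (data frame: company_fuzzy_name, separated by |) matches any word in "content" (data frame: news_test), I want to add a new column named "Com" to news_test whose value is the matching "ID" from company_fuzzy_name. So, based on the examples above, the result would be: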

+=======+===========================================================================+==================+
| index |                                  content                                  |       Com        |
+=======+===========================================================================+==================+
|   0   | Apple and Google are two of the strongest companies in the world.         | [123456, 789111] |
+-------+---------------------------------------------------------------------------+------------------+
|   1   | Working in Facebook and Google is my dream, however, it is still a dream. | [789111, 333333] |
+-------+---------------------------------------------------------------------------+------------------+
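
To make the matching rule concrete, here is a minimal sketch that reproduces the expected "Com" values on the sample rows. It uses plain Python lists as stand-ins for the two data frames, and the helper name `matching_ids` is hypothetical. It also assumes that a name must appear as a whole word and that characters such as the "." in "Apple Inc." are meant literally; the question does not state either point.

```python
import re

# Sample rows copied from the two tables above
# (plain Python stand-ins for the data frames).
news = [
    "Apple and Google are two of the strongest companies in the world.",
    "Working in Facebook and Google is my dream, however, it is still a dream.",
]
companies = [
    (123456, "Apple Inc.|Apple"),
    (789111, "Google LLC|Google"),
    (333333, "Facebook|FB"),
]

def matching_ids(content, companies):
    """Return the IDs whose fuzzy names occur in `content` as whole words."""
    ids = []
    for company_id, fuzzy_names in companies:
        for alias in fuzzy_names.split("|"):
            # re.escape keeps the '.' in "Apple Inc." literal (assumption);
            # the lookarounds stop "Apple" from matching inside "Apples".
            pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
            if re.search(pattern, content):
                ids.append(company_id)
                break  # one hit per company is enough
    return ids

com = [matching_ids(text, companies) for text in news]
print(com)  # [[123456, 789111], [789111, 333333]]
```

The IDs come out in company-table order, which is why the second row lists Google before Facebook.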

I already have the code below, and it works:

```python
import re

list_total = []
for i in range(len(news_test)):
    # Python 2.7: encode the text so it is matched as a byte string
    content = news_test.iloc[i]['content'].encode('utf-8')
    list_match = []
    for j in range(len(company_fuzzy_name)):
        # The fuzzy-name string is used directly as a regex, so '|' acts as alternation
        if re.search(company_fuzzy_name.iloc[j]['Company_FuzzyName_new'], content):
            list_match.append(company_fuzzy_name.iloc[j]['ID'])
    list_total.append(list_match)
news_test['Com'] = list_total
```

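However, this is too slow (it performs 3M * 280K regex searches). Is there a way to speed it up, or to restructure the code to make it more efficient? The format of the "Com" column is not fixed; it can be a list, a string, etc. Thanks for your help.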

My Python environment is 2.7.
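
One possible way to avoid the 3M * 280K searches is to index the company names by their first word, so that each news item is only compared with names that could start at one of its own words. The sketch below is not a tested solution for the full data set. The helper names `build_alias_index` and `find_ids` are hypothetical, matching is assumed to be case-sensitive and on whole words with punctuation ignored, and the code is written for Python 3 with only the standard library (it has not been checked on 2.7).

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")  # a "word" is a run of letters, digits or underscores

def build_alias_index(companies):
    """Map the first word of every fuzzy name to (words, row, ID) entries."""
    index = defaultdict(list)
    for row, (company_id, fuzzy_names) in enumerate(companies):
        for alias in fuzzy_names.split("|"):
            words = tuple(TOKEN.findall(alias))
            if words:
                index[words[0]].append((words, row, company_id))
    return index

def find_ids(content, index):
    """Return the IDs of companies named in `content`, in company-table order."""
    words = TOKEN.findall(content)
    hits = {}  # company row -> ID, so each company is reported once
    for pos, word in enumerate(words):
        # Only names starting with this word are candidates.
        for alias_words, row, company_id in index.get(word, ()):
            if tuple(words[pos:pos + len(alias_words)]) == alias_words:
                hits[row] = company_id
    return [company_id for _, company_id in sorted(hits.items())]

# Sample rows from the question, as plain Python stand-ins for the data frames.
news = [
    "Apple and Google are two of the strongest companies in the world.",
    "Working in Facebook and Google is my dream, however, it is still a dream.",
]
companies = [
    (123456, "Apple Inc.|Apple"),
    (789111, "Google LLC|Google"),
    (333333, "Facebook|FB"),
]

index = build_alias_index(companies)
com = [find_ids(text, index) for text in news]
print(com)  # [[123456, 789111], [789111, 333333]]
```

With the real data frames, the index could be built once from `zip(company_fuzzy_name['ID'], company_fuzzy_name['Company_FuzzyName_new'])`, and the column filled with `news_test['Com'] = news_test['content'].apply(lambda text: find_ids(text, index))`. The work per news item then depends on the number of words it contains and on how many names share a first word, not on the total number of companies.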

