Pandas/Python: joining/merging two DataFrames on a list column

Published 2024-06-16 11:12:01


Consider two DataFrames, Person and Movie:

DataFrame Person

+---+-----------+-------------------+-----------------------------+-----------------------------------------+
|   |    nconst |       primaryName |           primaryProfession |                          knownForTitles |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 0 | nm0000103 |      Fairuza Balk |          actress,soundtrack | tt0181875,tt0089908,tt0120586,tt0115963 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 1 | nm0000106 |    Drew Barrymore | producer,actress,soundtrack | tt0120888,tt0343660,tt0151738,tt0120631 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 2 | nm0000117 |     Neve Campbell | actress,producer,soundtrack | tt0134084,tt1262416,tt0120082,tt0117571 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 3 | nm0000132 |      Claire Danes | actress,producer,soundtrack | tt0274558,tt0108872,tt1796960,tt0117509 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+
| 4 | nm0000138 | Leonardo DiCaprio |       actor,producer,writer | tt0120338,tt0993846,tt1375666,tt0407887 |
+---+-----------+-------------------+-----------------------------+-----------------------------------------+

DataFrame Movie

+---+-----------+-----------+---------------------+-----------------------+
|   |    tconst | titleType |       originalTitle |                genres |
+---+-----------+-----------+---------------------+-----------------------+
| 0 | tt0192789 |     movie | While Supplies Last |        Comedy,Musical |
+---+-----------+-----------+---------------------+-----------------------+
| 1 | tt4914592 |     movie |      Electric Heart | Adventure,Drama,Music |
+---+-----------+-----------+---------------------+-----------------------+
| 2 | tt4999994 |     movie |           Rain Doll |                 Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 3 | tt2690572 |     movie |             Polaris |                 Drama |
+---+-----------+-----------+---------------------+-----------------------+
| 4 | tt1562859 |     movie |           Golmaal 3 |         Action,Comedy |
+---+-----------+-----------+---------------------+-----------------------+

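As you can see, knownForTitles in Person is a list of tconst values from the Movie DataFrame.

Questions:

  1. How do I count how many actors have played a role in an action movie?
  2. How many actors have starred in movies of more than one genre?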

2 answers

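I am still learning pandas, so there is a good chance I am going about this the wrong way. That said, let's give it a try:

First, let's see whether we can find all the rows in the Movies df that are action movies. Looking at "Pandas dataframe select rows where a list-column contains any of a list of strings", I came up with this:

Movies['isAction'] = ['Action' in x for x in Movies['genres'].tolist()]

The result: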

      tconst titleType      originalTitle                     genres  isAction
0  tt0407887     movie  WhileSuppliesLast          [Comedy, Musical]     False
1  tt1375666     movie      ElectricHeart  [Adventure, Drama, Music]     False
2  tt4999994     movie           RainDoll                    [Drama]     False
3  tt2690572     movie            Polaris                    [Drama]     False
4  tt0134084     movie           Golmaal3           [Action, Comedy]      True

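I added the isAction column to the Movies df. I also changed some of the tconst values so that we get some positive results (rows 0, 1 and 4 were changed).

I changed row 4 so that Neve Campbell shows up in the results.

We can now build a list of the tconst values of the action movies:

listOfActionMovies = Movies[Movies["isAction"]]["tconst"].tolist()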
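
The question shows genres as comma-separated strings, while the output above shows lists, so a split step was presumably applied before the membership test. A minimal sketch of that step, assuming the modified tconst values described above (the lambda is an illustrative alternative to the list comprehension, not the original author's code):

```python
import pandas as pd

# Movie frame with the answer's modified tconst values (rows 0, 1 and 4).
Movies = pd.DataFrame({
    "tconst": ["tt0407887", "tt1375666", "tt4999994", "tt2690572", "tt0134084"],
    "genres": ["Comedy,Musical", "Adventure,Drama,Music", "Drama", "Drama", "Action,Comedy"],
})

# Turn each comma-separated string into a list of genre names.
Movies["genres"] = Movies["genres"].str.split(",")

# Flag the rows whose genre list contains "Action".
Movies["isAction"] = Movies["genres"].apply(lambda genres: "Action" in genres)

print(Movies)
```

Splitting first makes the test an exact match on list elements rather than a substring match on the raw string.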

Now, again using the solution from "Pandas dataframe select rows where a list-column contains any of a list of strings":

Person["inAction"] = pd.DataFrame(Person.knownForTitles.tolist()).isin(listOfActionMovies).any(axis=1)

This yields:

      nconst       primaryName                primaryProfession                                knownForTitles  inAction
0  nm0000103       FairuzaBalk            [actress, soundtrack]  [tt0181875, tt0089908, tt0120586, tt0115963]     False
1  nm0000106     DrewBarrymore  [producer, actress, soundtrack]  [tt0120888, tt0343660, tt0151738, tt0120631]     False
2  nm0000117      NeveCampbell  [actress, producer, soundtrack]  [tt0134084, tt1262416, tt0120082, tt0117571]      True
3  nm0000132       ClaireDanes  [actress, producer, soundtrack]  [tt0274558, tt0108872, tt1796960, tt0117509]     False
4  nm0000138  LeonardoDiCaprio        [actor, producer, writer]  [tt0120338, tt0993846, tt1375666, tt0407887]     False
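
The one-liner above expands each list into its own columns and then tests every cell. A sketch of the same check next to a per-row variant that may be easier to read; the two-row frame and the name action_titles are illustrative and not part of the original answer:

```python
import pandas as pd

Person = pd.DataFrame({
    "primaryName": ["Neve Campbell", "Claire Danes"],
    "knownForTitles": [
        ["tt0134084", "tt1262416", "tt0120082", "tt0117571"],
        ["tt0274558", "tt0108872", "tt1796960", "tt0117509"],
    ],
})
action_titles = ["tt0134084"]  # hypothetical list of action-movie tconst values

# Wide form: one column per list position, then test each cell.
wide = pd.DataFrame(Person["knownForTitles"].tolist(), index=Person.index)
in_action_wide = wide.isin(action_titles).any(axis=1)

# Per-row form: True when the person's titles overlap the action titles.
action_set = set(action_titles)
in_action_rowwise = Person["knownForTitles"].apply(
    lambda titles: not action_set.isdisjoint(titles)
)

print(in_action_wide.tolist(), in_action_rowwise.tolist())
```

Both variants should flag the same rows; the per-row form avoids building the intermediate wide frame.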

Now we can finally count all the people who appear in action movies:

len(Person[Person["inAction"]])

The len() solution comes from "get dataframe row count based on conditions".

First, we create person as a DataFrame:

import pandas as pd

columns = ['nconst', 'primaryName', 'primaryProfession', 'knownForTitles']

data = [
('nm0000103',      'Fairuza Balk',          'actress,soundtrack', 'tt0181875,tt0089908,tt0120586,tt0115963'),
('nm0000106',    'Drew Barrymore', 'producer,actress,soundtrack', 'tt0120888,tt0343660,tt0151738,tt0120631'),
('nm0000117',     'Neve Campbell', 'actress,producer,soundtrack', 'tt0134084,tt1262416,tt0120082,tt0117571'),
('nm0000132',      'Claire Danes', 'actress,producer,soundtrack', 'tt0274558,tt0108872,tt1796960,tt0117509'),
('nm0000138', 'Leonardo DiCaprio',       'actor,producer,writer', 'tt0120338,tt0993846,tt1375666,tt0407887'),
]

person = pd.DataFrame(data=data, columns=columns)

Second, we split the strings in the two columns into lists:

for field in ['primaryProfession', 'knownForTitles']:
    person[field] = person[field].str.split(',')

Third, we use the explode function to turn each row into multiple rows:

person = person.explode('knownForTitles').explode('primaryProfession')
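
Note that exploding two list columns one after the other produces every combination of the two lists for each person. A small sketch with made-up values (the frame toy and its column names are purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A"],
    "jobs": [["actor", "writer"]],
    "titles": [["t1", "t2", "t3"]],
})

# Each explode repeats the other columns, so the row count multiplies.
exploded = toy.explode("titles").explode("jobs")

print(exploded)
print(len(exploded))  # 3 titles x 2 jobs = 6 rows
```

This is why a filter on the profession column afterwards still leaves one row per title.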

Fourth, we keep only the rows whose primary profession is actor or actress:

actor_actress = person[ person['primaryProfession'].isin(['actress', 'actor'])]

Now we have a DataFrame in so-called tidy format (each cell holds a single value rather than a list):

    nconst     primaryName   primaryProfession knownForTitles
0   nm0000103  Fairuza Balk   actress          tt0181875
0   nm0000103  Fairuza Balk   actress          tt0089908
0   nm0000103  Fairuza Balk   actress          tt0120586
0   nm0000103  Fairuza Balk   actress          tt0115963
1   nm0000106  Drew Barrymore actress          tt0120888

At this point we can repeat these steps for the movie DataFrame and then join the actors (on knownForTitles) with the movies (on tconst).
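
The remaining steps might look like the sketch below. The sample frames in the question share no title IDs, so this assumes the modified tconst values from the first answer (rows 0, 1 and 4); the helper split_and_explode and the variable names are illustrative, not part of the original answer.

```python
import pandas as pd

person = pd.DataFrame(
    [
        ("nm0000103", "Fairuza Balk", "actress,soundtrack", "tt0181875,tt0089908,tt0120586,tt0115963"),
        ("nm0000106", "Drew Barrymore", "producer,actress,soundtrack", "tt0120888,tt0343660,tt0151738,tt0120631"),
        ("nm0000117", "Neve Campbell", "actress,producer,soundtrack", "tt0134084,tt1262416,tt0120082,tt0117571"),
        ("nm0000132", "Claire Danes", "actress,producer,soundtrack", "tt0274558,tt0108872,tt1796960,tt0117509"),
        ("nm0000138", "Leonardo DiCaprio", "actor,producer,writer", "tt0120338,tt0993846,tt1375666,tt0407887"),
    ],
    columns=["nconst", "primaryName", "primaryProfession", "knownForTitles"],
)

# Assumption: tconst values altered as in the first answer so that some titles match.
movie = pd.DataFrame(
    [
        ("tt0407887", "While Supplies Last", "Comedy,Musical"),
        ("tt1375666", "Electric Heart", "Adventure,Drama,Music"),
        ("tt4999994", "Rain Doll", "Drama"),
        ("tt2690572", "Polaris", "Drama"),
        ("tt0134084", "Golmaal 3", "Action,Comedy"),
    ],
    columns=["tconst", "originalTitle", "genres"],
)


def split_and_explode(df, columns):
    """Split comma-separated string columns into lists, then explode each one."""
    out = df.copy()
    for column in columns:
        out[column] = out[column].str.split(",")
    for column in columns:
        out = out.explode(column)
    return out.reset_index(drop=True)


person_tidy = split_and_explode(person, ["primaryProfession", "knownForTitles"])
actors = person_tidy[person_tidy["primaryProfession"].isin(["actor", "actress"])]
movie_tidy = split_and_explode(movie, ["genres"])

# One row per (actor, title, genre) combination that exists in both frames.
joined = actors.merge(movie_tidy, left_on="knownForTitles", right_on="tconst", how="inner")

# Question 1: distinct actors with at least one Action title.
action_actor_count = joined.loc[joined["genres"] == "Action", "nconst"].nunique()
print(action_actor_count)
```

Counting distinct nconst values rather than rows avoids counting an actor twice when several of their titles are action movies.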

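Sorry for the long reply. The key point of this approach is to use str.split(',') and then explode() to bring the DataFrame into a format suitable for joins, merges, and so on.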
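
For the second question, one possible reading is "actors whose known titles span more than one genre". A sketch on an already-joined tidy frame with one row per actor, title and genre; the rows below are illustrative (the Claire Danes row in particular is made up) and the name joined is an assumption:

```python
import pandas as pd

# Illustrative tidy frame: one row per (actor, title, genre).
joined = pd.DataFrame({
    "nconst": ["nm0000117", "nm0000117", "nm0000132", "nm0000138", "nm0000138"],
    "tconst": ["tt0134084", "tt0134084", "tt4999994", "tt0407887", "tt0407887"],
    "genres": ["Action", "Comedy", "Drama", "Comedy", "Musical"],
})

# Number of distinct genres each actor appears in.
genres_per_actor = joined.groupby("nconst")["genres"].nunique()

# Actors linked to more than one distinct genre.
multi_genre_count = int((genres_per_actor > 1).sum())
print(multi_genre_count)
```

If "more than one genre" should instead mean more than one title with differing genres, the grouping would need to be adjusted accordingly.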
