pandas: get the most common names from a column that contains lists of names

   star_rating  actors_list
0  9.3          [u'Tim Robbins', u'Morgan Freeman']
1  9.2          [u'Marlon Brando', u'Al Pacino', u'James Caan']
2  9.1          [u'Al Pacino', u'Robert De Niro']
3  9.0          [u'Christian Bale', u'Heath Ledger']
4  8.9          [u'John Travolta', u'Uma Thurman']

import pandas as pd

df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')

# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
(df.actors_list.str.replace(r"(u'|[\[\]]|')", '', regex=True)
   .str.lower().str.split(',', expand=True).stack().value_counts())

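3 Answers

User

#1 · edited 2024-05-19 01:53:44

In my tests, doing the regex cleanup after counting is much faster, because the pattern then runs only once per unique name instead of once per row.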

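from itertools import chain
import re

import pandas as pd  # df is the DataFrame loaded in the question

# Match the u'...' / u"..." wrapper. The prefix is [uU] because
# str.title() below turns the leading u into U.
p = re.compile(r"""^[uU]['"](.*)['"]$""")
# Count the raw tokens first, then clean only the unique index labels.
ser = pd.Series(list(chain.from_iterable(
    x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]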


ser.head()

Robert De Niro    18
Brad Pitt         14
Clint Eastwood    14
Tom Hanks         14
Al Pacino         13
dtype: int64

User

#2 · edited 2024-05-19 01:53:44

I would use ast to turn the string representations into real Python lists.

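import ast

# Parse each string such as "[u'Tim Robbins', ...]" into a real list.
df.actors_list = df.actors_list.apply(ast.literal_eval)
# One column per list position, melted into a single column, then counted.
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()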
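
As a possible variation on the same idea, the parsed lists can be flattened with Series.explode instead of building a wide DataFrame and melting it. This is only a sketch: the `sample` frame below is a hypothetical stand-in that copies the first rows shown in the question, and it assumes a pandas version that provides explode (0.25 or later).

```python
import ast

import pandas as pd

# Hypothetical sample mirroring the first rows shown in the question.
sample = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.1],
    "actors_list": [
        "[u'Tim Robbins', u'Morgan Freeman']",
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']",
        "[u'Al Pacino', u'Robert De Niro']",
    ],
})

# Parse each string into a real Python list, then give every name its own row.
parsed = sample["actors_list"].apply(ast.literal_eval)
actor_counts = parsed.explode().value_counts()

print(actor_counts.head())
```

Because explode keeps one row per name, no NaN padding is created for the shorter lists.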

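User

#3 · edited 2024-05-19 01:53:44

Plain Python is a better fit here than relying on pandas, because pandas consumes a lot of memory when the lists are large.

If the longest list has 1000 items, then with expand=True every list shorter than 1000 items is padded with NaN up to that length, which wastes memory. Try the approach benchmarked below.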
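
To make the padding concrete, here is a minimal sketch on made-up data (the Series `s` is hypothetical and not taken from the IMDb file). It compares the padded frame produced by expand=True with the ragged lists produced without it.

```python
import pandas as pd

# Hypothetical data: one long list and two short ones, already comma-joined.
s = pd.Series(["a,b,c,d,e,f", "a", "b,c"])

# expand=True creates one column per position in the longest list.
wide = s.str.split(",", expand=True)
print(wide.shape)                    # (3, 6)
print(int(wide.isna().sum().sum()))  # 9 of the 18 cells are padding

# Without expand, each row keeps a list of its own length.
as_lists = s.str.split(",")
print(int(as_lists.str.len().sum()))  # 9 actual items
```

In this toy case half of the expanded frame is padding, and the ratio gets worse as list lengths become more uneven.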

df = pd.concat([df]*1000)  # Repeat the frame 1000 times to get a large df for timing.

%%timeit
df.actors_list.str.replace(r"(u'|[\[\]]|')", '', regex=True).str.lower().str.split(',', expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop

%%timeit     
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop

%%timeit
words = {}
for i in df['actors_list']:
    for w in i : 
        if w in words:
            words[w]+=1
        else:
            words[w]=1

100 loops, best of 3: 5.44 ms per loop
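
The manual dictionary loop above can also be written with collections.Counter from the standard library. This is a sketch only: `actor_lists` is a hypothetical stand-in for df['actors_list'] after it has been split into lists, and no timing is claimed for it.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df['actors_list'] after splitting into lists.
actor_lists = [
    ["Tim Robbins", "Morgan Freeman"],
    ["Marlon Brando", "Al Pacino", "James Caan"],
    ["Al Pacino", "Robert De Niro"],
]

# Counter tallies every name in a single pass over the flattened lists.
counts = Counter(chain.from_iterable(actor_lists))

# most_common(n) returns the n most frequent names with their counts.
print(counts.most_common(1))
```

With the real column, passing chain.from_iterable(df['actors_list']) to Counter would play the same role as the nested loop.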
