Accessing lists stored in a DataFrame column.

import pandas as pd

# Typical data example:
data = {'tchname': ['MISS NANDA DEVI', 'RAJIK HUSSAIN-III',
                    'MAJJI VENKATA KANAKA DURGA RANI']}
df = pd.DataFrame(data)

# Split words in teacher names into list.
df['tchname'] = df['tchname'].str.split()

# Extract first word from tchname lists.
df['firstname'] = df['tchname'].str[0].str.title()

# If firstname matches item in honorific, replace with second tchname entry
df['placeholder'] = df['tchname'].str[1].str.title()
honorific = ['Dr', 'Miss', 'Mr', 'Mrs', 'Ms']
df.loc[df['firstname'].isin(honorific), 'firstname'] \
    = df.loc[df['firstname'].isin(honorific), 'placeholder']
df = df.drop(columns='placeholder')

# Extract last name from tchname lists.
df['surname'] = df['tchname'].str[-1].str.title()

2 Answers


Answer 1 · edited 2024-06-10 16:59:33

Thanks to Alexander Cécile for suggesting a regex. I had been trying to avoid regex because of its poor performance, but here is a solution based on it:

import numpy as np
import pandas as pd

# Typical data example:
data = {'tchname': ['MISS NANDA DEVI', 'RAJIK HUSSAIN-III',
                    'MAJJI VENKATA KANAKA DURGA RANI']}
df = pd.DataFrame(data)

# Set firstname to first or second word of tchname based on honorific presence.
df['firstname'] = np.where(df['tchname'].str.match(
    '^(Dr|Miss|Mr|Mrs|Ms) ', case=False),
    df['tchname'].str.split().str[1].str.capitalize(),
    df['tchname'].str.split().str[0].str.capitalize())

df['surname'] = df['tchname'].str.split().str[-1].str.capitalize()

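I would say this code is clearly cleaner, and from a maintainability standpoint it may well be a good solution. As expected, though, it runs slower than the original: with a large dataset on my machine it took about 6.3 seconds, versus about 5.4 seconds for the code in the question. So I won't accept this answer unless no better option turns up.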
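For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch using `timeit`. The helper names `firstname_split` and `firstname_regex`, the row count, and the repeat count are all hypothetical choices for illustration; the two helpers are only loosely modelled on the question code and the regex answer, and the timings will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical dataset: the three sample names repeated many times.
names = ['MISS NANDA DEVI', 'RAJIK HUSSAIN-III',
         'MAJJI VENKATA KANAKA DURGA RANI']
df = pd.DataFrame({'tchname': names * 10_000})

HONORIFICS = ['Dr', 'Miss', 'Mr', 'Mrs', 'Ms']


def firstname_split(frame):
    """List-based approach: split once, then pick word 0 or word 1."""
    words = frame['tchname'].str.split()
    first = words.str[0].str.title()
    second = words.str[1].str.title()
    # Keep the first word unless it is an honorific, else use the second.
    return first.where(~first.isin(HONORIFICS), second)


def firstname_regex(frame):
    """Regex-based approach: detect a leading honorific with str.match."""
    words = frame['tchname'].str.split()
    has_title = frame['tchname'].str.match(
        r'^(Dr|Miss|Mr|Mrs|Ms) ', case=False)
    picked = np.where(has_title,
                      words.str[1].str.capitalize(),
                      words.str[0].str.capitalize())
    return pd.Series(picked, index=frame.index)


# Time each approach over a few runs; absolute numbers vary by machine.
t_split = timeit.timeit(lambda: firstname_split(df), number=3)
t_regex = timeit.timeit(lambda: firstname_regex(df), number=3)
print(f'split: {t_split:.3f}s  regex: {t_regex:.3f}s')
```

Measuring both variants on the same frame in the same session keeps the comparison fair, which matters when the gap is as small as the one reported above.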


Answer 2 · edited 2024-06-10 16:59:33

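Here is a (relatively?) simple regex solution. In this case it should be used with ^{}. It accepts any non-whitespace characters as part of a name; it can, and should, be made more specific.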

^(?:(?:Dr|Miss|Mr|Mrs|Ms)\s+)?(\S+)(?:.*)\s+(\S+)$

Don't forget the flags!

re.IGNORECASE | re.UNICODE
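As an illustration of how the pattern and the flags fit together, here is a minimal sketch that applies them with pandas `Series.str.extract`. The choice of `str.extract` is an assumption made for this example, as are the `parts` variable and the final `capitalize()` step; the answer itself only supplies the pattern and the flags.

```python
import re

import pandas as pd

# Pattern from the answer: optional honorific, first name, last name.
PATTERN = r'^(?:(?:Dr|Miss|Mr|Mrs|Ms)\s+)?(\S+)(?:.*)\s+(\S+)$'

df = pd.DataFrame({'tchname': ['MISS NANDA DEVI', 'RAJIK HUSSAIN-III',
                               'MAJJI VENKATA KANAKA DURGA RANI']})

# str.extract returns one column per capture group (labelled 0 and 1 here).
parts = df['tchname'].str.extract(
    PATTERN, flags=re.IGNORECASE | re.UNICODE)

df['firstname'] = parts[0].str.capitalize()
df['surname'] = parts[1].str.capitalize()
print(df[['firstname', 'surname']])
```

Rows that do not match the pattern (for example, a name with a single word) come back as missing values in both columns, so they are easy to spot afterwards.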

I will look into building the regex programmatically, because things could get annoying as the number of honorifics/titles grows.
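One possible way to build the pattern from a list is sketched below. The function name `build_name_pattern` is hypothetical and this is not the answerer's own code; it simply joins the titles into the alternation used in the pattern above, escaping each one so that a title such as "Prof." is matched literally.

```python
import re


def build_name_pattern(honorifics):
    """Compile the name pattern for an arbitrary list of honorifics."""
    # re.escape keeps regex metacharacters in a title (e.g. the dot in
    # "Prof.") from being interpreted as part of the pattern.
    alternatives = '|'.join(re.escape(title) for title in honorifics)
    return re.compile(
        rf'^(?:(?:{alternatives})\s+)?(\S+)(?:.*)\s+(\S+)$',
        re.IGNORECASE | re.UNICODE)


pattern = build_name_pattern(['Dr', 'Miss', 'Mr', 'Mrs', 'Ms'])
first, last = pattern.match('MISS NANDA DEVI').groups()
print(first, last)
```

Adding a new title then only means appending to the list, and the compiled pattern can be passed wherever a regex is accepted.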
