How to create separate substring columns from a DataFrame column value

3 Answers

User

Answer 1 · edited 2024-06-01 04:11:40

You can use str.extract and apply .astype to the result to get the desired columns, with the numeric column converted to float:

separated = df.Name.str.extract(r"""(?ix)
    (?P<Symbol>[a-z]+)     # all letters up to a date that matches
    (?P<Month>\d{2}\w{3})  # the date (2 numbers then 3 letters)
    (?P<SP>.*?)            # everything until the "type"
    (?P<Type>\w{2}$)       # Last two characters of string is the type
""").astype({'SP': 'float'})

This gives you:

    Symbol  Month     SP Type
0  INFOSYS  18SEP  640.5   PE
1  INFOSYS  18SEP  640.5   PE
2     BHEL  18SEP   52.8   CE
3     BHEL  18SEP   52.8   CE
4     IOCL  18SEP  640.0   PE
5     IOCL  18SEP  640.0   PE

Then apply df.join(separated) to get your final DataFrame:

     Instru                  Name   Symbol  Month     SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.5   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.8   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP  640.0   PE
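
For a runnable end-to-end check, here is a minimal sketch of the two steps above. The sample DataFrame is an assumption reconstructed from the output tables (one row per distinct Name), since the question's original data is not shown in this thread.

```python
import pandas as pd

# Sample data reconstructed from the output tables (an assumption).
df = pd.DataFrame({
    "Instru": [16834306] * 3,
    "Name": ["INFOSYS18SEP640.50PE", "BHEL18SEP52.80CE", "IOCL18SEP640PE"],
})

# (?i) = case-insensitive, (?x) = verbose mode, which allows the comments.
pattern = r"""(?ix)
    (?P<Symbol>[a-z]+)     # letters before the date
    (?P<Month>\d{2}\w{3})  # two digits, then three word characters
    (?P<SP>.*?)            # everything up to the type
    (?P<Type>\w{2}$)       # last two characters of the string
"""

# Each named group becomes a column; SP is converted from text to float.
separated = df["Name"].str.extract(pattern).astype({"SP": "float"})
result = df.join(separated)
print(result)
```

Because the named groups become column names, no separate renaming step should be needed before the join.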

User

Answer 2 · edited 2024-06-01 04:11:40

You can define a split function and use it to create the desired output:

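import pandas as pd

def f(x):
    # Find the index of the first digit; the symbol is everything before it.
    for i, c in enumerate(x):
        if c.isdigit():
            break
    # Note: the fixed end index 9 assumes a four-letter symbol, as in the sample data.
    return [x[0:i], x[i:9], x[9:-2], x[-2:]]

# Apply f to every Name and spread the four parts over new columns.
df[['Symbol','Month','SP','Type']] = pd.DataFrame(df.Name.apply(f).tolist())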

     Instru               Name Symbol  Month      SP Type
0  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
1  16834306  INFY18SEP640.50PE   INFY  18SEP  640.50   PE
2  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
3  16834306   BHEL18SEP52.80CE   BHEL  18SEP   52.80   CE
4  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE
5  16834306     IOCL18SEP640PE   IOCL  18SEP     640   PE
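
The function above slices the date with a fixed end index of 9, which fits four-letter symbols such as INFY. As a sketch of one possible variant, the offsets can be taken relative to the first digit instead. The name split_name is hypothetical, and the code assumes the date is always 5 characters and the type always 2.

```python
def split_name(name):
    """Split e.g. 'BHEL18SEP52.80CE' into [symbol, month, strike, type]."""
    # Index of the first digit; fall back to len(name) if there is none.
    i = next((k for k, c in enumerate(name) if c.isdigit()), len(name))
    # Assumed layout: symbol, 5-character date, strike price, 2-character type.
    return [name[:i], name[i:i + 5], name[i + 5:-2], name[-2:]]

short = split_name("INFY18SEP640.50PE")
long_ = split_name("INFOSYS18SEP640.50PE")
print(short)
print(long_)
```

This variant can be passed to df.Name.apply in the same way as f.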

User

Answer 3 · edited 2024-06-01 04:11:40

Use str.extract with named groups in the regex pattern:

pat = r'(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)'
df.join(df.Name.str.extract(pat))

     Instru                  Name   Symbol  Month      SP Type
0  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
1  16834306  INFOSYS18SEP640.50PE  INFOSYS  18SEP  640.50   PE
2  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
3  16834306      BHEL18SEP52.80CE     BHEL  18SEP   52.80   CE
4  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE
5  16834306        IOCL18SEP640PE     IOCL  18SEP     640   PE

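Explanation of the regex pattern

Regular expressions are a fun, fuzzy business and something of an art form. I'll explain what I did and why. You can compare my approach with @jonclements's and see that we both solved the problem the same way but made subtly different assumptions.

'(?P<group_name>pattern)' creates a capture group and names it 'group_name'.
'(?P<Symbol>.*?)' grabs all characters up to the next capture group; the '?' makes it non-greedy.
'(?P<Month>\d{1,2}\w{3})' grabs 1 or 2 digits followed by 3 word characters (the month letters). The ambiguity of 1 or 2 digits is why I made the previous group non-greedy.
'(?P<SP>[\d\.]+)' grabs one or more digits or periods. Admittedly this is not very elegant, since it would also match '4.2.4.5', but it should get the job done.
'(?P<Type>.*)' cleans up and grabs the rest.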
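
To see what each named group captures, the pattern can be tried with Python's re module on a single string. The sketch below also includes a stricter variant that is not part of the answer: the strict pattern and the malformed input 'X18SEP4.2.4.5PE' are hypothetical, added only to illustrate the '4.2.4.5' caveat.

```python
import re

# The pattern from the answer, compiled as a raw string.
loose = re.compile(
    r"(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})(?P<SP>[\d\.]+)(?P<Type>.*)"
)

groups = loose.search("BHEL18SEP52.80CE").groupdict()
print(groups)

# The caveat: [\d\.]+ accepts a malformed number with several periods.
bad_sp = loose.search("X18SEP4.2.4.5PE").group("SP")
print(bad_sp)

# Hypothetical stricter variant: at most one decimal part, then a
# two-letter type that must end the string.
strict = re.compile(
    r"(?P<Symbol>.*?)(?P<Month>\d{1,2}\w{3})"
    r"(?P<SP>\d+(?:\.\d+)?)(?P<Type>[A-Za-z]{2})$"
)

rejected = strict.search("X18SEP4.2.4.5PE")
print(rejected)
```

With str.extract, a row that fails to match yields NaN in every extracted column, so a stricter pattern trades convenience for easier detection of malformed names.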
