Extracting specific information from strings with different patterns

Posted 2024-05-16 00:56:48


import pandas as pd
df = pd.DataFrame({'Reference':["PO: TK42-8", 
                                "PO GQ5-42", 
                                "PO:HEA-238/239", 
                                "PO: 4501005609  Purchaser: Mariana Toledo Blanco", 
                                "FITN7-26", 
                                "PO#CP4-62",
                                "PO 4501004752  Purchaser Yang Gao / Split from S94964",
                                "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"]
                   })

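From the df above, I have been trying for a while, with minimal success, to extract two pieces of information (where available) and put them in two new columns, as shown in the desired df below.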

df2 = pd.DataFrame({'Reference':["PO: TK42-8", 
                                "PO GQ5-42", 
                                "PO:HEA-238/239", 
                                "PO: 4501005609  Purchaser: Mariana Toledo Blanco", 
                                "FITN7-26", 
                                "PO#CP4-62",
                                "PO 4501004752  Purchaser Yang Gao / Split from S94964",
                                "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17"],
                    
                    "PO":["TK42-8", "GQ5-42", "HEA-238/239", "4501005609", "FITN7-26","CP4-62", "4501004752", "GQY6-17" ],
                    "Purchaser":["", "", "", "Mariana Toledo Blanco", "","", "Yang Gao", "" ],
                   })

So far I have had some success with the following:

df['PO'] = df['Reference'].str.extract(r"PO:.*?([ \w.\S-]+)")
df['Purchaser'] = df['Reference'].str.extract(r"Purchaser.*?([ \w.*]+)")

However, I can't work out how to write the pattern inside each function's parentheses so that it handles every case.


1 answer
Answer #1 · posted 2024-05-16 00:56:48

>>> df['Reference'].str.extract(r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)")
             0
0       TK42-8
1       GQ5-42
2  HEA-238/239
3   4501005609
4     FITN7-26
5       CP4-62
6   4501004752
7      GQY6-17

Explanation

                                        
  (?:                      group, but do not capture:
                                        
    ^                        the beginning of the string
                                        
    (?=                      look ahead to see if there is:
                                        
      [A-Z\d/-]+               any character of: 'A' to 'Z', digits
                               (0-9), '/', '-' (1 or more times
                               (matching the most amount possible))
                                        
      $                        before an optional \n, and the end of
                               the string
                                        
    )                        end of look-ahead
                                        
   |                        OR
                                        
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
                                        
    PO                       'PO'
                                        
    \W*                      non-word characters (all but a-z, A-Z, 0-
                             9, _) (0 or more times (matching the
                             most amount possible))
                                        
  )                        end of grouping
                                        
  (                        group and capture to \1:
                                        
    [A-Z\d/-]+               any character of: 'A' to 'Z', digits (0-
                             9), '/', '-' (1 or more times (matching
                             the most amount possible))
                                        
  )                        end of \1
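
The breakdown above can also be written into the pattern itself. Below is a minimal sketch using the standard-library `re` module with `re.VERBOSE`, so the comments sit next to each part of the regex. The names `PO_RE` and `extract_po` are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so it can carry comments.
PO_RE = re.compile(
    r"""
    (?:
        ^(?=[A-Z\d/-]+$)   # the whole string is a bare code such as FITN7-26
      |
        \bPO\W*            # or the word PO followed by any separators
    )
    ([A-Z\d/-]+)           # group 1: the PO code itself
    """,
    re.VERBOSE,
)


def extract_po(reference):
    """Return the PO code found in `reference`, or None if there is none."""
    match = PO_RE.search(reference)
    return match.group(1) if match else None


print(extract_po("PO: TK42-8"))
print(extract_po("FITN7-26"))
```

The first alternative is what lets a reference with no `PO` prefix at all still be captured, provided the entire string consists only of uppercase letters, digits, `/` and `-`.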

>>> df['Reference'].str.extract(r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)").fillna("")
                       0
0                       
1                       
2                       
3  Mariana Toledo Blanco
4                       
5                       
6               Yang Gao
7                       

Explanation

                                        
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
                                        
  Purchaser                'Purchaser'
                                        
  \W+                      non-word characters (all but a-z, A-Z, 0-
                           9, _) (1 or more times (matching the most
                           amount possible))
                                        
  (                        group and capture to \1:
                                        
    \w                       word characters (a-z, A-Z, 0-9, _)
                                        
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
                                        
      [\s\w]*                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), word characters (a-
                               z, A-Z, 0-9, _) (0 or more times
                               (matching the most amount possible))
                                        
      \w                       word characters (a-z, A-Z, 0-9, _)
                                        
    )?                       end of grouping
                                        
  )                        end of \1
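
To get the two new columns asked for in the question, the two patterns can be assigned back to the frame. This is a sketch that assumes pandas is installed. It passes `expand=False` so that `str.extract` returns a Series rather than a one-column DataFrame, and the variable names `po_pattern` and `purchaser_pattern` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Reference": [
    "PO: TK42-8",
    "PO GQ5-42",
    "PO:HEA-238/239",
    "PO: 4501005609  Purchaser: Mariana Toledo Blanco",
    "FITN7-26",
    "PO#CP4-62",
    "PO 4501004752  Purchaser Yang Gao / Split from S94964",
    "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17",
]})

# The two patterns from the answer above, unchanged.
po_pattern = r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)"
purchaser_pattern = r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)"

# expand=False yields a Series, which can be assigned directly as a column.
df["PO"] = df["Reference"].str.extract(po_pattern, expand=False)

# Rows without a purchaser come back as NaN; replace them with empty strings.
df["Purchaser"] = (
    df["Reference"].str.extract(purchaser_pattern, expand=False).fillna("")
)

print(df[["PO", "Purchaser"]])
```

Note that the purchaser capture stops at the first character that is neither a word character nor whitespace, which is why `Yang Gao / Split from S94964` yields only `Yang Gao`.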
