Extracting specific information from strings with different patterns

import pandas as pd

df = pd.DataFrame({
    'Reference': [
        "PO: TK42-8",
        "PO GQ5-42",
        "PO:HEA-238/239",
        "PO: 4501005609 Purchaser: Mariana Toledo Blanco",
        "FITN7-26",
        "PO#CP4-62",
        "PO 4501004752 Purchaser Yang Gao / Split from S94964",
        "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17",
    ]
})

# Desired result: the PO code and the purchaser name in their own columns
df2 = pd.DataFrame({
    'Reference': [
        "PO: TK42-8",
        "PO GQ5-42",
        "PO:HEA-238/239",
        "PO: 4501005609 Purchaser: Mariana Toledo Blanco",
        "FITN7-26",
        "PO#CP4-62",
        "PO 4501004752 Purchaser Yang Gao / Split from S94964",
        "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17",
    ],
    "PO": ["TK42-8", "GQ5-42", "HEA-238/239", "4501005609",
           "FITN7-26", "CP4-62", "4501004752", "GQY6-17"],
    "Purchaser": ["", "", "", "Mariana Toledo Blanco",
                  "", "", "Yang Gao", ""],
})

1 answer

User

#1 · Posted on 2024-05-16 00:56:48

To extract the PO code, use:

>>> df['Reference'].str.extract(r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)")
             0
0       TK42-8
1       GQ5-42
2  HEA-238/239
3   4501005609
4     FITN7-26
5       CP4-62
6   4501004752
7      GQY6-17

Explanation

                                        
  (?:                      group, but do not capture:

    ^                        the beginning of the string

    (?=                      look ahead to see if there is:

      [A-Z\d/-]+               any character of: 'A' to 'Z', digits
                               (0-9), '/', '-' (1 or more times
                               (matching the most amount possible))

      $                        before an optional \n, and the end of
                               the string

    )                        end of look-ahead

   |                        OR

    \b                       the boundary between a word char (\w)
                             and something that is not a word char

    PO                       'PO'

    \W*                      non-word characters (all but a-z, A-Z,
                             0-9, _) (0 or more times (matching the
                             most amount possible))

  )                        end of grouping

  (                        group and capture to \1:

    [A-Z\d/-]+               any character of: 'A' to 'Z', digits
                             (0-9), '/', '-' (1 or more times
                             (matching the most amount possible))

  )                        end of \1
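
To see the two branches of the alternation separately, here is a minimal sketch using only the standard `re` module. The helper name `extract_po` and the extra test strings are made up for illustration and are not part of the question's data. The last call shows a limitation worth knowing about: because `\W*` may match nothing, any word that merely starts with "PO" also triggers the second branch.

```python
import re

# Same pattern as in the answer above.
PO_PATTERN = re.compile(r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)")


def extract_po(reference):
    """Return the first PO code found in `reference`, or None (hypothetical helper)."""
    match = PO_PATTERN.search(reference)
    return match.group(1) if match else None


# Branch 1: the whole string consists of code characters, so no "PO" prefix is needed.
print(extract_po("FITN7-26"))             # FITN7-26

# Branch 2: "PO" followed by any run of non-word characters, then the code.
print(extract_po("PO#CP4-62"))            # CP4-62

# Neither branch applies: no match.
print(extract_po("Purchaser: Yang Gao"))  # None

# Limitation: a word beginning with "PO" is treated as a prefix plus a code.
print(extract_po("SHIP TO POLAND"))       # LAND
```

If such words can occur in your data, one option is to require at least one separator after "PO", but that would also reject a reference written as "PO" immediately followed by the code, so check it against your own rows first.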

To extract the purchaser name (rows without one become empty strings), use:

>>> df['Reference'].str.extract(r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)").fillna("")
                       0
0                       
1                       
2                       
3  Mariana Toledo Blanco
4                       
5                       
6               Yang Gao
7

Explanation

                                        
  \b                       the boundary between a word char (\w) and
                           something that is not a word char

  Purchaser                'Purchaser'

  \W+                      non-word characters (all but a-z, A-Z,
                           0-9, _) (1 or more times (matching the most
                           amount possible))

  (                        group and capture to \1:

    \w                       word characters (a-z, A-Z, 0-9, _)

    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):

      [\s\w]*                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), word characters
                               (a-z, A-Z, 0-9, _) (0 or more times
                               (matching the most amount possible))

      \w                       word characters (a-z, A-Z, 0-9, _)

    )?                       end of grouping

  )                        end of \1
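
Putting both patterns together, the sketch below rebuilds the PO and Purchaser values from the question with plain `re`, so it runs without pandas. The function name `parse_reference` is hypothetical, and returning an empty string when no PO is found is an assumption of this sketch (`str.extract` would give NaN there instead).

```python
import re

# The two patterns from the answer above.
PO_PATTERN = re.compile(r"(?:^(?=[A-Z\d/-]+$)|\bPO\W*)([A-Z\d/-]+)")
PURCHASER_PATTERN = re.compile(r"\bPurchaser\W+(\w(?:[\s\w]*\w)?)")


def parse_reference(reference):
    """Split one Reference string into PO code and purchaser name (hypothetical helper)."""
    po = PO_PATTERN.search(reference)
    purchaser = PURCHASER_PATTERN.search(reference)
    return {
        # Assumption: a missing value is represented by an empty string.
        "PO": po.group(1) if po else "",
        "Purchaser": purchaser.group(1) if purchaser else "",
    }


references = [
    "PO: TK42-8",
    "PO GQ5-42",
    "PO:HEA-238/239",
    "PO: 4501005609 Purchaser: Mariana Toledo Blanco",
    "FITN7-26",
    "PO#CP4-62",
    "PO 4501004752 Purchaser Yang Gao / Split from S94964",
    "GUANGDONG YOULONG ELECTRICAL APPLIANCES CO.,LTD // PO#GQY6-17",
]

rows = [parse_reference(ref) for ref in references]
for ref, row in zip(references, rows):
    print(f"{row['PO']!r:15} {row['Purchaser']!r:25} <- {ref}")
```

The trailing `\w` in the purchaser group is what trims the name: in the "Yang Gao / Split from S94964" row the capture stops at "Gao" rather than including the space before the slash. In pandas, the same two columns can be built by calling `.str.extract(pattern, expand=False)` once per pattern and assigning each result to a new column, with `.fillna("")` on the purchaser column as in the answer.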
