Using regex to move data from a string into a DataFrame?

3 answers

User

Answer 1 · edited 2024-04-26 05:17:37

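You can manipulate the string a little by splitting it and then extracting whatever sits between the parentheses. Split on "(" first to strip off the two outer levels of nesting.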

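import pandas as pd

# Sample data for illustration: the two strings used in the other answers, in a column named "col"
df = pd.DataFrame({'col': [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]})

# Split on the first two "(" only: ['', 'Names RED ', 'property (x 123) ...']
s = df.col.str.split('(', n=2)
df['Names'] = s.str[1].str.split().str[1]

# Extract every "(key value)" pair and split each one into [key, value]
s2 = s.str[2].str.extractall('[(](.*?)[)]')[0].str.split()

# Pivot the keys into columns, aligned on the original row index
df = pd.concat([df, (pd.DataFrame(s2.values.tolist(), index=s2.index.get_level_values(0))
                       .pivot(columns=0, values=1))], axis=1)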

Output:

                                                 col  Names code label type    x    y
0  (Names RED (property (x 123) (y 456) (type MT)...    RED  XYZ   ONE   MT  123  456
1  (Names GREEN (property (type MX) (label TWO) (...  GREEN  NaN   TWO   MX  789  101

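User

Answer 2 · edited 2024-04-26 05:17:37

A very basic and direct implementation (just to show that you could have started from something like this before asking, which would also have earned the question more credibility):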

string1 = "(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))"
string2 = "(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))"

names = []
x = []
y = []
label = []
code = []
split_string = string2.split(' ')

for i in range(0, len(split_string)):
    try:
        if "Names" in split_string[i]:
            names.append(split_string[i+1])
        if "x" in split_string[i]:
            # rstrip(')') removes every trailing ")", not just the first one
            x.append(split_string[i+1].rstrip(')'))
        # "(property" and "(type" also contain "y", so require it right after "("
        if "y" in split_string[i] and split_string[i].find("y") <= 1:
            y.append(split_string[i+1].rstrip(')'))
        if "label" in split_string[i]:
            label.append(split_string[i+1].rstrip(')'))
        if "code" in split_string[i]:
            code.append(split_string[i+1].rstrip(')'))
    except IndexError:
        break
print(names, x, y, label, code, sep='\n')

Output (string2, as used above):

['GREEN']
['789']
['101']
['TWO']
[]

Output (with split_string built from string1 instead):

['RED']
['123']
['456']
['ONE']
['XYZ']
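
The token walk above handles one string at a time and collects results in module-level lists. One possible way to reuse it across several strings is to wrap it in a function that returns one dict per string. `parse_tokens` and `WANTED` are hypothetical names introduced for this sketch, not part of the original answer, and it compares whole keys rather than substrings.

```python
WANTED = ("x", "y", "label", "code")  # assumed set of property keys to keep


def parse_tokens(text):
    """Return a dict with the name and the wanted properties found in one string."""
    tokens = text.split()
    record = {}
    # Pair every token with the one that follows it: ("(x", "123)"), ...
    for key_token, value_token in zip(tokens, tokens[1:]):
        key = key_token.lstrip("(")      # "(x"     -> "x"
        value = value_token.rstrip(")")  # "101)))" -> "101"
        if key == "Names":
            record["name"] = value
        elif key in WANTED:
            record[key] = value
    return record


string1 = "(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))"
string2 = "(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))"

for s in (string1, string2):
    print(parse_tokens(s))
```

Comparing the stripped key exactly avoids the special case for "y", since "property" and "type" no longer match.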

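User

Answer 3 · edited 2024-04-26 05:17:37

The pattern is regular, apart from the properties appearing in any order, so this is certainly doable. I did it in two steps: one regex grabs the colour at the start and extracts the property string, and a second regex extracts the individual properties.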

import re


inputs = [
'(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
'(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'
]

# Get the initial part, and chop off the inner property string
# (raw strings keep the backslashes from being read as string escapes)
initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
# Get all groups from (x 123) (y 456) (type MT) (label ONE) (code XYZ)
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

for s in inputs:
    parts = initial_re.match(s)
    color = parts.group(1)
    props = parts.group(2)
    # e.g. (x 123) (y 456) (type MT) (label ONE) (code XYZ)
    properties = prop_re.findall(props)
    # [('x', '123'), ('y', '456'), ('type', 'MT'), ('label', 'ONE'), ('code', 'XYZ')]
    print("%s: %s" % (color, properties))

This gives the output:

RED: [('x', '123'), ('y', '456'), ('type', 'MT'), ('label', 'ONE'), ('code', 'XYZ')]
GREEN: [('type', 'MX'), ('label', 'TWO'), ('x', '789'), ('y', '101')]

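To get this into pandas, you can accumulate the properties in a dictionary of lists (I used a defaultdict below). You need to store something for missing values so that all columns end up the same length; here I just store None (or null). Finally, use pd.DataFrame.from_dict to get your final DataFrame.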

import re
import pandas as pd
from collections import defaultdict

inputs = [
'(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
'(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'
]

# Get the initial part, and chop off the inner property string
# (raw strings keep the backslashes from being read as string escapes)
initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
# Get all groups from (x 123) (y 456) (type MT) (label ONE) (code XYZ)
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

columns = ['color', 'x', 'y', 'type', 'label', 'code']

data_dict = defaultdict(list)

for s in inputs:
    parts = initial_re.match(s)
    color = parts.group(1)
    props = parts.group(2)
    # e.g. (x 123) (y 456) (type MT) (label ONE) (code XYZ)
    properties = dict(prop_re.findall(props))
    properties['color'] = color

    for k in columns:
        v = properties.get(k)  # None if missing
        data_dict[k].append(v)


df = pd.DataFrame.from_dict(data_dict)
print(df)

The final output is:

   color    x    y type label  code
0    RED  123  456   MT   ONE   XYZ
1  GREEN  789  101   MX   TWO  None
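
As an alternative to padding the columns by hand, pandas can fill the gaps itself if each string becomes one dict and the list of dicts is passed to the DataFrame constructor; missing keys then come out as NaN rather than None. The sketch below assumes the same two input strings and regexes. The `records` list and the guard that skips non-matching strings are additions for illustration only.

```python
import re
import pandas as pd

inputs = [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]

initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

records = []
for s in inputs:
    parts = initial_re.match(s)
    if parts is None:
        continue  # assumption: silently skip strings that do not fit the layout
    record = dict(prop_re.findall(parts.group(2)))
    record['color'] = parts.group(1)
    records.append(record)

# columns= fixes the column order; keys absent from a record become NaN
df = pd.DataFrame(records, columns=['color', 'x', 'y', 'type', 'label', 'code'])
print(df)
```

All values are still strings at this point; if numeric columns are needed, something like `pd.to_numeric(df['x'])` could be applied afterwards.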
