How to correctly parse elements of a complex string with regex

Posted 2024-04-23 06:27:14


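I have data that can arrive in several formats, and I am having trouble parsing it correctly. Initially I used re.split to split on periods and conditionally re-join certain elements, but that created additional problems. I think regex can solve this, but I do not know how to write the pattern correctly.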

The data can take the following forms:

STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2

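The problem is that splitting on periods and slashes with regex means that if a variable is preceded by a literal period, that period is not included. If a variable has a period in front of it, I want the period kept as part of the string, e.g. var = ".VARIABLE1", and likewise var = "VARIABLE1.VARIABLE2". I do not need to store the static fields; I only need to extract the variable fields, whether there is one, two, or one preceded by a literal period.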

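I tried re.search, but could only get the first static field. I tried re.split('\.|/', line), but then either I cannot parse variables with a leading period (I get "car" when I want ".car"), or I have to manually re-join the two-part variables with ['.'.join(x[2:4])], which I do not want to do because the total number of fields varies.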
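To make the problem concrete, here is a minimal sketch of what splitting on every period and slash does to one of the sample lines. The variable names `line` and `parts` are only illustrative.

```python
import re

line = "STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2"

# Splitting on every period and slash turns the literal leading period
# into an empty string sitting between two separators.
parts = re.split(r"\.|/", line)
print(parts)
# ['STATICFIELD1', 'STATICFIELD2', '', 'VARIABLE1', 'VARIABLE2']
```

The leading period survives only as the empty element at index 2, so it has to be reconstructed by hand afterwards.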

For the given examples, my expected output is two separate variables holding the variable parts of the input:

x = VARIABLE1 y = VARIABLE2
x = VARIABLE1.VARIABLE2 y = VARIABLE3
x = .VARIABLE1 y = VARIABLE2
x = VARIABLE1 y = VARIABLE2
x = .VARIABLE1 y = VARIABLE2

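    # Attempt 1: split on periods and slashes, then conditionally re-join
    x = re.split(r'\.|\/', r)
    numElements = len(x)
    if x[numElements - 2] == "STATICFIELD2":
        y[x[2]] = 1
    else:
        x[2:4] = ['.'.join(x[2:4])]
        y[x[2]] = 1

    # Attempt 2: re.search, which only ever gave me the first static field
    x = re.search(r'(\bSTATICFIELD1.STATICFIELD2.\b+)(\b.STATICFIELD3/\b)',line)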

2 Answers

You can remove the STATICFIELD patterns from the string and then do a simple split on the slash:

import re

def splitXY(s):
    # strip every STATICFIELDn along with an optional period before and after it
    return re.sub(r"\.?STATICFIELD\d+\.?", "", s).split("/")

x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3")
print(x,y)  # VARIABLE1.VARIABLE2 VARIABLE3
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2
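If the static names really are the literal strings shown in the question, one alternative is to anchor a single pattern on them and capture the two variable parts directly. The sketch below assumes exactly that layout (a fixed `STATICFIELD1.STATICFIELD2.` prefix and an optional `.STATICFIELD3` suffix before the slash); `PATTERN` and `parse_xy` are hypothetical names, not part of the original answer.

```python
import re

# Hypothetical pattern: assumes the static field names are the literal
# strings STATICFIELD1, STATICFIELD2 and (optionally) STATICFIELD3.
PATTERN = re.compile(
    r"^STATICFIELD1\.STATICFIELD2\."   # fixed prefix and its separator
    r"(?P<x>.+?)"                      # first variable part, may start with '.'
    r"(?:\.STATICFIELD3)?"             # optional static suffix
    r"/(?P<y>.+)$"                     # everything after the slash
)

def parse_xy(line):
    """Return (x, y) for one line, or None if the line does not match."""
    m = PATTERN.match(line)
    return (m.group("x"), m.group("y")) if m else None

print(parse_xy("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2"))
# ('.VARIABLE1', 'VARIABLE2')
```

Unlike the substitution approach, this returns None for lines that do not follow the expected layout, which may or may not be what you want.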

[Update]

If you have some logic that lets you distinguish static field names from variable names, you can parse the string with split and join:

def isStatic(name): # this would be whatever logic distinguishes the names
    return name != "" and name.startswith("STATICFIELD")

def splitXY(s) :
    x,y = s.split("/")
    x =  ".".join(name for name in x.split(".") if not isStatic(name))
    return x,y

x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3")
print(x,y)  # VARIABLE1.VARIABLE2 VARIABLE3
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2

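Make sure isStatic() returns False for empty names; otherwise the empty segment produced by a double period is dropped and the leading period is lost.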
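As a rough illustration of why the empty-name case matters, the sketch below uses a hypothetical set of known static names as the predicate and compares it with a predicate that wrongly treats the empty string as static. `STATIC_NAMES`, `is_static`, `drops_empty` and `split_xy` are assumed names for this example only.

```python
# Hypothetical set of known static names; replace with the real ones.
STATIC_NAMES = {"STATICFIELD1", "STATICFIELD2", "STATICFIELD3"}

def is_static(name):
    # "" is not in the set, so empty segments (from "..") are kept.
    return name in STATIC_NAMES

def drops_empty(name):
    # Counter-example: treats "" as static, so the leading period is lost.
    return name == "" or name in STATIC_NAMES

def split_xy(s, is_static_fn=is_static):
    left, y = s.split("/", 1)
    x = ".".join(part for part in left.split(".") if not is_static_fn(part))
    return x, y

line = "STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2"
print(split_xy(line))               # ('.VARIABLE1', 'VARIABLE2')
print(split_xy(line, drops_empty))  # ('VARIABLE1', 'VARIABLE2')
```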

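As for the question as asked, I suspect you were downvoted because when you wrote VARIABLE and STATICFIELD, someone assumed you meant those names literally. If you did mean them literally, you would probably want to use findall instead.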

If that is what you need, the following should work, and you can process the result from there.

Edit: Option 1

>>> string = '''STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2'''



>>> def isolate_variables(string):
        import re
        result = []
        for line in string.split('\n'):
            x,y = re.findall(r'(?i)(?:(?<=\s|\.|\/)|(?<=^))(VARIABLE[\d]+?[\.]+(?:VARIABLE[\d]*)+|(?:(?<=\s|\.|\/)|(?<=^))[\.]*VARIABLE[\d]+?)(?=[\.\/\n\ ]|$)', line)
            result.append((x,y))
        print(result)
        return result



>>> isolate_variables(string)



#OUTPUT
[('VARIABLE1', 'VARIABLE2'), ('VARIABLE1.VARIABLE2', 'VARIABLE3'), ('.VARIABLE1', 'VARIABLE2'), ('VARIABLE1', 'VARIABLE2'), ('.VARIABLE1', 'VARIABLE2')]


Option 2: you can simply run findall on the whole multi-line string:

>>> import re


>>> string = '''STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2'''


>>> re.findall(r'(?i)(?:(?<=\s|\.|\/)|(?<=^))(VARIABLE[\d]+?[\.]+(?:VARIABLE[\d]*)+|(?:(?<=\s|\.|\/)|(?<=^))[\.]*VARIABLE[\d]+?)(?=[\.\/\n\ ]|$)', string)



#OUTPUT
['VARIABLE1', 'VARIABLE2', 'VARIABLE1.VARIABLE2', 'VARIABLE3', '.VARIABLE1', 'VARIABLE2', 'VARIABLE1', 'VARIABLE2', '.VARIABLE1', 'VARIABLE2']
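Option 2 returns one flat list, so the x/y pairing has to be rebuilt afterwards. A minimal sketch of one way to do that, assuming every input line yields exactly two matches (the names `flat`, `it` and `pairs` are only illustrative):

```python
flat = ['VARIABLE1', 'VARIABLE2', 'VARIABLE1.VARIABLE2', 'VARIABLE3',
        '.VARIABLE1', 'VARIABLE2', 'VARIABLE1', 'VARIABLE2',
        '.VARIABLE1', 'VARIABLE2']

# zip over the same iterator twice consumes the items two at a time,
# which regroups the flat list into (x, y) tuples.
it = iter(flat)
pairs = list(zip(it, it))
print(pairs)
```

If a line could produce more or fewer than two matches, the pairs would drift out of alignment, and the per-line approach from Option 1 is the safer choice.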
