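Tips for parsing a custom file format
Sorry for the vague title, but I really don't know how to describe this problem concisely.
I've created a (more or less) simple domain-specific language that I will use to specify which validation rules to apply to different entities (generally forms submitted from a webpage). There is a sample at the bottom of this post showing what the language looks like.
My problem is that I have no idea how to begin parsing this language into a form I can use (I will be doing the parsing in Python). My goal is to end up with a list of rules/filters (as strings, including arguments, e.g. 'cocoa(99)') that should be applied, in order, to each object/entity (also as strings, e.g. 'chocolate', 'chocolate.lindt', etc.).
I'm not sure which technique to start with, or even which techniques exist for problems like this. What do you think is the best way to go about it? I'm not looking for a complete solution, just a general nudge in the right direction.
Thanks.
Sample file of the language: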
# Comments start with the '#' character and last until the end of the line
# Indentation is significant (as in Python)
constant NINETY_NINE = 99  # Defines the constant `NINETY_NINE` to have the value `99`
*:  # Applies to all data
    isYummy  # Everything must be yummy
chocolate:  # To validate, say `validate("chocolate", object)`
    sweet  # chocolate must be sweet (but not necessarily chocolate.*)
    lindt:  # To validate, say `validate("chocolate.lindt", object)`
        tasty  # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)
        *:  # Applies to all data under chocolate.lindt
            smooth  # Could also be written smooth()
            creamy(1)  # Level 1 creamy
        dark:  # dark has no special validation rules
            extraDark:
                melt  # Filter that modifies the object being examined
                c:bitter  # Must be bitter, but only validated on client
                s:cocoa(NINETY_NINE)  # Must contain 99% cocoa, but only validated on server. Note constant
        milk:
            creamy(2)  # Level 2 creamy, overrides creamy(1) of chocolate.lindt.* for chocolate.lindt.milk
            creamy(3)  # Overrides creamy(2) of previous line (all but the last specification of a given rule are ignored)
ruleset food:  # To define a chunk of validation rules that can be expanded from the placeholder `food` (think macro)
    caloriesWithin(10, 2000)  # Unlimited parameters allowed
    edible
    leftovers:  # Nested rules allowed in rulesets
        stale
# Rulesets may be nested and/or include other rulesets in their definition
chocolate:  # Previously defined groups can be re-opened and expanded later
    ferrero:
        hasHazelnut
cake:
    tasty  # Same rule used for different data (see chocolate.lindt)
    isLie
    ruleset food  # Substitutes with rules defined for food; cake.leftovers must now be stale
pasta:
    ruleset food  # pasta.leftovers must also be stale
# Sample use (in JavaScript):
# var choc = {
#     lindt: {
#         cocoa: {
#             percent: 67,
#             mass: '27g'
#         }
#     }
#     // Objects/groups that are omitted (e.g. ferrero in this example) are not validated and raise no errors
#     // Objects that are not defined in the validation rules do not raise any errors (e.g. cocoa in this example)
# };
# validate('chocolate', choc);
# `validate` called isYummy(choc), sweet(choc), isYummy(choc.lindt), smooth(choc.lindt), creamy(choc.lindt, 1), and tasty(choc.lindt) in that order
# `validate` returned an array of any validation errors that were found
# Order of rule validation for objects:
# The current object is initially the object passed in to the validation function (second argument).
# The entry point in the rule group hierarchy is given by the first argument to the validation function.
# 1. First all rules that apply to all objects (defined using '*') are applied to the current object,
# starting with the most global rules and ending with the most local ones.
# 2. Then all specific rules for the current object are applied.
# 3. Then a depth-first traversal of the current object is done, repeating steps 1 and 2 with each object found as the current object
# When two rules have equal priority, they are applied in the order they were defined in the file.
# No need to end on blank line
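6 Answers
If your goal is to learn about parsing (that is, breaking data down into parts that are easier to work with), I highly recommend an object-oriented library such as PyParsing. It isn't as fast as more sophisticated tools like ANTLR, lex, and yacc, but it lets you start parsing right away.
This won't teach you about parsing, but your format is so close to legal YAML that you might consider redefining your language as a subset of YAML and using a standard YAML parser.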
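First off, if you want to learn about parsing (that is, turning source text into a form the computer can work with), try writing your own recursive descent parser. The language you've defined needs only a handful of productions. I suggest using Python's tokenize library to spare yourself the tedious work of converting a stream of bytes into a stream of tokens.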
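To make that suggestion concrete, here is a minimal sketch of a recursive descent parser built on `tokenize`. It covers only a subset of the language (groups, nested groups, and rules); `constant` and `ruleset` are deliberately left out. The names `parse` and `new_node` and the `{'rules': [...], 'groups': {...}}` result shape are my own assumptions for illustration, not something from the question.

```python
import io
import tokenize


def new_node():
    # Hypothetical result shape: rules in file order, plus named sub-groups.
    return {"rules": [], "groups": {}}


def parse(source):
    """Parse a subset of the DSL (groups and rules only) into nested nodes."""
    # tokenize already yields INDENT/DEDENT tokens; drop comments and blank lines.
    toks = [t for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.type not in (tokenize.COMMENT, tokenize.NL)]
    pos = 0

    def statement(node):
        # statement := rule NEWLINE | name ':' NEWLINE [INDENT block DEDENT]
        nonlocal pos
        words = []
        while toks[pos].type != tokenize.NEWLINE:
            words.append(toks[pos].string)
            pos += 1
        pos += 1  # skip NEWLINE
        if words[-1] != ":":
            node["rules"].append("".join(words))  # e.g. 'creamy(1)', 'c:bitter'
            return
        # A group header; re-opening an existing group extends it.
        child = node["groups"].setdefault("".join(words[:-1]), new_node())
        if toks[pos].type == tokenize.INDENT:
            pos += 1
            block(child, tokenize.DEDENT)
            pos += 1  # skip DEDENT

    def block(node, end):
        # block := statement* (stops at a DEDENT or at the ENDMARKER)
        while toks[pos].type != end:
            statement(node)

    root = new_node()
    block(root, tokenize.ENDMARKER)
    return root


SAMPLE = """\
# a comment
*:                # Applies to all data
    isYummy
chocolate:
    sweet
    lindt:
        tasty
        c:bitter
        creamy(1)
chocolate:        # re-opened group
    ferrero:
        hasHazelnut
"""

tree = parse(SAMPLE)
print(tree["groups"]["chocolate"]["groups"]["lindt"]["rules"])
```

Each grammar production maps to one function, which is the defining trait of recursive descent. The sketch does no error handling, so malformed input will fail with an `IndexError` rather than a useful message.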
If you are after practical parsing options, read on...
A quick-and-dirty solution is to use Python itself:
NINETY_NINE = 99  # Defines the constant `NINETY_NINE` to have the value `99`
rules = {
    '*': {  # Applies to all data
        'isYummy': {},  # Everything must be yummy
        'chocolate': {  # To validate, say `validate("chocolate", object)`
            'sweet': {},  # chocolate must be sweet (but not necessarily chocolate.*)
            'lindt': {  # To validate, say `validate("chocolate.lindt", object)`
                'tasty': {},  # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)
                '*': {  # Applies to all data under chocolate.lindt
                    'smooth': {},  # Could also be written smooth()
                    'creamy': 1,  # Level 1 creamy
                },
                # ...
            }
        }
    }
}
There are several variations on this trick. For example, here is a cleaner (albeit somewhat unusual) approach using classes:
class _:
    class isYummy: pass
    class chocolate:
        class sweet: pass
        class lindt:
            class tasty: pass
            class _:
                class smooth: pass
                class creamy: level = 1
# ...
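One way to read such a class-based definition back into plain data is to walk the nested classes with `vars()`. The following is only a sketch: the helper name `to_tree` and the dict shape it returns are assumptions made for illustration.

```python
import inspect


class _:
    class isYummy: pass
    class chocolate:
        class sweet: pass
        class lindt:
            class tasty: pass
            class _:
                class smooth: pass
                class creamy: level = 1


def to_tree(cls):
    """Recursively turn nested classes into a nested dict (hypothetical helper)."""
    tree = {}
    for name, value in vars(cls).items():
        if inspect.isclass(value):
            tree[name] = to_tree(value)      # a group or a rule
        elif not name.startswith("__"):      # skip __module__, __dict__, etc.
            tree[name] = value               # a rule parameter such as level = 1
    return tree


rules = to_tree(_)
print(rules["chocolate"]["lindt"]["_"]["creamy"])
```

Because class bodies keep their definition order, the resulting dicts list rules in the order they were written.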
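As an intermediate step toward a full parser, you can use the parser that ships with Python, which parses Python syntax and returns an abstract syntax tree (AST). That AST is very deep, with lots of (in my opinion) unnecessary levels. You can collapse it into a much simpler structure by culling every node that has only one child. With this approach you can do something like this: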
# Note: the `parser` and `symbol` modules were removed in Python 3.10,
# so this snippet needs Python 2 or Python 3.9 and earlier.
import parser, token, symbol, pprint

# Map numeric token and grammar-symbol codes to their readable names.
_map = dict(list(token.tok_name.items()) + list(symbol.sym_name.items()))

def clean_ast(ast):
    if not isinstance(ast, list):
        return ast
    elif len(ast) == 2:  # Elide single-child nodes.
        return clean_ast(ast[1])
    else:
        return [_map[ast[0]]] + [clean_ast(a) for a in ast[1:]]

ast = parser.expr('''{
    '*': {  # Applies to all data
        isYummy: _,  # Everything must be yummy
        chocolate: {  # To validate, say `validate("chocolate", object)`
            sweet: _,  # chocolate must be sweet (but not necessarily chocolate.*)
            lindt: {  # To validate, say `validate("chocolate.lindt", object)`
                tasty: _,  # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)
                '*': {  # Applies to all data under chocolate.lindt
                    smooth: _,  # Could also be written smooth()
                    creamy: 1  # Level 1 creamy
                }
                # ...
            }
        }
    }
}''').tolist()
pprint.pprint(clean_ast(ast))
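This approach does have its limitations, though. The final AST is still a bit noisy, and the language you define has to be interpretable as valid Python code. For instance, you couldn't support this...
*:
    isYummy
...because this syntax doesn't parse as Python code. Its big advantage, though, is that you control the AST conversion, so it is impossible to inject arbitrary Python code.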
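The `parser` and `symbol` modules used in the snippet above were removed in Python 3.10. On current Python versions the same idea can be sketched with the standard `ast` module, which yields a much flatter tree to begin with. This is my own adaptation rather than the answer's code; the helper name `to_data` is hypothetical, and it accepts only dict displays, bare names, and literals, rejecting everything else.

```python
import ast


def to_data(node):
    """Convert a restricted expression AST to plain data without evaluating it."""
    if isinstance(node, ast.Dict):
        return {to_data(k): to_data(v) for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Name):       # bare identifiers such as isYummy or _
        return node.id
    if isinstance(node, ast.Constant):   # literals such as '*' or 1
        return node.value
    # Calls, attribute access, and so on are refused, so nothing is ever executed.
    raise ValueError("unsupported syntax: " + type(node).__name__)


SOURCE = """{
    '*': {                    # Applies to all data
        isYummy: _,
        chocolate: {
            sweet: _,
            lindt: {
                tasty: _,
                '*': {smooth: _, creamy: 1},
            },
        },
    },
}"""

rules = to_data(ast.parse(SOURCE, mode="eval").body)
print(rules["*"]["chocolate"]["lindt"]["*"])
```

Since the converter whitelists node types, an input such as `{a: __import__('os')}` raises `ValueError` instead of running anything, which matches the safety property described above.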