Natural language parser for sports play-by-play data

9 votes

2 answers

877 views

Asked 2025-04-17 06:40

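I'm trying to design a parser for (American) football games. I'm using the term "natural language" very loosely here, so please bear with me, as I know next to nothing about this field.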

Here are some examples of what I'm working with (format: time|down and distance|offensive team|description):

04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|

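So far I've written a simple parser that handles all the easy parts (game ID, quarter, time, down and distance, offensive team), along with some scripts that fetch this data and clean it up into the format shown above. Each line is turned into a "play" object and stored in a database.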

The hard part for me is parsing the play description. Here is some of the information I'd like to extract from that string:

Example string:

"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."

Result:

turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']

My initial parser logic went roughly like this:

# pass, rush or kick
# gain or loss of yards
# scoring play
    # Who scored? off or def?
    # TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
    # return yards?
# penalty?
    # def or off?
# turnover?
    # INT, fumble, to on downs?
# off play makers
# def play makers

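Descriptions can get quite complicated (multiple fumbles and recoveries, penalties, and so on), so I was wondering whether I could take advantage of some natural language processing (NLP) modules. I'll probably spend a few days on a simple, static state-machine parser, but if anyone has suggestions on how to approach this with NLP techniques, I'd love to hear them.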
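For reference, the outline above can be approximated with plain keyword and regex checks before any NLP is involved. This is only a minimal sketch that assumes the phrasing seen in the sample lines; the function name `parse_description`, the subset of fields, and the "two capitalized words" name heuristic are illustrative assumptions, and losses, penalties, and scoring plays are not handled.

```python
import re

def parse_description(desc):
    """Extract a few of the fields listed above using keyword and regex checks."""
    text = desc.lower()
    play = {
        "passing": bool(re.search(r"\bpass\b", text)),
        "rushing": bool(re.search(r"\brush\b", text)),
        "punt": bool(re.search(r"\bpunts\b", text)),
        "fumble": "fumble" in text,
        "direction": None,
        "yardage_diff": 0,
    }
    # Direction appears as "to the left/right" or "up the middle" in the samples.
    m = re.search(r"\b(?:to the (left|right)|up the (middle))\b", text)
    if m:
        play["direction"] = m.group(1) or m.group(2)
    # The first "for N yard(s)" is taken as the yardage (punt distance on punts).
    m = re.search(r"\bfor (\d+) yards?\b", text)
    if m:
        play["yardage_diff"] = int(m.group(1))
    # Crude heuristic: two adjacent capitalized words are treated as a player name.
    play["playmakers"] = re.findall(r"\b[A-Z][A-Za-z]+ [A-Z][A-Za-z]+\b", desc)
    return play

example = ("Mark Sanchez pass to the left to Shonn Greene for 7 yards "
           "to the NYJ44. Tackled by Mike Jenkins.")
print(parse_description(example))
```

Checks like these cover the regular cases quickly, but each new phrasing needs another rule, which is where the approach starts to strain.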


2 Answers

I think pyparsing would work well here, but rule-based systems are fairly brittle, so if you want to handle anything beyond football you may run into trouble.

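I think a better approach in this case would be a part-of-speech tagger combined with a lexicon of player names, positions, and other sports terminology. Feed that into your favorite machine learning tool, work out good features, and I think it would do pretty well.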
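To make the lexicon idea a bit more concrete, here is a rough sketch of the lookup and feature step only. It assumes tiny hand-written gazetteers (`PLAYERS`, `TERMS`) and made-up tag names; a real setup would load full rosters, add part-of-speech tags from an actual tagger, and pass the resulting features to a classifier, none of which is shown here.

```python
import re

# Hypothetical gazetteers; a real system would load full rosters and a larger term list.
PLAYERS = {"Mark Sanchez", "Shonn Greene", "Mike Jenkins"}
TERMS = {"pass": "PLAY_PASS", "rush": "PLAY_RUSH", "punts": "PLAY_PUNT",
         "fumble": "EVENT_FUMBLE", "tackled": "EVENT_TACKLE"}

def tag_tokens(desc):
    """Label each token as PLAYER, a football term tag, NUM, or O (other)."""
    # Join known player names with "_" so multi-word names become single tokens.
    for player in PLAYERS:
        desc = desc.replace(player, player.replace(" ", "_"))
    tagged = []
    for tok in re.findall(r"[A-Za-z_]+\d*|\d+", desc):
        if tok.replace("_", " ") in PLAYERS:
            tagged.append((tok.replace("_", " "), "PLAYER"))
        elif tok.lower() in TERMS:
            tagged.append((tok, TERMS[tok.lower()]))
        elif tok.isdigit():
            tagged.append((tok, "NUM"))
        else:
            tagged.append((tok, "O"))
    return tagged

def features(tagged):
    """Count the non-O tags; the counts can serve as a feature dict for a classifier."""
    counts = {}
    for _, tag in tagged:
        if tag != "O":
            counts[tag] = counts.get(tag, 0) + 1
    return counts

example = ("Mark Sanchez pass to the left to Shonn Greene for 7 yards "
           "to the NYJ44. Tackled by Mike Jenkins.")
print(features(tag_tokens(example)))
```

Tag counts are about the simplest feature set possible; tag order or neighboring-token context would likely be needed to tell, say, the passer from the receiver.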

NLTK is a good place to start learning NLP. Unfortunately, the field isn't very mature yet, and no single tool solves everything easily.

Answered 2025-04-17 by Python大师


I think pyparsing would be very useful here.

Your input text looks very regular (unlike real natural language), and pyparsing handles that kind of input well. It's worth a look.

For example, to parse the following sentences:

Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.

You could define a parse expression along these lines (see the documentation for the exact syntax):

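from pyparsing import Group, Keyword, Optional, Word, alphanums, alphas, nums, oneOf

# Player name: two alphabetic words, e.g. "Mat McBriar"
name = Group(Word(alphas) + Word(alphas))("name")

# Play type, optionally followed by "to the left" / "to the right"
action = oneOf("punts rush")("action") + Optional(Keyword("to") + Keyword("the") + oneOf("left right")("direction"))

# Yardage, e.g. "32 yards" or "1 yard"
distance = Word(nums)("distance") + (Keyword("yards") | Keyword("yard"))

# Full sentence: name, action, "for N yards", then "to [the] <spot>"
pattern = name + action + Keyword("for") + distance + Keyword("to") + Optional(Keyword("the")) + Word(alphanums)("spot")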

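pyparsing will then split the string according to this pattern. It also returns a dictionary-like result containing the named items extracted from the sentence: name, action, and distance.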
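A self-contained way to try this on the two sample sentences might look like the following. It is a sketch rather than a complete grammar: it assumes pyparsing is installed, the result names `direction` and `spot` are additions for illustration, and it only covers the "punts" and "rush" phrasings shown above.

```python
from pyparsing import Group, Keyword, Optional, Word, alphanums, alphas, nums, oneOf

# Grammar pieces; the results names ("name", "action", ...) become dictionary keys.
name = Group(Word(alphas) + Word(alphas))("name")
action = oneOf("punts rush")("action")
direction = Optional(Keyword("to") + Keyword("the") + oneOf("left right")("direction"))
distance = Word(nums)("distance") + (Keyword("yards") | Keyword("yard"))
spot = Keyword("to") + Optional(Keyword("the")) + Word(alphanums)("spot")
pattern = name + action + direction + Keyword("for") + distance + spot

sentences = [
    "Mat McBriar punts for 32 yards to NYJ14.",
    "Mark Sanchez rush to the right for 3 yards to the NYJ24.",
]
for sentence in sentences:
    # parseString stops at the end of the pattern, so the trailing "." is ignored.
    result = pattern.parseString(sentence)
    print(result.asDict())
```

A sentence that doesn't fit the pattern (for example "rush up the middle") raises a `ParseException`, so in practice each play type would get its own alternative in the grammar.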

Answered 2025-04-17 by Python大师
