Parsing .srt files with regular expressions

User

Reply #1 · edited 2024-06-06 05:35:56

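Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like this:

an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line

...and repeat. Note the "one or more lines" part: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.

So, just take advantage of the structure. This way you can parse everything in a single pass, without needing to read more than one line into memory at a time, while still keeping all the information for each subtitle together.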

from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

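For example, using the sample from the SRT doc page, this gives me one list per subtitle, holding that subtitle's raw lines (the number, the timing line, then the text lines), each still ending in its newline.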
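
To try the chunking step without a file on disk, here is a minimal sketch that runs the same groupby call over an in-memory string. The sample text mirrors the two subtitles used in this answer; the names SAMPLE and iter_chunks are mine, purely for illustration.

```python
import io
from itertools import groupby

# Illustrative sample input; a real .srt file would be opened with open(...)
SAMPLE = (
    "1\n"
    "00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\n"
    "our final approach into Coruscant.\n"
    "\n"
    "2\n"
    "00:02:20,476 --> 00:02:22,501\n"
    "Very good, Lieutenant.\n"
)

def iter_chunks(lines):
    """Yield one list of raw lines per subtitle, lazily."""
    # groupby splits the stream into runs of non-blank and blank lines
    for is_text, group in groupby(lines, lambda line: bool(line.strip())):
        if is_text:  # skip the runs of blank separator lines
            yield list(group)

chunks = list(iter_chunks(io.StringIO(SAMPLE)))
print(len(chunks))  # number of subtitle blocks found
print(chunks[1])    # raw lines of the second subtitle, newlines included
```

Because iter_chunks is a generator, a caller that loops over it directly only ever holds one subtitle's lines at a time.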

I can then go further and turn that into a list of meaningful objects:

from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')

subs = []

for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

subs
Out[65]: 
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
 Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
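
If you want the times as values you can do arithmetic on, rather than strings, the start and end fields can be converted afterwards. A minimal sketch, assuming every timestamp has exactly the HH:MM:SS,mmm layout shown in the output above; parse_timestamp is a hypothetical helper name, not part of any library.

```python
from datetime import timedelta

def parse_timestamp(stamp):
    """Convert 'HH:MM:SS,mmm' into a timedelta (assumes that exact layout)."""
    clock, millis = stamp.split(',')
    hours, minutes, seconds = clock.split(':')
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), milliseconds=int(millis))

start = parse_timestamp('00:02:17,440')
end = parse_timestamp('00:02:20,375')
duration = end - start  # how long the subtitle stays on screen
print(duration)
```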

User

Reply #2 · edited 2024-06-06 05:35:56

Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
String: *[a-zA-Z]+*
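
As a rough sketch of how the time pattern could be used in Python: the version below is the same pattern with {n} repeat counts instead of repeated [0-9], and re.fullmatch takes the place of the ^ and $ anchors. The names TIMING and is_timing_line are made up for this example.

```python
import re

# Same pattern as the time regex above, compacted with {n} repeat counts
TIMING = re.compile(
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}"
)

def is_timing_line(line):
    """Return True if the whole line (ignoring surrounding whitespace) is a timing line."""
    return TIMING.fullmatch(line.strip()) is not None

print(is_timing_line("00:02:17,440 --> 00:02:20,375\n"))
print(is_timing_line("Very good, Lieutenant."))
```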

Hope this helps.

User

Reply #3 · edited 2024-06-06 05:35:56

I disagree with @roippi. Regex is a very good solution for text matching, and the regex for this problem is not complicated.

import re

# yoursrtfile is a placeholder for the path to your .srt file
with open(yoursrtfile) as f:
    # Read the whole file content
    content = f.read()
# Find all matches in content.
# The first group (...) captures the timing line, \s+ skips the whitespace after it,
# and (.+) captures the text that follows. Since . does not match a newline,
# that is only the first line of each subtitle's text.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
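
One caveat with the pattern above: subtitles with two or more text lines lose everything after the first line. A possible variant, offered only as a sketch, captures every non-empty line up to the next blank line. It assumes "\n" line endings (which is what reading a file in Python's default text mode gives you); the sample string and the name BLOCK are mine.

```python
import re

# Illustrative sample; in practice content would come from f.read()
content = (
    "1\n00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\nour final approach into Coruscant.\n"
    "\n"
    "2\n00:02:20,476 --> 00:02:22,501\n"
    "Very good, Lieutenant.\n"
)

# Group 1: the timing line.
# Group 2: one or more non-empty lines, each ended by a newline or end of input.
BLOCK = re.compile(
    r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\n((?:.+(?:\n|\Z))+)"
)

matches = BLOCK.findall(content)
for timing, text in matches:
    # splitlines() turns the captured block back into individual text lines
    print(timing, text.splitlines())
```

The repeated group stops at the first empty line because .+ needs at least one character, so each match covers exactly one subtitle.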
