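<p>I wrote a solution based on the assumption that <em>if there are two consecutive starts, I can always drop the first one, provided the entries are sorted by date and time</em>.</p>
<p>I modified the input file slightly, replacing the <code>date and time</code> with a number. The code can easily be adapted to handle dates and times.</p>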
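<p>As a rough sketch of that adaptation: only the parsing step has to change, because <code>datetime</code> objects sort chronologically just like the numbers do. The line layout <code>YYYY-MM-DD HH:MM:SS session:ID start|stop</code> and the names <code>LINE_RE</code> and <code>parse_line</code> are assumptions made for illustration; adjust the pattern and format string to the real file.</p>

```python
import re
from datetime import datetime

# Assumed (hypothetical) line layout:
#   "2020-01-02 03:04:05 session:1234 start"
LINE_RE = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) session:(\d+) (\w+)'
)


def parse_line(line):
    """Return (timestamp, session_id, action), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, session_id, action = m.groups()
    # datetime objects compare chronologically, so they can be used
    # directly as the sort key for each session's list
    return datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S'), session_id, action
```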
<p>The code has three parts:</p>
<ol>
<li>Read the file and parse it into a useful data structure</li>
<li>Sort the data for each session</li>
<li>Remove the unwanted elements</li>
</ol>
<p>My approach:</p>
<pre><code>import re
import collections

# parse each line into a dict like
# {sessionid: [(time, start/stop), ...]}
pattern = re.compile(r'(\d+) session:(\d+) (\w+)')
sessions = collections.defaultdict(list)
with open(your_file_name_here, 'r') as f:
    for line in f:
        m = pattern.match(line)
        if m is None:
            continue  # skip blank or malformed lines
        time, session_id, action = m.groups()
        # store the time as an int so sorting is numeric, not lexicographic
        sessions[session_id].append((int(time), action))

# for each session, sort the list
# I kept this loop separate from the next one
# since OP said he had data already sorted
for k, v in sessions.items():
    sessions[k] = sorted(v, key=lambda x: x[0])

# for each session remove unwanted elements
for k, v in sessions.items():
    # pair each element with its successor;
    # a default element is appended to manage the last element of the list
    pairs = zip(v, v[1:] + [(None, 'start')])
    # build a new list instead of removing from v while iterating over it,
    # which would make the iteration skip elements
    sessions[k] = [cur for cur, nxt in pairs
                   if not (cur[1] == 'start' and nxt[1] == 'start')]
</code></pre>
<p>Sample file content:</p>
<pre><code>1 session:1234 start
2 session:2345 start
3 session:3456 start
4 session:1234 stop
5 session:7890 start
6 session:4567 start
7 session:2345 stop
8 session:4567 stop
</code></pre>
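<p>To check the approach against this sample, here is a minimal self-contained sketch. <code>clean_sessions</code> is a hypothetical helper name introduced for illustration, and it reads from a list of lines rather than from a file.</p>

```python
import re
from collections import defaultdict

PATTERN = re.compile(r'(\d+) session:(\d+) (\w+)')


def clean_sessions(lines):
    """Group events by session, sort them by time, and drop every 'start'
    that is followed by another 'start' or by nothing at all."""
    sessions = defaultdict(list)
    for line in lines:
        m = PATTERN.match(line)
        if m is None:
            continue  # ignore blank or malformed lines
        time, session_id, action = m.groups()
        sessions[session_id].append((int(time), action))

    cleaned = {}
    for session_id, events in sessions.items():
        events.sort(key=lambda e: e[0])
        # Sentinel 'start' so that a trailing unmatched start is dropped too.
        successors = events[1:] + [(None, 'start')]
        cleaned[session_id] = [
            cur for cur, nxt in zip(events, successors)
            if not (cur[1] == 'start' and nxt[1] == 'start')
        ]
    return cleaned


sample = """\
1 session:1234 start
2 session:2345 start
3 session:3456 start
4 session:1234 stop
5 session:7890 start
6 session:4567 start
7 session:2345 stop
8 session:4567 stop
""".splitlines()

result = clean_sessions(sample)
for session_id, events in result.items():
    print(session_id, events)
```

<p>On this sample, sessions 1234, 2345 and 4567 keep their start/stop pair, while 3456 and 7890 end up with empty lists because their only start is never followed by a stop.</p>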