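Tokenizing a string whose fields are separated by an uneven number of spaces
I'm trying to extract data from a file, but because the number of spaces between fields is inconsistent, I can't use line.split(" "). I've copied a few lines from the file below: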
"08-09-2010 21:21:46 00:22:7f:a6:9b:69 -79"
"08-09-2010 21:21:46 04:4f:aa:b4:49:49 -79"
"08-09-2010 21:21:46 04:4f:aa:31:4e:59 tikona 18002090044 -83"
"08-09-2010 21:21:46 00:22:7f:26:9b:69 tikona 18002090044 -74"
"08-09-2010 21:21:46 04:4f:aa:34:0d:c9 tikona 18002090044 -82"
"08-09-2010 21:21:46 04:4f:aa:71:4e:59 -85"
"08-09-2010 21:21:46 04:4f:aa:34:21:89 tikona 18002090044 -75"
"08-09-2010 21:21:46 04:4f:aa:34:49:49 tikona 18002090044 -77"
"08-09-2010 21:21:46 04:4f:aa:74:0d:c9 -85"
"08-09-2010 21:22:47 18 APs were seen
"
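I need to access the first column (which should become a datetime object), the second column (values like 00:22...), and the last column (-79 and so on). I can get at the first and second columns without trouble, but the last one is awkward. When I use info = line.split(" "), I can't tell how many items I'll get back, because the third column may or may not be present.
How can I access the fourth column? Is there a way to do it with something like info[i].contains(" -")?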
4 Answers
1
You can split it with a regular expression:
#!/usr/bin/env python
import re

# host and host_id are optional; each optional group carries its own
# leading whitespace, so a single space between columns is enough.
mac_data_re = re.compile(
    r'^(?P<date>[\d-]+)\s+'
    r'(?P<time>[\d:]+)\s+'
    r'(?P<mac>[\da-f:]+)'
    r'(?:\s+(?P<host>\w+))?'
    r'(?:\s+(?P<host_id>\d+))?'
    r'\s+(?P<final_number>-?\d+)$')

with open('list') as f:
    for line in (l.strip() for l in f):
        match = mac_data_re.match(line)
        if match:
            # Missing optional groups are reported as None.
            print("date={date}, time={time}, mac={mac}, host={host}, host_id={host_id} final_number={final_number}".format(**match.groupdict()))
        else:
            print("Line not matched: '%s'" % line)
Here is the output:
aid@bullet:~/tmp$ ./parse_list.py
date=08-09-2010, time=21:21:46, mac=00:22:7f:a6:9b:69, host=None, host_id=None final_number=-79
date=08-09-2010, time=21:21:46, mac=04:4f:aa:b4:49:49, host=None, host_id=None final_number=-79
date=08-09-2010, time=21:21:46, mac=04:4f:aa:31:4e:59, host=tikona, host_id=18002090044 final_number=-83
date=08-09-2010, time=21:21:46, mac=00:22:7f:26:9b:69, host=tikona, host_id=18002090044 final_number=-74
date=08-09-2010, time=21:21:46, mac=04:4f:aa:34:0d:c9, host=tikona, host_id=18002090044 final_number=-82
date=08-09-2010, time=21:21:46, mac=04:4f:aa:71:4e:59, host=None, host_id=None final_number=-85
date=08-09-2010, time=21:21:46, mac=04:4f:aa:34:21:89, host=tikona, host_id=18002090044 final_number=-75
date=08-09-2010, time=21:21:46, mac=04:4f:aa:34:49:49, host=tikona, host_id=18002090044 final_number=-77
date=08-09-2010, time=21:21:46, mac=04:4f:aa:74:0d:c9, host=None, host_id=None final_number=-85
Line not matched: '08-09-2010 21:22:47 18 APs were seen'
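The question also wants the first column as a datetime object. A minimal sketch of that last step, assuming the log writes dates day-first (the sample value 08-09-2010 does not settle this). The helper name parse_timestamp is hypothetical, not part of the answer above:

```python
from datetime import datetime

def parse_timestamp(date_text, time_text, fmt="%d-%m-%Y %H:%M:%S"):
    """Combine the date and time fields into a single datetime.

    The day-first default is an assumption; pass "%m-%d-%Y %H:%M:%S"
    if the log turns out to be month-first.
    """
    return datetime.strptime(date_text + " " + time_text, fmt)

stamp = parse_timestamp("08-09-2010", "21:21:46")
print(stamp.isoformat())  # 2010-09-08T21:21:46
```

With the regex above, the two arguments would be match.group('date') and match.group('time').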
7
The columns look fixed-width. If they are, you can slice the string and then call .strip() to remove the trailing spaces.
>>> for line in data.split('\n'):
...     print((line[1:25].strip(), line[26:45].strip(), line[46:69].strip(), line[70:-1].strip()))
...
('08-09-2010 21:21:46', '00:22:7f:a6:9b:69', '', '-79')
('08-09-2010 21:21:46', '04:4f:aa:b4:49:49', '', '-79')
('08-09-2010 21:21:46', '04:4f:aa:31:4e:59', 'tikona 18002090044', '-83')
('08-09-2010 21:21:46', '00:22:7f:26:9b:69', 'tikona 18002090044', '-74')
('08-09-2010 21:21:46', '04:4f:aa:34:0d:c9', 'tikona 18002090044', '-82')
('08-09-2010 21:21:46', '04:4f:aa:71:4e:59', '', '-85')
('08-09-2010 21:21:46', '04:4f:aa:34:21:89', 'tikona 18002090044', '-75')
('08-09-2010 21:21:46', '04:4f:aa:34:49:49', 'tikona 18002090044', '-77')
('08-09-2010 21:21:46', '04:4f:aa:74:0d:c9', '', '-85')
('08-09-2010 21:22:47', '18 APs were seen', '', '')
('', '', '', '')
The final ('', '', '', '') is there because the last line of the input is just a lone ".
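If the columns are not fixed-width, you can still split the string with .split() and use index -1 to get the last column. Be careful with .split() here, though, because it gets messy if you pick the wrong delimiter. I'd suggest splitting on two spaces, which copes with the 18 APs were seen line, but note that this changes the index of the second column.
>>> for line in data.split('\n'):
...     fields = line.split('  ')
...     print((fields[0], fields[3], fields[-1]))
...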
('"08-09-2010 21:21:46', '00:22:7f:a6:9b:69', ' -79"')
('"08-09-2010 21:21:46', '04:4f:aa:b4:49:49', ' -79"')
('"08-09-2010 21:21:46', '04:4f:aa:31:4e:59', '-83"')
('"08-09-2010 21:21:46', '00:22:7f:26:9b:69', '-74"')
('"08-09-2010 21:21:46', '04:4f:aa:34:0d:c9', '-82"')
('"08-09-2010 21:21:46', '04:4f:aa:71:4e:59', ' -85"')
('"08-09-2010 21:21:46', '04:4f:aa:34:21:89', '-75"')
('"08-09-2010 21:21:46', '04:4f:aa:34:49:49', '-77"')
('"08-09-2010 21:21:46', '04:4f:aa:74:0d:c9', ' -85"')
('"08-09-2010 21:22:47', '18 APs were seen', '18 APs were seen')
('"08-09-2010 21:21:46', '00:22:7f:26:9b:69', '-74"')
Traceback (most recent call last):
  File "<input>", line 3, in <module>
IndexError: list index out of range
The IndexError is caused by the last line of your input. If that is your real input, you should handle the error.
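A minimal sketch of one way to handle such lines without letting the IndexError escape. It assumes every real data line has a MAC address as its third whitespace-separated field and a signed integer as its last field. The function name parse_line is hypothetical and does not come from the answer above:

```python
def parse_line(line):
    """Return (timestamp_text, mac, signal), or None for non-data lines."""
    # split() with no argument treats any run of whitespace as one delimiter.
    fields = line.strip().strip('"').split()
    if len(fields) < 4:
        return None  # blank line or a lone quote character
    mac, last = fields[2], fields[-1]
    if mac.count(":") != 5:
        return None  # summary lines such as "18 APs were seen"
    try:
        signal = int(last)
    except ValueError:
        return None  # last field is not a number
    return " ".join(fields[:2]), mac, signal

print(parse_line('"08-09-2010 21:21:46 04:4f:aa:31:4e:59 tikona 18002090044 -83"'))
print(parse_line('"08-09-2010 21:22:47 18 APs were seen'))
```

Returning None instead of raising lets the calling loop skip summary and blank lines with a simple if check.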
0
You can use the rsplit method to get the last value, along these lines: "".rsplit(" ", 1).
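As a small sketch of that idea on one of the sample lines (the extra spaces are added here for illustration): passing None as the separator makes rsplit treat any run of whitespace as a single delimiter, so the uneven spacing does not matter.

```python
line = "08-09-2010 21:21:46 04:4f:aa:31:4e:59   tikona 18002090044   -83"

# Split once, starting from the right, on any run of whitespace.
rest, last = line.rsplit(None, 1)
print(last)  # -83
```

A line containing only one field would not unpack into two names, so real input probably still needs a length check first.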