Python: removing tabs from strings and tokenizing a list
I have tried many times, but I cannot get this to work.
Input:
condor t airline airline
eight n 0 flightnumber
nine n 0 flightnumber
five n 0 flightnumber
hallo t 0 sentence
turn t com turn_heading
left t 0 direction
heading t com turn_heading
three n 0 degree_absolute
two n 0 degree_absolute
zero n 0 degree_absolute
Desired output:
<s> <callsign> <airline> condor </airline> <flightnumber> eight nine five </flightnumber> </callsign> hallo <command="turn_heading"> turn <direction> left </direction> heading <degree_absolute> three two zero </degree_absolute> </command> </s>
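Whenever I read the input, the tabs get in the way of splitting the strings, whether I read it in as a list or as a string. This is what I get when I try to strip the tabs: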
['condor\tt\tairline\tairline\n', 'eight\tn\t \tflightnumber\n', 'nine\tn\t \tflightnumber\n', 'five\tn\t \tflightnumber\n', 'hallo\tt\t \tsentence\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'three\tn\t \tdegree_absolute\n', 'two\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n', '\n', 'aeh\tt\t \tsentence\n', 'two\tn\t \tflightnumber\n', 'eight\tn\t \tflightnumber\n', 'november\tt\tflightnumber\tflightnumber\n', 'hallo\tt\t \tsentence\n', 'reduce\tt\tcom\treduce\n', 'two\tn\t \tspeed\n', 'two\tn\t \tspeed\n', 'zero\tn\t \tspeed\n', 'knots\tt\t \treduce\n', '\n', 'condor\tt\tairline\tairline\n', 'eight\tn\t \tflightnumber\n', 'nine\tn\t \tflightnumber\n', 'five\tn\t \tflightnumber\n', 'descend\tt\tcom\tdescend\n', 'three\tn\t \taltitude\n', 'thousand\tn\t \taltitude\n', 'feet\tt\t \tdescend\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'two\tn\t \tdegree_absolute\n', 'six\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n', 'cleared\tt\tcom\tcleared_ils\n', 'ils\tt\t \tcleared_ils\n', 'runway\tt\t \tcleared_ils\n', 'two\tn\t \trunway\n', 'three\tn\t \trunway\n', 'left\tt\t \trunway\n', 'turn\tt\tcom\tturn_heading\n', 'left\tt\t \tdirection\n', 'heading\tt\tcom\tturn_heading\n', 'two\tn\t \tdegree_absolute\n', 'five\tn\t \tdegree_absolute\n', 'zero\tn\t \tdegree_absolute\n']
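Is there a way to strip the tabs, split the fields, and convert them into the tagged format?
The code I have been using to remove the control characters: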
import string

with open('input.txt', 'r') as file1:
    lines = str(list(file1))
    print lines.translate(string.maketrans("\n\t\r", "   "))
1 Answer
This is straightforward if you use the csv module:
>>> import csv
>>> f = ["condor\tt\tairline\tairline",
...      "eight\tn\t0\tflightnumber",
...      "nine\tn\t0\tflightnumber",
...      "turn\tt\tcom\tturn_heading",
...      "left\tt\t0\tdirection"]  # fake 'file' for testing
>>> list(csv.DictReader(f, delimiter="\t"))
[{'condor': 'eight', 't': 'n', 'airline': 'flightnumber'},
 {'condor': 'nine', 't': 'n', 'airline': 'flightnumber'},
 {'condor': 'turn', 't': 't', 'airline': 'turn_heading'},
 {'condor': 'left', 't': 't', 'airline': 'direction'}]
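Note that I specified delimiter='\t' here, which tells the reader that the file is tab-separated (rather than comma-separated, the default). I also used DictReader, so each row automatically becomes a dictionary of the form {fieldname: value, ...}. Because no fieldnames argument is passed, DictReader treats the first row as the header, which is why the keys above are 'condor', 't', and 'airline' (the duplicate 'airline' column keeps only its last value).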
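The question's file has no header row, so a sketch that supplies the column names explicitly may be closer to what is needed. This is written for Python 3, and the names word, kind, group, and tag are hypothetical labels chosen for illustration; the question does not name its columns.

```python
import csv

# Fake 'file' for testing: the csv module accepts any iterable of lines.
lines = [
    "condor\tt\tairline\tairline",
    "eight\tn\t0\tflightnumber",
    "left\tt\t0\tdirection",
]

# Hypothetical column names (not from the question).
FIELDS = ["word", "kind", "group", "tag"]

# With fieldnames given, the first line is read as data, not as a header.
rows = list(csv.DictReader(lines, fieldnames=FIELDS, delimiter="\t"))

for row in rows:
    print(row["word"], row["tag"])
```

With the names supplied, the first line stays in the data and the third column is no longer overwritten by a duplicate key.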
You can then process those dictionaries into whatever format you want.
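As one possible way to do that processing, here is a minimal Python 3 sketch that turns the parsed rows into the tagged format. The nesting rules are assumptions inferred from the single desired-output line in the question: airline and flightnumber are wrapped in callsign, a row whose third column is com opens a command, words tagged sentence stay bare, and a blank line ends an utterance. The helper names read_utterances, wanted_path, and tag_utterance are hypothetical, and csv.reader is used instead of DictReader because the file has no header row.

```python
import csv

CALLSIGN_TAGS = {"airline", "flightnumber"}  # assumption: these form <callsign>


def read_utterances(lines):
    """Split tab-separated lines into utterances at blank lines."""
    block = []
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:  # csv.reader yields [] for a blank line
            if block:
                yield block
            block = []
            continue
        word, _kind, group, tag = fields  # assumes exactly four columns
        block.append((word, group.strip(), tag))
    if block:
        yield block


def wanted_path(group, tag, command):
    """Return the elements that should be open around a word (assumed rules)."""
    if tag == "sentence":
        return []
    if tag in CALLSIGN_TAGS:
        return ["callsign", tag]
    if group == "com" or tag == command:
        return ['command="%s"' % tag]
    if command is not None:
        return ['command="%s"' % command, tag]
    return [tag]


def tag_utterance(rows):
    """Render one utterance as a single tagged string."""
    out, stack, command = ["<s>"], [], None
    for word, group, tag in rows:
        path = wanted_path(group, tag, command)
        # Close elements until the open stack is a prefix of the wanted path.
        while stack and stack != path[:len(stack)]:
            out.append("</%s>" % stack.pop().split("=")[0])
        # Open whatever is still missing.
        for element in path[len(stack):]:
            out.append("<%s>" % element)
            stack.append(element)
        out.append(word)
        # Remember the name of the command that is open, if any.
        in_command = bool(path) and path[0].startswith("command=")
        command = path[0].split('"')[1] if in_command else None
    while stack:  # close anything still open at the end of the utterance
        out.append("</%s>" % stack.pop().split("=")[0])
    out.append("</s>")
    return " ".join(out)


sample = [
    "condor\tt\tairline\tairline\n",
    "eight\tn\t0\tflightnumber\n",
    "nine\tn\t0\tflightnumber\n",
    "five\tn\t0\tflightnumber\n",
    "hallo\tt\t0\tsentence\n",
    "turn\tt\tcom\tturn_heading\n",
    "left\tt\t0\tdirection\n",
    "heading\tt\tcom\tturn_heading\n",
    "three\tn\t0\tdegree_absolute\n",
    "two\tn\t0\tdegree_absolute\n",
    "zero\tn\t0\tdegree_absolute\n",
    "\n",
    "hallo\tt\t0\tsentence\n",
    "reduce\tt\tcom\treduce\n",
    "two\tn\t0\tspeed\n",
    "knots\tt\t0\treduce\n",
]

for utterance in read_utterances(sample):
    print(tag_utterance(utterance))
```

For the first block of the sample, this prints the desired output line from the question. Since the rules were inferred from that one example, other data may need different handling; for instance, two consecutive commands with the same name would be merged into one element.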