Python regular expressions on EDI documents

E2EDP10001 300 5 0 02 4M0 035 503 K C5M ANTENNE 24 L06S2 CKL 17105098 16.0 20170516 32.0 2 03 000006
E2EDP16001 300 6 5 033D20170609 20170609 24
E2EDP16001 300 7 5 033D20170630 20170630 0
E2EDP16001 300 8 5 033D20170728 20170728 8
E2EDP16001 300 9 5 033I20170731 20170806 8
E2EDP16001 300 10 5 033I20170828 20170903 8
E2EDP16001 300 11 5 033I20170918 20170924 8
E2EDP16001 300 12 5 033I20171016 20171022 8
E2EDP16001 300 13 5 033I20171023 20171029 0
E2EDP16001 300 14 5 033I20171030 20171105 1

import os
import sys
import re

abs_path = os.path.dirname(sys.argv[0])

#delfor_folder
delfor_path = abs_path+r"/DELFOR"
delfor_archive_path = delfor_path+"/Archive"

#deljit_folder
deljit_path = abs_path+r"/DELJIT"
deljit_archive_path = deljit_path+"/Archive"

#delins_folder
delins_path = abs_path+r"/DELINS"
delins_archive_path = delins_path+"/Archive"

counter = 0

#loop over the delfor files
for file in os.listdir(delfor_archive_path):
    if os.path.isdir(file):
        continue
    elif file.__contains__(".txt"):
        data = []
        #partner = 'unknown'
        with open(delfor_archive_path+"/"+file) as f_input:
            for row in f_input:
                try:
                    data_row = re.match(r'(\d+) +(\d+) +(\d+) +(\d{8}) +(\d{8})', row)
                    continue
                except:
                    data_row = re.match(r'(\d{3})+I(\d+) +(\d{8}) +(\d+)',row)
                cliente_row = re.match(r'EDKA1003 +(\d+) +(\d+) +(\w+) +(\w+)', row)
                materiale_row = re.match(r'EDP10001 +(\d) +(\d+) +(\w+)', row)
                #TODO remove the first two characters of the last group
                print(cliente_row)
                '''
                if data_row:
                    if data_row.groups()[0]:
                        data.append([partner] + list(data_row.groups()[:-1]))
                    else:
                        partner = data_row.groups()[-1]
                '''

1 answer

User

#1 · Posted 2024-06-07 03:46:48

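I don't fully understand what you are trying to do. It looks like you have mixed tabular records with different numbers of columns, so I would call split on each row to get a list of columns. You can then apply a regex to individual columns only, or split a column's string further as needed.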

Here is my code (partial only):

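data = []
with open('data.txt') as f_input:
    for row in f_input:
        cols = row.split()  # split on runs of whitespace
        if len(cols) == 8:
            # keep the first four columns and merge the last four into one marked field
            data.append(cols[:4] + ['**' + ' '.join(cols[4:8]) + '**'])
        elif len(cols) == 7:
            data.append(cols)

# print one record per line
print(str(data).replace('],', '],\n'))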

Program output:

[['E2EDP10001', '300', '5', '0', '**02 4M0 035 503**'],
 ['E2EDP16001', '300', '6', '5', '033D20170609', '20170609', '24'],
 ['E2EDP16001', '300', '7', '5', '033D20170630', '20170630', '0'],
 ['E2EDP16001', '300', '8', '5', '033D20170728', '20170728', '8'],
 ['E2EDP16001', '300', '9', '5', '033I20170731', '20170806', '8'],
 ['E2EDP16001', '300', '10', '5', '033I20170828', '20170903', '8'],
 ['E2EDP16001', '300', '11', '5', '033I20170918', '20170924', '8'],
 ['E2EDP16001', '300', '12', '5', '033I20171016', '20171022', '8'],
 ['E2EDP16001', '300', '13', '5', '033I20171023', '20171029', '0'],
 ['E2EDP16001', '300', '14', '5', '033I20171030', '20171105', '1']]
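Once a row is split into columns, a regex can be applied to a single column instead of the whole line. The sketch below is only an illustration: the name `split_code_column` and the pattern are assumptions, and the reading of the fifth column as "three digits, one letter, an 8-digit date" is a guess based on the sample rows, not something the data format is known to guarantee.

```python
import re

# Hypothetical pattern: three digits, one uppercase letter, then eight digits.
CODE_DATE_RE = re.compile(r'(\d{3})([A-Z])(\d{8})')

def split_code_column(col):
    """Return the three parts of a column like '033I20170731', or None."""
    m = CODE_DATE_RE.fullmatch(col)
    return m.groups() if m else None

cols = 'E2EDP16001 300 9 5 033I20170731 20170806 8'.split()
parts = split_code_column(cols[4])
print(parts)
```

Using `fullmatch` here means a column with extra leading or trailing characters is rejected rather than partially matched.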

If the string does not match, re.match does not raise an exception; it simply returns None. So I don't think the try ... except block makes sense.
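A minimal sketch of checking the return value instead of catching an exception. The helper name `match_or_none` and the first pattern are made up for illustration; the second pattern is the one from the question.

```python
import re

def match_or_none(pattern, row):
    """Return the captured groups, or None when the row does not match."""
    m = re.match(pattern, row)
    if m is None:  # "no match" is reported as None, not as an exception
        return None
    return m.groups()

row = 'E2EDP16001 300 9 5 033I20170731 20170806 8'
hit = match_or_none(r'(\w+) +(\d+) +(\d+)', row)
miss = match_or_none(r'(\d{3})+I(\d+) +(\d{8}) +(\d+)', row)
print(hit, miss)
```

Here `miss` is None because re.match only tries to match at the start of the string, and this row starts with a letter.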

As for how to improve your code, I don't like that you use this:

file.__contains__(".txt"), which I would replace with file.endswith(".txt").

I also always try to join paths with os.path.join(dir1, dir2, ...).
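Both suggestions can be combined in one small helper. This is a sketch, not a drop-in replacement: the function name `list_txt_files` and its parameters are hypothetical, and it assumes the `<base>/<message type>/Archive` folder layout from the question.

```python
import os

def list_txt_files(base_dir, message_type):
    """Return full paths of the .txt files in <base_dir>/<message_type>/Archive."""
    archive_dir = os.path.join(base_dir, message_type, "Archive")
    paths = []
    for name in sorted(os.listdir(archive_dir)):
        full_path = os.path.join(archive_dir, name)
        # Test the full path: os.listdir returns bare names, so checking the
        # bare name would look in the current working directory instead.
        if os.path.isfile(full_path) and name.endswith(".txt"):
            paths.append(full_path)
    return paths
```

It could be called as `list_txt_files(abs_path, "DELFOR")`, and likewise for the other two folders.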

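Another thing: if you need to parse a lot of data with regex, it is better to use a compiled regular expression.

For example:

At the start of the script you define: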

DATA_ROW_RE = re.compile(r'(\d{3})+I(\d+) +(\d{8}) +(\d+)')

Later in the script you just use:

data_row = DATA_ROW_RE.match(row)
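Put together, the compiled pattern might be used as in the sketch below. The function name `extract_data_rows` and the sample list are made up for illustration. Note one deliberate change: it calls `search` rather than `match`, on the assumption that the rows begin with a record id such as E2EDP16001, as in the sample data, so the pattern cannot match at the very start of the line.

```python
import re

# Compiled once, then reused for every row.
DATA_ROW_RE = re.compile(r'(\d{3})+I(\d+) +(\d{8}) +(\d+)')

def extract_data_rows(rows):
    """Collect the captured groups of every row that contains the pattern."""
    results = []
    for row in rows:
        m = DATA_ROW_RE.search(row)  # search anywhere in the row, not only at the start
        if m:
            results.append(m.groups())
    return results

sample = [
    'E2EDP16001 300 8 5 033D20170728 20170728 8',
    'E2EDP16001 300 9 5 033I20170731 20170806 8',
]
print(extract_data_rows(sample))
```

Only the second sample row contains the literal I that the pattern requires, so only that row contributes a result.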
