Formatting output in Python

Posted 2024-05-16 09:02:29


Given the following binary file (it can be downloaded from here):

*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT EC HI IM IP ME PD PK PO RE SD ST TO TU UR
ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef
ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef
ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef
ENTRY = A 23187
ENTRY = A23187, Antibiotic
MN = D03.633.100.221.173
PA = Anti-Bacterial Agents
PA = Calcium Ionophores
MH_TH = FDA SRS (2014)
MH_TH = NLM (1975)
ST = T109
ST = T195
N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))-
RN = 37H9VM9WZL
RR = 52665-69-7 (Calcimycin)
PI = Antibiotics (1973-1974)
PI = Carboxylic Acids (1973-1974)
MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.
OL = use CALCIMYCIN to search A 23187 1975-90
PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)
HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)
MR = 20160527
DA = 19741119
DC = 1
DX = 19840101
UI = D000001

*NEWRECORD
RECTYPE = D
MH = Temefos
AQ = AA AD AE AG AI AN BL CF CH CL CS CT EC HI IM IP ME PD PK RE SD ST TO TU UR
ENTRY = Abate|T109|T131|TRD|NRW|NLM (1996)|941114|abbcdef
ENTRY = Difos|T109|T131|TRD|NRW|UNK (19XX)|861007|abbcdef
ENTRY = Temephos|T109|T131|TRD|EQV|NLM (1996)|941201|abbcdef
MN = D02.705.400.625.800
MN = D02.705.539.345.800
MN = D02.886.300.692.800
PA = Insecticides
MH_TH = FDA SRS (2014)
MH_TH = INN (19XX)
MH_TH = USAN (1974)
ST = T109
ST = T131
N1 = Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester
RN = ONP3ME32DL
RR = 3383-96-8 (Temefos)
AN = for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropriate qualifiers may be used when other aspects of the insecticide are discussed such as the effect on a physiologic process or behavioral aspect of the insect; for poisoning, coordinate with ORGANOPHOSPHATE POISONING
PI = Insecticides (1966-1971)
MS = An organothiophosphate insecticide.
PM = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)
HN = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)
MR = 20130708
DA = 19990101
DC = 1
DX = 19910101
UI = D000002

I have the following Python code:

import re

terms = {}
numbers = {}

meshFile = 'd2017.bin'
with open(meshFile, mode='rb') as file:
    mesh = file.readlines()

outputFile = open('mesh.txt', 'w')

for line in mesh:
    meshTerm = re.search(b'MH = (.+)$', line)
    if meshTerm:
        term = meshTerm.group(1)
    meshNumber = re.search(b'MN = (.+)$', line)
    if meshNumber:
        number = meshNumber.group(1)
        numbers[str(number)] = term
        if term in terms:
            terms[term] = terms[term] + ' ' + str(number)
        else:
            terms[term] = str(number)

cumlist = []
keylist = terms.keys()
for key in keylist:
    #print('THE ORIGIN FOR ', key, file=outputFile)

    item_list = terms[key].split(" ")
    for phrase in item_list:
        cumlist.append(phrase)

print(cumlist)

for item in cumlist:
    print(numbers[str(item)], '\n', item, file=outputFile)

The output looks like this:

b'Calcimycin\r' 
 b'D03.633.100.221.173\r'
b'Temefos\r' 
 b'D02.705.400.625.800\r'
b'Temefos\r' 
 b'D02.705.539.345.800\r'
b'Temefos\r' 
 b'D02.886.300.692.800\r'

How can I reformat the output so that it looks like this:

Calcimycin 
D03.633.100.221.173
Temefos 
D02.705.400.625.800
D02.705.539.345.800
D02.886.300.692.800

Thanks.


1 answer
Answer 1 · posted 2024-05-16 09:02:29
UPDATE: I simplified the source a bit

You can try this regular expression:

MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*)

Demo
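
As a quick illustration of how the two alternatives in this pattern behave, here is a minimal sketch. The helper name `capture` and the line `MH = Two Words` are made up for the example; the other lines come from the sample records. Group 1 is filled for an `MH` line and group 2 for an `MN` line. Because group 1 uses `\w+`, a term that contains a space or a comma would be cut off at the first non-word character, so the pattern may need adjusting for such terms.

```python
import re

# Same pattern as in the answer: group 1 = MH term, group 2 = MN tree number.
PATTERN = re.compile(r"MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*)")


def capture(line):
    """Return the (mh_term, mn_number) groups for a line, or None if nothing matches."""
    match = PATTERN.search(line)
    return match.groups() if match else None


# "MH = Two Words" is a hypothetical line used only to show the \w+ limitation.
for line in ("MH = Temefos", "MN = D02.705.400.625.800",
             "MH_TH = INN (19XX)", "MH = Two Words"):
    print(repr(line), "->", capture(line))
```

The `MH_TH` line is not matched because the underscore after `MH` prevents `MH\s*=` from matching.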

Sample code (Run it here):

import re

regex = r"MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*)"

test_str = ("*NEWRECORD\n"
    "RECTYPE = D\n"
    "MH = Calcimycin\n"
    "AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT EC HI IM IP ME PD PK PO RE SD ST TO TU UR\n"
    "ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef\n"
    "ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef\n"
    "ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef\n"
    "ENTRY = A 23187\n"
    "ENTRY = A23187, Antibiotic\n"
    "MN = D03.633.100.221.173\n"
    "PA = Anti-Bacterial Agents\n"
    "PA = Calcium Ionophores\n"
    "MH_TH = FDA SRS (2014)\n"
    "MH_TH = NLM (1975)\n"
    "ST = T109\n"
    "ST = T195\n"
    "N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))-\n"
    "RN = 37H9VM9WZL\n"
    "RR = 52665-69-7 (Calcimycin)\n"
    "PI = Antibiotics (1973-1974)\n"
    "PI = Carboxylic Acids (1973-1974)\n"
    "MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.\n"
    "OL = use CALCIMYCIN to search A 23187 1975-90\n"
    "PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n"
    "HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n"
    "MR = 20160527\n"
    "DA = 19741119\n"
    "DC = 1\n"
    "DX = 19840101\n"
    "UI = D000001\n\n"
    "*NEWRECORD\n"
    "RECTYPE = D\n"
    "MH = Temefos\n"
    "AQ = AA AD AE AG AI AN BL CF CH CL CS CT EC HI IM IP ME PD PK RE SD ST TO TU UR\n"
    "ENTRY = Abate|T109|T131|TRD|NRW|NLM (1996)|941114|abbcdef\n"
    "ENTRY = Difos|T109|T131|TRD|NRW|UNK (19XX)|861007|abbcdef\n"
    "ENTRY = Temephos|T109|T131|TRD|EQV|NLM (1996)|941201|abbcdef\n"
    "MN = D02.705.400.625.800\n"
    "MN = D02.705.539.345.800\n"
    "MN = D02.886.300.692.800\n"
    "PA = Insecticides\n"
    "MH_TH = FDA SRS (2014)\n"
    "MH_TH = INN (19XX)\n"
    "MH_TH = USAN (1974)\n"
    "ST = T109\n"
    "ST = T131\n"
    "N1 = Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester\n"
    "RN = ONP3ME32DL\n"
    "RR = 3383-96-8 (Temefos)\n"
    "AN = for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropriate qualifiers may be used when other aspects of the insecticide are discussed such as the effect on a physiologic process or behavioral aspect of the insect; for poisoning, coordinate with ORGANOPHOSPHATE POISONING\n"
    "PI = Insecticides (1966-1971)\n"
    "MS = An organothiophosphate insecticide.\n"
    "PM = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n"
    "HN = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n"
    "MR = 20130708\n"
    "DA = 19990101\n"
    "DC = 1\n"
    "DX = 19910101\n"
    "UI = D000002\n\n\n\n\n\n\n"
    "Calcimycin \n"
    "D03.633.100.221.173\n"
    "Temefos \n"
    "D02.705.400.625.800\n"
    "D02.705.539.345.800\n"
    "D02.886.300.692.800")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for match in matches:
    # Each match fills exactly one group: group 1 for an MH line, group 2 for an MN line.
    for value in match.groups():
        if value is not None:
            print(value)

Sample output:

Calcimycin
D03.633.100.221.173
Temefos
D02.705.400.625.800
D02.705.539.345.800
D02.886.300.692.800
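
For reference, the `b'...'` wrappers and the trailing `\r` in the question's output come from reading the file in binary mode and then calling `str()` on the resulting `bytes` objects. A minimal alternative sketch, not part of the original answer, decodes each line and strips the line ending before grouping the `MN` numbers under their `MH` term. The function names `group_tree_numbers` and `format_groups` are hypothetical, and UTF-8 is an assumed encoding for the file.

```python
def group_tree_numbers(lines):
    """Map each MH term to the list of MN numbers found in its record."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    term = None
    for raw in lines:
        # Assumption: the file is UTF-8 text; strip() also removes '\r\n'.
        line = raw.decode("utf-8").strip()
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # e.g. '*NEWRECORD' or a blank line
        if key == "MH":
            term = value
        elif key == "MN" and term is not None:
            groups.setdefault(term, []).append(value)
    return groups


def format_groups(groups):
    """Render each term once, followed by its numbers, one item per line."""
    out = []
    for term, numbers in groups.items():
        out.append(term)
        out.extend(numbers)
    return "\n".join(out)


# A few lines shaped like the question's records, as readlines() in 'rb' mode returns them.
sample = [
    b"*NEWRECORD\r\n", b"MH = Calcimycin\r\n", b"MN = D03.633.100.221.173\r\n",
    b"MH_TH = NLM (1975)\r\n", b"\r\n",
    b"*NEWRECORD\r\n", b"MH = Temefos\r\n", b"MN = D02.705.400.625.800\r\n",
    b"MN = D02.705.539.345.800\r\n", b"MN = D02.886.300.692.800\r\n",
]
print(format_groups(group_tree_numbers(sample)))
```

With the real file, the list returned by `file.readlines()` in the question's code could be passed in place of `sample`. Splitting on the first ` = ` keeps whole terms, including ones with spaces or commas, and terms without any `MN` line are left out, as in the question's code.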
