Cleaning up a text file in Python according to these criteria

0 votes

4 answers

4880 views

Asked 2025-04-18 01:56

I'm trying to clean up a text file according to a few criteria.

My text looks like this:

NHIST_0003 (ZS.MC.BGE.0424SPVCOS) (21.12) 14.08
(ZS.MC.BLK.0424SPVCOS) (21.12) 14.08
(ZS.MC.GRY.0424SPVCOS) (21.12) 14.08
(ZS.MC.BLK.0525SPVCOS3) (21.12) 14.08
(ZS.MC.GRY.0525SPVCOS2) (21.12) 14.08
NHIST_0004 (ZS.MC.BGE.0424SPVCOS) (21.12) 14.08

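What I need to do: if a line has any text at the start, delete everything before the first "(", and also strip the parentheses from the text I want to keep. I also need to remove the numbers that are in parentheses. Looking at the first line, I only want to keep:

ZS.MC.BGE.0424SPVCOS 14.08

This is the code I wrote while trying to tidy up the content. I'd rather not use regular expressions, because they're too complicated for me.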

fileName='reach.txt'
fileName2='outreach.txt'


while True:
    f=open(fileName,'r')
    for words in f:
        x=words.split('(', 1)[-1]
        g = open(fileName2,'w')
        g.write(x)
        g.close()

This loop never ends. I thought closing the file would tell the system to stop processing lines.


4 Answers

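blacklist = set('1234567890.')
with open('reach.txt') as infile, open('outreach.txt', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if "(" not in line:
            continue  # skip blank lines and lines without a parenthesized code
        # drop everything up to and including the first "("
        _left, line = line.split("(", 1)
        # remove the remaining parentheses from each whitespace-separated token
        parts = [p.rstrip(")").lstrip("(") for p in line.split()]
        # drop purely numeric tokens, except the last one
        parts = [p for i, p in enumerate(parts) if not all(char in blacklist for char in p) or i == len(parts) - 1]
        outfile.write("%s\n" % ' '.join(parts))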

With your example reach.txt, I get

ZS.MC.BGE.0424SPVCOS 14.08
ZS.MC.BLK.0424SPVCOS 14.08
ZS.MC.GRY.0424SPVCOS 14.08
ZS.MC.BLK.0525SPVCOS3 14.08
ZS.MC.GRY.0525SPVCOS2 14.08
ZS.MC.BGE.0424SPVCOS 14.08
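
As for why the loop in the question never finishes: `while True` has no exit condition, so once the `for` loop reaches the end of the file, the file is simply opened and read again. Closing a file does not break out of a loop. Opening the output with 'w' inside the `for` loop also truncates it on every line, so at most one line would survive. Below is a minimal sketch without regular expressions, assuming every useful line has the layout shown in the question; the helper names `clean_line` and `clean_file` are made up for illustration.

```python
def clean_line(line):
    """Return 'CODE VALUE' for one input line, or None if the line has no '('."""
    # partition never raises: sep is '' when '(' is absent
    _before, sep, rest = line.partition('(')
    if not sep:
        return None
    # text up to the first ')' is the code we want to keep
    code, _sep, tail = rest.partition(')')
    tokens = tail.split()
    if not tokens:
        return code
    # the last whitespace-separated token is the trailing number
    return code + ' ' + tokens[-1]


def clean_file(src, dst):
    # Open each file exactly once; the for loop ends by itself at end of file.
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            cleaned = clean_line(line)
            if cleaned is not None:
                outfile.write(cleaned + '\n')


sample = [
    'NHIST_0003 (ZS.MC.BGE.0424SPVCOS) (21.12) 14.08',
    '(ZS.MC.BLK.0525SPVCOS3) (21.12) 14.08',
]
for line in sample:
    print(clean_line(line))
```

Calling `clean_file('reach.txt', 'outreach.txt')` would then process the whole file in a single pass.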

Answered 2025-04-18 by Python大师


You could try a regular expression, provided every line follows a format like (the code you want) (the stuff you don't want).

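import re
infile = 'reach.txt'
outfile = 'outreach.txt'

with open(infile, 'r') as inf, open(outfile, 'w') as outf:
    for line in inf:
        # each line has "* (what you want) (trash) *"
        matches = re.findall(r"\(([A-Za-z0-9.]*)\)", line)
        if not matches:
            continue  # skip lines without a parenthesized item
        # always take the first one
        first = matches[0]

        items = line.split()
        # the value to keep is the last whitespace-separated item
        second = items[-1]
        to_write = " ".join((first, second))
        outf.write(to_write + "\n")

The regular expression r"\(([A-Za-z0-9.]*)\)" matches any combination (expressed as [ ]*) of:

letters (A-Za-z),
digits (0-9), and
the period (.)

These must sit between parentheses (\( \)); the inner group captures only the text between them, so the parentheses themselves are dropped.

Given your example, there will always be two matches, such as ZS.MC.BLK.0424SPVCOS and 21.12. re.findall finds both, in the order they appear in the line. Because the one you want is always first, you can take it with re.findall(regex, line)[0].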
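
To see what re.findall returns for one of the sample lines, here is a small sketch. It uses a pattern whose capture group sits inside the escaped parentheses, so the matches come back without them; `sample` is simply the first line from the question. It also shows why the letter range is better written A-Za-z than A-z.

```python
import re

# Capture group sits inside the escaped parentheses, so matches exclude them.
pattern = re.compile(r"\(([A-Za-z0-9.]*)\)")

sample = "NHIST_0003 (ZS.MC.BGE.0424SPVCOS) (21.12) 14.08"
matches = pattern.findall(sample)
print(matches)       # ['ZS.MC.BGE.0424SPVCOS', '21.12']
print(matches[0])    # ZS.MC.BGE.0424SPVCOS

# [A-z] is wider than it looks: it also covers the ASCII characters
# that lie between 'Z' and 'a', such as the underscore.
print(bool(re.fullmatch(r"[A-z]", "_")))      # True
print(bool(re.fullmatch(r"[A-Za-z]", "_")))   # False
```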

Answered 2025-04-18 by Python大师


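fileName = 'reach.txt'
fileName2 = 'outreach.txt'

def isfloat(s):
    """Return True if s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

with open(fileName, 'r') as fh, open(fileName2, 'w') as g:
    for row in fh:
        x = row.split()
        # first token that carries parentheses, with the parentheses removed
        first = next((item.strip('()') for item in x if '(' in item and ')' in item), None)
        # last token that parses as a float (parenthesized numbers do not)
        second = next((item for item in reversed(x) if isfloat(item)), None)
        if first is None or second is None:
            continue  # skip blank or malformed lines
        print(first, second)
        g.write(' '.join((first, second)) + '\n')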

This gives:

ZS.MC.BGE.0424SPVCOS 14.08
ZS.MC.BLK.0424SPVCOS 14.08
ZS.MC.GRY.0424SPVCOS 14.08
ZS.MC.BLK.0525SPVCOS3 14.08
ZS.MC.GRY.0525SPVCOS2 14.08
ZS.MC.BGE.0424SPVCOS 14.08

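This code copes with several kinds of irregularity in the data. For example, it still works if the decimal value is not the last item on the line, and it still works if the (...) item is in the first position rather than the second.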
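
To make those cases concrete, here is a small sketch of the same idea applied to single lines in memory. The function name `extract` and the reordered third line are made up for illustration; they are not part of the original data.

```python
def isfloat(s):
    """Return True if s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def extract(row):
    """Return (code, value) from one line, or None if either part is missing."""
    tokens = row.split()
    # first token that carries parentheses, with the parentheses removed
    code = next((t.strip('()') for t in tokens if '(' in t and ')' in t), None)
    # last token that parses as a float; '(21.12)' does not, so it is skipped
    value = next((t for t in reversed(tokens) if isfloat(t)), None)
    if code is None or value is None:
        return None
    return code, value


print(extract('NHIST_0003 (ZS.MC.BGE.0424SPVCOS) (21.12) 14.08'))
print(extract('(ZS.MC.BLK.0424SPVCOS) (21.12) 14.08'))
print(extract('(ZS.MC.GRY.0424SPVCOS) 14.08 (21.12)'))  # value not last
print(extract(''))                                      # blank line
```

One caveat of relying on float(): tokens such as nan or inf also parse as floats, so this approach assumes the data contains no such words.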

Answered 2025-04-18 by Python大师


You can iterate over each line in the file like this:

with open('filename.txt') as f:
    for line in f:
        pass  # do stuff with line

To extract the information you need from a line, you can do this:

cleaned = []
items = line.split()
for item in items:
    if item.startswith('(') and item.endswith(')'):
        cleaned.append(item.strip('()'))
        break
cleaned.append(items[-1])
cleaned = ' '.join(cleaned)

The complete program looks like this:

in_file = 'reach.txt'
out_file = 'outreach.txt'

def clean(string):
    items = string.split()
    if not items:
        # blank or whitespace-only line: nothing to keep
        return ''

    cleaned = []
    for item in items:
        # keep only the first parenthesized token, without its parentheses
        if item.startswith('(') and item.endswith(')'):
            cleaned.append(item.strip('()'))
            break
    cleaned.append(items[-1])
    return ' '.join(cleaned)

with open(in_file) as i, open(out_file, 'w') as o:
    for line in i:
        result = clean(line)
        if result:
            o.write(result + '\n')
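
One detail to watch with this kind of per-line function: lines read from a file keep their trailing newline, so an empty line arrives as '\n', not ''. A check such as `if not string` does not catch it, while indexing the last item of an empty token list raises IndexError. A minimal sketch of the behaviour, where `last_token` is a hypothetical helper:

```python
blank = '\n'  # what iterating over a file yields for an empty line

print(bool(blank))      # True: a bare newline is a non-empty string
print(blank.split())    # []: but it contains no tokens


def last_token(line):
    """Hypothetical helper: last whitespace-separated token, or '' if none."""
    items = line.split()
    return items[-1] if items else ''


print(repr(last_token(blank)))                                # ''
print(last_token('(ZS.MC.BLK.0424SPVCOS) (21.12) 14.08\n'))   # 14.08
```

Testing the token list rather than the raw string handles blank and whitespace-only lines alike.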

Answered 2025-04-18 by Python大师
