Extract specific lines from a file and create blocks of data in Python

数据工程师

Asked 2025-04-17 02:28

I'm trying to write a Python script that extracts lines from a file. The file is a text file containing a dump of Python suds output.

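What I want to do:

Strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc.
Find a section that starts with "ArrayOf_xsd_string".
Remove the next line, "item[] =", from the result.
Grab the remaining 6 lines and build a dictionary keyed on the unique number on the fifth line (e.g. 123456, 234567, 345678), with the remaining lines as the values (apologies if I'm not explaining this in proper Python terms).
Write the result to a file.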

The data in the file is a list:

[(ArrayOf_xsd_string){
   item[] = 
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "002",
      "ABCD",
      "1234",
      "wordy type stuff",
      "234567",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] = 
      "003",
      "ABCD",
      "1234",
      "wordy type stuff",
      "345678",
      "more stuff, etc",
 }]

I tried using re.compile; here is my poor attempt at the code:

import re, string

f = open('data.txt', 'rb')
linelist = []
for line in f:
  line = re.compile('[\W_]+')
  line.sub('', string.printable)
  linelist.append(line)
  print linelist

newlines = []
for line in linelist:
    mylines = line.split()
    if re.search(r'\w+', 'ArrayOf_xsd_string'):
      newlines.append([next(linelist) for _ in range(6)])
      print newlines

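I'm still a Python newbie, and I couldn't find anything on Google or StackOverflow about how to extract a specific number of lines after finding a particular piece of text. Any help is much appreciated.

Please ignore my code, as I'm just "shooting in the dark" :)

This is the result I'd like to see: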

123456: 001,ABCD,1234,wordy type stuff,more stuff etc
234567: 002,ABCD,1234,wordy type stuff,more stuff etc
345678: 003,ABCD,1234,wordy type stuff,more stuff etc

I hope this helps clarify what my less-than-perfect code was trying to do.

Tags: regex, file handling, text processing, data extraction, data formatting, dictionary creation, line processing, text cleaning

3 Answers

Let's have some fun with iterators!

class SudsIterator(object):
    """extracts xsd strings from suds text file, and returns a 
    (key, (value1, value2, ...)) tuple with key being the 5th field"""
    def __init__(self, filename):
        self.data_file = open(filename)
    def __enter__(self):  # __enter__ and __exit__ are there to support 
        return self       # `with SudsIterator as blah` syntax
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.data_file.close()
    def __iter__(self):
        return self
    def next(self):     # in Python 3+ this should be __next__
        """looks for the next 'ArrayOf_xsd_string' item and returns it as a
        tuple fit for stuffing into a dict"""
        data = self.data_file
        for line in data:
            if 'ArrayOf_xsd_string' not in line:
                continue
            ignore = next(data)
            val1 = next(data).strip()[1:-2] # discard beginning whitespace,
            val2 = next(data).strip()[1:-2] #   quotes, and comma
            val3 = next(data).strip()[1:-2]
            val4 = next(data).strip()[1:-2]
            key = next(data).strip()[1:-2]
            val5 = next(data).strip()[1:-2]
            break
        else:
            self.data_file.close() # make sure file gets closed
            raise StopIteration()  # and keep raising StopIteration
        return key, (val1, val2, val3, val4, val5)
    __next__ = next     # Python 3 looks up __next__, so alias it to the same method

data = dict()
for key, value in SudsIterator('data.txt'):
    data[key] = value

print(data)
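
The same walk-and-grab idea could also be sketched as a generator function, which avoids the class boilerplate. This is only an illustration, not part of the answer above: the name `iter_records`, the `SAMPLE` text, and the use of `io.StringIO` in place of the real data.txt are all assumptions made so the snippet runs on its own.

```python
import io
from itertools import islice

# Two records in the same shape as the question's suds dump (sample data, assumed).
SAMPLE = '''[(ArrayOf_xsd_string){
   item[] =
      "001",
      "ABCD",
      "1234",
      "wordy type stuff",
      "123456",
      "more stuff, etc",
 }, (ArrayOf_xsd_string){
   item[] =
      "002",
      "ABCD",
      "1234",
      "wordy type stuff",
      "234567",
      "more stuff, etc",
 }]
'''

def iter_records(lines):
    """Yield (key, values) for each 'ArrayOf_xsd_string' block in an iterator of lines."""
    lines = iter(lines)
    for line in lines:
        if 'ArrayOf_xsd_string' not in line:
            continue
        next(lines, None)                      # skip the 'item[] =' line
        # take the next 6 lines, dropping whitespace, quotes and the trailing comma
        fields = [l.strip().strip('",') for l in islice(lines, 6)]
        if len(fields) < 6:                    # truncated block at end of input
            return
        key = fields.pop(4)                    # the 5th field is the unique number
        yield key, tuple(fields)

data = dict(iter_records(io.StringIO(SAMPLE)))
print(data)
```

With a real file, `iter_records(open('data.txt'))` inside a `with` block should behave the same way, since a file object is also an iterator of lines.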

Answered 2025-04-17 by Python大师


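If you want to extract a specific number of lines after finding a particular line, you can read the file into a list with readlines, loop over it looking for the matching line, and then take the next N lines from that list. If the file is large, a while loop with readline is a better choice.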
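
A minimal sketch of the readlines-then-slice idea described above. The helper name `lines_after`, its parameters, and the short inline list standing in for the result of `readlines()` are assumptions for illustration only.

```python
def lines_after(lines, marker, count, skip=0):
    """For every line containing `marker`, collect the `count` lines that
    follow it, after skipping `skip` lines. Returns a list of lists."""
    blocks = []
    for i, line in enumerate(lines):
        if marker in line:
            start = i + 1 + skip
            blocks.append(lines[start:start + count])
    return blocks

# Stand-in for `open('data.txt').readlines()` (sample data, assumed).
lines = [
    '[(ArrayOf_xsd_string){\n',
    '   item[] = \n',
    '      "001",\n',
    '      "ABCD",\n',
    ' }]\n',
]
blocks = lines_after(lines, 'ArrayOf_xsd_string', count=2, skip=1)
print(blocks)
```

For the file in the question, `count=6` and `skip=1` would pick up the six quoted values and skip the "item[] =" line.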

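Below is the simplest fix to your code that I can think of, though it isn't necessarily the best implementation. Unless you just need to get the job done quickly, I'd recommend following the suggestion above instead.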

newlines = []
for i in range(len(linelist)):
    # test the line itself; re.search(r'\w+', 'ArrayOf_xsd_string') is always true
    if 'ArrayOf_xsd_string' in linelist[i]:
        for l in linelist[i+2:i+20]:
            newlines.append(l)
        print(newlines)

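If I've understood your requirements correctly, this should do what you want. It says: start from the second line after the match (which skips the "item[] =" line), take that line and the 17 lines after it (that is, up to but not including the 20th line after the match), and add them to newlines one at a time (you can't append the whole slice in one go, because append would add the list as a single element).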
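
The difference between appending a whole list and appending its items can be seen in a small sketch. The variable names and sample values are made up for illustration; `list.extend` is shown as a one-call equivalent of the loop.

```python
chunk = ['"001",', '"ABCD",']

nested = []
nested.append(chunk)        # the whole list becomes ONE element

flat = []
for item in chunk:          # appending item by item keeps the result flat
    flat.append(item)

flat_too = []
flat_too.extend(chunk)      # extend() does the same as the loop in one call

print(nested)
print(flat)
```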

Good luck, and have fun! :)

Answered 2025-04-17 by Python大师


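A few suggestions about your code:

Stripping all non-alphanumeric characters is unnecessary and a waste of time; there's no need to build linelist at all. Did you know you can simply use a plain string.find("ArrayOf_xsd_string") or re.search(...)?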

Strip all characters except words and numbers. I don't want any "\n", "[", "]", "{", "=", etc.
Find a section that starts with "ArrayOf_xsd_string".
Remove the next line, "item[] =", from the result.
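
The ways of testing for the marker mentioned above can be compared side by side. The sample line and variable names below are assumptions for illustration; the `in` operator is added as the usual shorthand for a find-based test.

```python
import re

line = ' }, (ArrayOf_xsd_string){\n'

has_marker = 'ArrayOf_xsd_string' in line            # membership test
pos = line.find('ArrayOf_xsd_string')                # index of the match, or -1 if absent
match = re.search(r'ArrayOf_xsd_string', line)       # match object, or None if absent

print(has_marker, pos, match is not None)
```

Note that in this file the marker is not at the very start of the line, so `line.startswith('ArrayOf_xsd_string')` would not find it.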

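As for your regex: _ counts as a word character (it is matched by \w, not \W), so [\W_]+ strips underscores as well, which would turn "ArrayOf_xsd_string" into "ArrayOfxsdstring". More importantly, the assignment below overwrites the line you just read: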

for line in f:
  line = re.compile('[\W_]+') # overwrites the line you just read??
  line.sub('', string.printable)
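
For comparison, here is a sketch of what that loop was presumably meant to do: compile the pattern once, then call sub on each line that was read. The variable names and the two sample lines are assumptions for illustration.

```python
import re

non_word = re.compile(r'[\W_]+')     # compile once, outside the loop

# Stand-in for lines read from the file (sample data, assumed).
raw_lines = ['[(ArrayOf_xsd_string){\n', '      "wordy type stuff",\n']

# Pattern.sub(replacement, text) returns a new string; the original line is untouched.
cleaned = [non_word.sub('', line) for line in raw_lines]
print(cleaned)
```

The result also loses the spaces inside values and the underscores inside the marker, which is one more reason to search the raw lines instead.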

Here's my version, which reads the file directly and handles multiple matches:

with open('data.txt', 'r') as f:
    theDict = {}
    found = -1
    for (lineno,line) in enumerate(f):
        if found < 0:
            if line.find('ArrayOf_xsd_string')>=0:
                found = lineno
                entries = []
            continue
        # Grab following 6 lines...
        if 2 <= (lineno-found) <= 6+1:
            entry = line.strip(' ""{}[]=:,\n') # include '\n', or the trailing quote and comma survive
            entries.append(entry)
        #then create a dict with the key from line 5
        if (lineno-found) == 6+1:
            key = entries.pop(4)
            theDict[key] = entries
            print(key, ','.join(entries)) # comma-separated, no quotes
            #break # if you want to end on first match
            found = -1 # to process multiple matches

And the output is essentially what you wanted (that's what ','.join(entries) is for):

123456 001,ABCD,1234,wordy type stuff,more stuff, etc
234567 002,ABCD,1234,wordy type stuff,more stuff, etc
345678 003,ABCD,1234,wordy type stuff,more stuff, etc
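
The question also asks for the result to be written to a file in a "key: values" layout. A minimal sketch, assuming the dictionary has already been built: the `records` sample, the `out_path` location in a temporary directory, and the file name output.txt are hypothetical.

```python
import os
import tempfile

# Stand-in for the dictionary built from the data file (sample data, assumed).
records = {
    '123456': ['001', 'ABCD', '1234', 'wordy type stuff', 'more stuff, etc'],
    '234567': ['002', 'ABCD', '1234', 'wordy type stuff', 'more stuff, etc'],
}

out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')  # hypothetical output location

with open(out_path, 'w') as out:
    for key, values in records.items():
        # one record per line: the key, a colon, then the comma-joined values
        out.write('{}: {}\n'.format(key, ','.join(values)))

with open(out_path) as check:        # read it back to confirm what was written
    written = check.read()
print(written)
```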

Answered 2025-04-17 by Python大师

