Parsing items from a text file

数据工程师

Asked 2025-04-15 23:57

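I have a text file in which some of the data is wrapped in tags of the form {[ ]}. What is a good way to extract the data inside those tags so that I can use it directly?

The contents of the text file look roughly like this:

'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

I would like to end up with a list containing 'really', 'way', 'get', and 'from'. I suppose I could do this with split, but it feels like there should be a better way. I have seen that there are many parsing libraries; is there one that is particularly well suited to what I want to do?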

regular expressions, text processing, data extraction, text parsing, data cleaning, parsing libraries, string splitting

4 Answers

Slower, bigger, and without regular expressions.

This is the old-school way :P

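def f(s):
    result = []   # extracted items
    stack = []    # brackets that are currently open
    tmp = ''      # characters of the item being read
    for c in s:
        if c in '{[':
            stack.append(c)
        elif c in ']}':
            stack.pop()
            if c == ']':
                # a closing square bracket ends the current item
                result.append(tmp)
                tmp = ''
        elif stack and stack[-1] == '[':
            # only collect characters that sit directly inside [...]
            tmp += c
    return result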

>>> s
'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> f(s)
['really', 'way', 'get', 'from']

Answered 2025-04-15 by Python大师

This is a job for regular expressions:

>>> import re
>>> text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
>>> re.findall(r'\{\[(\w+)\]\}', text)
['really', 'way', 'get', 'from']
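
One caveat with this pattern: `\w+` matches only letters, digits, and underscores, so a tag whose content contains a space or punctuation is skipped without any error. A minimal sketch of the difference, using a made-up `sample` string that is not from the question:

```python
import re

# Hypothetical sample: the second tag contains a space, the third a hyphen.
sample = 'keep {[this]} and {[two words]} and {[semi-colon]}'

# \w+ only matches word characters, so only the first tag is captured.
word_only = re.findall(r'\{\[(\w+)\]\}', sample)

# [^\]]+ accepts any run of characters other than a closing square bracket.
any_content = re.findall(r'\{\[([^\]]+)\]\}', sample)

print(word_only)    # ['this']
print(any_content)  # ['this', 'two words', 'semi-colon']
```

If the items are always single words, `\w+` is the stricter and arguably safer choice; the character-class version is only needed when the tagged content can be free text.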

Answered 2025-04-15 by Python大师

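I would use a regular expression. This answer assumes that the tag characters {}[] never appear inside another tag, that is, the tags are not nested.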

import re
text = 'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'

for s in re.findall(r'\{\[(.*?)\]\}', text):
    print(s)
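
The `?` after `.*` matters here. A short sketch of what happens without it, on a shortened version of the sample text (the variable names `lazy` and `greedy` are only for illustration):

```python
import re

text = 'not {[really]} useful in any {[way]}.'

# Non-greedy: stops at the first closing "]}" after each opening "{[".
lazy = re.findall(r'\{\[(.*?)\]\}', text)

# Greedy: runs on to the last "]}", swallowing everything in between.
greedy = re.findall(r'\{\[(.*)\]\}', text)

print(lazy)    # ['really', 'way']
print(greedy)  # ['really]} useful in any {[way']
```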

The same pattern using verbose mode in Python's regular expressions:

re.findall(r'''
    \{   # opening curly brace
    \[   # followed by an opening square bracket
    (    # capture the next pattern
    .*?  # followed by shortest possible sequence of anything
    )    # end of capture
    \]   # followed by closing square bracket
    \}   # followed by a closing curly brace
    ''', text, re.VERBOSE)
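
Since the question is about a text file, here is one way the pattern might be applied to a file line by line. This is only a sketch: `extract_items`, `TAG`, and the demo file are hypothetical names, and it assumes a tag never spans more than one line.

```python
import re
import tempfile
from pathlib import Path

# Compiled once so it can be reused for every line.
TAG = re.compile(r'''
    \{\[    # opening "{["
    (.*?)   # shortest possible run of characters
    \]\}    # closing "]}"
    ''', re.VERBOSE)

def extract_items(path):
    """Return every tagged item in the file at `path`, in order."""
    items = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            items.extend(TAG.findall(line))
    return items

# Demo with a throwaway file; in practice, pass the path of the real file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.txt'
    demo.write_text('I need to {[get]} some\nitems {[from]} it.\n',
                    encoding='utf-8')
    found = extract_items(demo)

print(found)  # ['get', 'from']
```

Reading line by line keeps memory use flat for large files; for a small file, calling `findall` once on the whole contents would work just as well.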

Answered 2025-04-15 by Python大师
