How do I extract the content between a prefix and a suffix?

2 Answers

User

#1 · edited 2024-04-20 10:48:15

If all you want is the (section-level, title) pairs for the whole file, a simple regex will do:

import re

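codewords = [
    'section',
    'subsection',
    # add other commands here if you want to
]

# Group 1 captures the command name, group 2 the text between the braces.
# The doubled braces are literal braces escaped for str.format.
regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))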

Example usage:
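The usage example is missing from this copy of the answer. As a stand-in, here is a minimal sketch: the sample `text` is an assumption, reconstructed only so that its headings agree with the grouped result shown further down, and the body lines are made up.

```python
import re

codewords = ['section', 'subsection']
regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))

# Hypothetical sample document; only the heading names come from the answer.
text = r"""
\section{First section}
Some introductory text.
\subsection{Subsection one}
\subsection{Subsection two}
\subsection{Subsection three}
\section{Second section}
\section{Third section}
\subsection{Last subsection}
"""

# findall returns one (command, title) tuple per match, in document order.
matches = regex.findall(text)
for level, title in matches:
    print(level, '->', title)
```

Each tuple holds the command name first and the title second, which is the shape the grouping code later in this answer relies on.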

By changing the contents of the codewords list you can match more kinds of commands.

To apply this to a file, simply read() it first:

with open('myfile.tex') as f:
    results = regex.findall(f.read())

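If you can guarantee that each of these commands fits on a single line, you can be more memory-efficient and do the following:

with open('myfile.tex') as f:
    results = []
    for line in f:
        results.extend(regex.findall(line))

Or if you want to be a bit fancier: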

from itertools import chain

with open('myfile.tex') as f:
    # chain.from_iterable is lazy, so consume it while the file is still open
    results = list(chain.from_iterable(map(regex.findall, f)))

Note, however, that if you have something like this:

\section{A very 
    long title}

the line-by-line versions will miss the command, whereas the solution using read() still captures that section.
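A small sketch of that difference, using the same two-command pattern written out literally. The whitespace clean-up at the end is an optional extra and not part of the original answer.

```python
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

# A title that is split across two lines, as in the example above.
text = "\\section{A very \n    long title}\n"

# Whole-text search: [^}] also matches newlines, so the command is found.
whole = regex.findall(text)

# Line-by-line search: neither line contains a complete command.
per_line = [m for line in text.splitlines(keepends=True)
            for m in regex.findall(line)]

# Optional: collapse the line break and indentation inside the title.
titles = [' '.join(title.split()) for _, title in whole]

print(whole)
print(per_line)
print(titles)
```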

In any case, be aware that even slight changes in formatting will break this kind of solution. To be safer, look for a proper LaTeX parser.
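To illustrate how easily the pattern is defeated, here is a short sketch. The three sample commands are made up for illustration; they are ordinary LaTeX forms that the regex above either misses or captures incorrectly.

```python
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

# Hypothetical inputs that differ only slightly from \section{Title}.
samples = {
    'starred': r'\section*{Unnumbered}',
    'optional argument': r'\section[Short]{Long title}',
    'nested braces': r'\section{The \textbf{bold} title}',
}

results = {name: regex.findall(src) for name, src in samples.items()}
for name, found in results.items():
    print(name, found)
```

The starred and optional-argument forms produce no match at all, and the nested-braces title is cut off at the first closing brace.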

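If you want to group together the subsections "contained" in a given section, you can do the grouping after obtaining the results with the solution above. You will have to use something like itertools.groupby.

from itertools import groupby, count, chain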

results = regex.findall(text)

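def make_key(counter):
    # Builds a key function for groupby: the key changes at every 'section'
    # match and stays the same for the subsections that follow it.
    def key(match):
        nonlocal counter
        val = next(counter)
        if match[0] == 'section':
            # A new section starts: advance to a fresh group id.
            val = next(counter)
        # Push the current id back so the next call sees it again.
        counter = chain([val], counter)
        return val
    return key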

organized_result = {}

for key, group in groupby(results, key=make_key(count())):
    _, section_name = next(group)
    organized_result[section_name] = section = []
    for _, subsection_name in group:
        section.append(subsection_name)

The final result will be:

In [12]: organized_result
Out[12]: 
{'First section': ['Subsection one', 'Subsection two', 'Subsection three'],
 'Second section': [],
 'Third section': ['Last subsection']}

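This matches the structure of the text at the beginning of the post.

If you want to make this extensible using the codewords list, things get considerably more complicated.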
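One possible way to generalize it, sketched under assumptions that are not in the original answer: the order of `codewords` is taken as the nesting order (earlier means higher level), and the result is a nested list of dicts instead of a flat dict. The helper `build_tree` and the sample `text` are hypothetical.

```python
import re

# Assumption: earlier entries are higher-level headings.
codewords = ['section', 'subsection', 'subsubsection']
regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(w) for w in codewords)))

def build_tree(matches, levels):
    """Nest (command, title) pairs according to the order of `levels`."""
    root = {'title': None, 'children': []}
    stack = [(-1, root)]  # (depth, node) for each currently open heading
    for command, title in matches:
        depth = levels.index(command)
        # Close every open heading at the same depth or deeper.
        while stack[-1][0] >= depth:
            stack.pop()
        node = {'title': title, 'children': []}
        stack[-1][1]['children'].append(node)
        stack.append((depth, node))
    return root['children']

# Hypothetical sample input.
text = r"""
\section{First section}
\subsection{Subsection one}
\subsubsection{Detail}
\subsection{Subsection two}
\section{Second section}
"""

tree = build_tree(regex.findall(text), codewords)
for node in tree:
    print(node['title'], [child['title'] for child in node['children']])
```

A stack keeps the chain of open headings, so a heading that appears without its parent level (a subsection before any section, for example) simply attaches to the nearest enclosing heading or to the top level.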

User

#2 · edited 2024-04-20 10:48:15

I think you should use the regular expression module.

import re

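# Raw string, so that \s is not treated as an (invalid) escape sequence
s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

# (?:sub)? makes the "sub" prefix optional; (.*?) lazily captures the title
pattern = re.compile(r'\\(?:sub)?section\{(.*?)\}')
pattern.findall(s)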

#output:
['example file', 'Latex']
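One caveat with this pattern, shown as a small sketch: `.` does not match a newline by default, so a title that wraps onto a second line is skipped unless the pattern is compiled with `re.DOTALL`. The input string here is made up for illustration.

```python
import re

# Hypothetical input with a title that wraps onto a second line.
s = "\\section{A very \n    long title} and \\subsection{Latex}"

single_line = re.compile(r'\\(?:sub)?section\{(.*?)\}')
multi_line = re.compile(r'\\(?:sub)?section\{(.*?)\}', re.DOTALL)

found_default = single_line.findall(s)  # misses the wrapped title
found_dotall = multi_line.findall(s)    # '.' now also matches '\n'

print(found_default)
print(found_dotall)
```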
