Regex re.sub on the front matter of a file

Posted 2024-05-15 03:06:29


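I am trying to retrieve the front matter from a .md file. When every front matter entry sits on a single line, I can retrieve the content.

Example: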

---
title: "Meeting"
date: 2019-03-14T07:51:28+01:00
draft: false
status:["process", "todo"]
---

So I wrote the following Python script to get the front matter:

import re


def get_front_matter(file, start='---', end='---'):
    """Strip file and retrieve front matter then format the value"""
    content = {}
    with open(file, 'r', encoding='UTF-8') as file_content:
        for content_line in file_content:
            if content_line.strip() == start:
                break
        for content_line in file_content:
            if content_line.strip() == end:
                break

            line_data = content_line.split(':', 1)
            # If we cannot split decently, carry on
            if len(line_data) != 2:
                continue
            # format the string to store in dict for better usage
            content[line_data[0]] = re.sub(r"[\n\t]*", "", line_data[1]).strip(' "')
    return content

However, if a front matter value spans multiple lines, I run into a problem:

---
title: "Meeting"
date: 2019-03-14T07:51:28+01:00
draft: false
status:
  [
    "process",
    "todo",
    "hold"
  ]
---

When I read the front matter of the file above, I get an empty value for status, but it should look like this:

{'title': 'Meeting', 'date': '2019-03-14T07:51:28+01:00', 'draft': 'false', 'status': '["process", "todo", "hold"]'}

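Is there another way to read the front matter, based on lines or on the tags? I tried a few regular expressions, but I could not capture a group of lines.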
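
To make the failure concrete, here is a minimal sketch that applies the same split-and-strip logic as the function above to the lines of the multi-line `status` entry. The names `collected` and `skipped` are only illustrative. The `status:` line splits into a key and a bare newline, and the lines that follow contain no `:`, so they are dropped.

```python
import re

# Lines as they would be read from the file for the multi-line entry
lines = ['status:\n', '  [\n', '    "process",\n', '  ]\n']

collected = {}   # key -> cleaned value, as in get_front_matter
skipped = []     # lines that could not be split into key and value

for line in lines:
    parts = line.split(':', 1)
    if len(parts) != 2:
        # No ':' on this line, so the original loop hits `continue`
        skipped.append(line)
        continue
    collected[parts[0]] = re.sub(r"[\n\t]*", "", parts[1]).strip(' "')

print(collected)
print(skipped)
```

Only the `status` key survives, with an empty string as its value, while the bracketed lines end up in `skipped`.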


1 Answer
User
#1 · Posted 2024-05-15 03:06:29

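I kept most of your code. The key point is not to add a value to the result until we are sure we have collected the complete value (when it is split across multiple lines). This is done by checking the next line: if it is a valid entry (key: some value), or if it is the closing delimiter ---, the previous key and its content are added to the result. I hope the comments make things clearer.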

    import re


    def get_front_matter(file, start='---', end='---'):
        """Strip file and retrieve front matter then format the value"""
        result = {}
        with open(file, 'r', encoding='UTF-8') as file_content:
            # Skip everything up to and including the opening delimiter
            for content_line in file_content:
                if content_line.strip() == start:
                    break

            content = ''
            key = ''
            for content_line in file_content:
                if content_line.strip() == end:
                    if key and content:
                        # add the last key: content before breaking out
                        result[key] = re.sub(r"[\n\t]*", "", content).strip(' "')
                    break

                line_data = content_line.split(':', 1)
                if len(line_data) == 2 and not content:
                    # first key: content pair; there is no previous content
                    # yet, so keep it and check the next line first
                    key = line_data[0]
                    content = line_data[1]
                    continue
                elif len(line_data) == 2:  # we found another valid key: value
                    # store the previous (key, content), keep the new pair
                    result[key] = re.sub(r"[\n\t]*", "", content).strip(' "')
                    key = line_data[0]
                    content = line_data[1]
                else:
                    # not a valid key: value, so it belongs to the previous
                    # value, which was split across multiple lines
                    content += content_line

        return result

Note: I renamed `content` to `result`. This code will break on input like the following:

    title: "Meeting"
    date: 2019-03-14T07:51:28+01:00
    draft: false
    status:
      [
        "somevalue:process",  # if the value contains ':'
        "todo",
        "hold"
      ]

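Here you have not specified how to tell a key apart from a value that contains ':' when no key precedes it. I hope this case does not come up in your files.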
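
One possible way around that ambiguity is sketched below. It rests on an assumption the question never states: a line that starts a new key is not indented, while continuation lines are. The function `parse_front_matter` and the pattern `KEY_LINE` are hypothetical names introduced for this illustration, and the function takes the file text as a string rather than a path to keep the example self-contained.

```python
import re

# Assumption: a key starts in column 0 and consists of word characters or
# hyphens; any other line inside the block continues the previous value.
KEY_LINE = re.compile(r'^([\w-]+):(.*)$')


def parse_front_matter(text, delimiter='---'):
    """Parse 'key: value' pairs found between two delimiter lines."""
    result = {}
    key = None
    inside = False
    for raw_line in text.splitlines():
        if raw_line.strip() == delimiter:
            if inside:
                break          # closing delimiter: stop reading
            inside = True      # opening delimiter: start reading
            continue
        if not inside:
            continue
        match = KEY_LINE.match(raw_line)
        if match:
            key = match.group(1)
            result[key] = match.group(2).strip()
        elif key is not None:
            # Indented or otherwise non-key line: append to the current value
            result[key] = (result[key] + ' ' + raw_line.strip()).strip()
    # Drop surrounding double quotes from simple scalar values
    return {k: v.strip('"') for k, v in result.items()}


sample = '''---
title: "Meeting"
date: 2019-03-14T07:51:28+01:00
draft: false
status:
  [
    "somevalue:process",
    "todo",
    "hold"
  ]
---
body text
'''

front_matter = parse_front_matter(sample)
print(front_matter)
```

Because the indented `"somevalue:process",` line does not match `KEY_LINE`, it is joined onto `status` instead of being read as a new key. If the files are valid YAML, a dedicated YAML parser would be a more robust choice than hand-written splitting.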
