How do I use a regex pattern in Python to capture the text from the first sentence through the last?

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees=== Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]]) "
"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities== Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===U.S. House of Representatives, 1847–1849=== [[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle

Title            Head                 TEXT
Albedo           Trees                Because forests generally have a low ...([[photosynthesis]])
Albedo           Human activities     Human activities (e.g., de...areas around
Abraham Lincoln  U.S. House of..1849  [[File:Abraham Lincoln by... line Whig,

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

with open("datos_titulos.csv", "r") as f:
    for line in f:
        pat = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)==="
        pat2 = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==$|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===$"
        if re.search(pat, line):
            pandas_dict["title"].append(re.search(pat, line).group(1))
            pandas_dict["head"].append(re.search(pat, line).group(2))
        if re.search(pat2, line):
            pandas_dict["text"].append(re.search(pat2, line).group(2))

df = pd.DataFrame(pandas_dict)

1 answer

User

#1 · Posted on 2024-04-23 20:33:26

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

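regx_title = r"TITULO: (.*) SUBTITULO"  # title between "TITULO: " and " SUBTITULO"
regx_head = r"={2,}(.*?)={2,}"          # heading between runs of two or more "="
regx_text = r"^(?!\")(.+)"              # any line that does not start with a double quote

# The position in this list becomes the column key (0, 1, 2) in pandas_dict.
regex_list = [regx_title, regx_head, regx_text]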

with open("datos_titulos.csv", "r") as f:
    for line in f:
        for i, regx in enumerate(regex_list):
            r = re.findall(regx, line)
            if r:
                pandas_dict[i].append(r[0])

df = pd.DataFrame(pandas_dict)
df = df.rename(columns={0:"Title", 1:"Head", 2:"TEXT"})

with pd.option_context('display.max_colwidth', 25):
    print(df)
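
The loop above can be tried without the CSV file. The sketch below feeds it an in-memory sample through io.StringIO and leaves pandas out so that only the collection step is visible. The sample's layout (a header line starting with ", followed by the text on its own line) is an assumption about how datos_titulos.csv is structured, and the text is shortened.

```python
import io
import re
from collections import defaultdict

regex_list = [
    r"TITULO: (.*) SUBTITULO",  # title
    r"={2,}(.*?)={2,}",         # head
    r"^(?!\")(.+)",             # text: any line that does not start with "
]

# Hypothetical stand-in for datos_titulos.csv (assumed layout, shortened text).
sample = io.StringIO(
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n'
    'Because forests generally have a low albedo\n'
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities==\n'
    'Human activities change the albedo of various areas\n'
)

columns = defaultdict(list)
for line in sample:
    for i, regx in enumerate(regex_list):
        r = re.findall(regx, line)
        if r:
            # Keep only the first match of each pattern on this line.
            columns[i].append(r[0])

print(dict(columns))
```

Each key holds one list per column. pd.DataFrame needs those lists to be the same length, so a record that is missing a title, head, or text line would have to be handled before building the frame.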

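Details

"TITULO: (.*) SUBTITULO"
- (.*) - capturing group: zero or more of any character except a newline
={2,}(.*?)={2,}
- ={2,} - two or more = characters
- (.*?) - capturing group: zero or more of any character except a newline, as few as possible
- ={2,} - two or more = characters
^(?!\")(.+)
- ^ - start of the line
- (?! - negative lookahead (asserts that the following regex does not match)
  - \" - matches the character " literally
- ) - closes the negative lookahead
- (.+) - capturing group: one or more of any character except a newline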
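
Two of the points above are easy to check in isolation: what "as few as possible" changes, and what the negative lookahead rejects. This is a minimal sketch; the sample strings are hypothetical and are not taken from the CSV file.

```python
import re

# Hypothetical line with two "=="-delimited spans, to contrast lazy and greedy matching.
line = "===Trees=== see ==also=="

lazy = re.findall(r"={2,}(.*?)={2,}", line)
greedy = re.findall(r"={2,}(.*)={2,}", line)

print(lazy)    # ['Trees', 'also']
print(greedy)  # ['Trees=== see ==also']

# The negative lookahead rejects lines that start with a double quote.
regx_text = r"^(?!\")(.+)"
quoted = re.findall(regx_text, '"TITULO: Albedo SUBTITULO')
plain = re.findall(regx_text, "Because forests generally have a low albedo")

print(quoted)  # []
print(plain)   # ['Because forests generally have a low albedo']
```

The lazy form stops at the first closing run of =, which is why it returns just the heading even when more = characters appear later in the line.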

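Note that the last regex captures any line that does not start with ", so make sure the desired text field is the only thing that begins a line with something other than ". Your sample text satisfies this, as the output shown confirms, but if for any reason the text field is on the same line as the title and head, you can use the following regex instead: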

TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)

With this code:

import pandas as pd
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

rows = []
with open("datos_titulos_inline.csv", "r") as f:
    for line in f:
        r = re.findall(regx, line)
        # Expect three matches per line, in order: title, head, text.
        if len(r) >= 3:
            rows.append([r[0][0], r[1][1], r[2][2]])

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
df1 = pd.DataFrame(rows, columns=["Title", "Head", "TEXT"])

with pd.option_context('display.max_colwidth', 25):
    print(df1)

For the inline text field, see the regex demo.

Output DataFrame:

             Title                      Head                      TEXT
0           Albedo                     Trees   Because forests gene...
1           Albedo          Human activities   Human activities (e....
2  Abraham Lincoln  U.S. House of Represe...   [[File:Abraham Linco...
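
The inline version indexes the findall result by match and by group, which is easier to follow with the raw result in view. The sketch below runs the combined regex on one hypothetical inline record, shortened from the question's data; the names matches, title, head, and text are illustrative only.

```python
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

# Hypothetical inline record, shortened from the question's data.
line = '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees=== Because forests generally have a low albedo'

# With several groups, findall returns one tuple per match with one slot per group;
# groups belonging to alternatives that did not take part are empty strings.
matches = re.findall(regx, line)
for m in matches:
    print(m)
# ('Albedo', '', '')
# ('', 'Trees', '')
# ('', '', ' Because forests generally have a low albedo')

# Match i fills group i, hence the diagonal indexing; strip() drops the leading space.
title, head, text = matches[0][0], matches[1][1], matches[2][2].strip()
print(title, head, text, sep=" | ")
```

Because the third alternative relies on a lookbehind for ==, it only starts matching directly after the closing = run of the heading, and (.+) then takes the rest of the line.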
