How can I use a regex in Python to capture the last part of a record, starting after the final `==` or `===`?

Posted 2024-04-23 20:33:26


I have the following text:

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])
"

"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities==
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around 
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===U.S. House of Representatives, 1847–1849===
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle 

I want to build a DataFrame from it in Python using regular expressions:

Title                  Head                          TEXT
Albedo                 Trees                         Because forests generally have a low  ...([[photosynthesis]])
Albedo                 Human activities              Human activities (e.g., de...areas around 
Abraham Lincoln        U.S. House of..1849           [[File:Abraham Lincoln by... line Whig,

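I have the code below. The first and second columns work, but I don't know how to get the third one: how do I capture everything from the last `==` or `===` onward? That is:

Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])

Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around

[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle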

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

with open("datos_titulos.csv", "r") as f:
    for line in f:


        pat = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)==="
        pat2 = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==$|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===$"

        if re.search(pat, line) :

            pandas_dict["title"].append(re.search(pat, line).group(1))
            pandas_dict["head"].append(re.search(pat, line).group(2))
        if re.search(pat2, line) :

            pandas_dict["text"].append(re.search(pat2, line).group(2))
df = pd.DataFrame(pandas_dict) 

1 answer
User
Answer 1 · posted 2024-04-23 20:33:26
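import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

# One pattern per output column, in column order
regx_title = r"TITULO: (.*) SUBTITULO"  # title between "TITULO: " and " SUBTITULO"
regx_head = r"={2,}(.*?)={2,}"          # heading between == ... == or === ... ===
regx_text = r"^(?!\")(.+)"              # any line that does not start with "

regex_list = [regx_title, regx_head, regx_text]

with open("datos_titulos.csv", "r") as f:
    for line in f:
        for i, regx in enumerate(regex_list):
            r = re.findall(regx, line)
            if r:
                # Keep the first match; the list index is used as the column key
                pandas_dict[i].append(r[0])

df = pd.DataFrame(pandas_dict)
df = df.rename(columns={0:"Title", 1:"Head", 2:"TEXT"})

# Truncate wide columns when printing
with pd.option_context('display.max_colwidth', 25):
    print(df)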
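To see what the loop collects without needing the CSV file or pandas, here is a minimal sketch that runs the same three patterns over an in-memory sample. The `sample` text is a shortened, hypothetical stand-in for `datos_titulos.csv`, and the name `columns` is only illustrative.

```python
import io
import re
from collections import defaultdict

# Hypothetical stand-in for datos_titulos.csv: two records, text shortened.
sample = io.StringIO(
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n'
    'Because forests generally have a low albedo\n'
    '"\n'
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities==\n'
    'Human activities (e.g., deforestation) change the albedo\n'
    '"\n'
)

regex_list = [
    r"TITULO: (.*) SUBTITULO",  # title
    r"={2,}(.*?)={2,}",         # heading between == or ===
    r"^(?!\")(.+)",             # any line that does not start with "
]

columns = defaultdict(list)
for line in sample:
    for i, regx in enumerate(regex_list):
        found = re.findall(regx, line)
        if found:
            columns[i].append(found[0])

for key, values in columns.items():
    print(key, values)
```

Each list ends up with one entry per record here. If a record in the real file had no text line, the lists would differ in length and `pd.DataFrame` would refuse to build the frame, so it is worth checking the lengths first.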

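Details

  • "TITULO: (.*) SUBTITULO"

    • (.*) - capturing group: any 0 or more characters other than a newline (as many as possible)
  • ={2,}(.*?)={2,}

    • ={2,} - two or more = characters
    • (.*?) - capturing group: any 0 or more characters other than a newline (as few as possible)
    • ={2,} - two or more = characters
  • ^(?!\")(.+)

    • ^ - start of the line
    • (?! - negative lookahead (asserts that the pattern that follows does not match)
      • \" - matches the character " literally
    • ) - closes the negative lookahead
    • (.+) - capturing group: any 1 or more characters other than a newline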
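The three patterns described above can be tried one at a time. This is a small sketch using one header line and one text line adapted from the question; the variable names are only illustrative.

```python
import re

header = '"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities=='
body = "Human activities (e.g., deforestation) change the albedo"

# Greedy (.*) between the two fixed words gives the title.
title = re.search(r"TITULO: (.*) SUBTITULO", header).group(1)

# ={2,} accepts both == and ===; the lazy (.*?) stops at the closing marker.
head = re.search(r"={2,}(.*?)={2,}", header).group(1)

# The negative lookahead rejects a line that starts with a double quote ...
quoted = re.search(r"^(?!\")(.+)", header)
# ... and accepts one that does not.
plain = re.search(r"^(?!\")(.+)", body)

print(title)
print(head)
print(quoted)
print(plain.group(1))
```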

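Note that the last regex captures any line that does not start with ", so make sure the text field you want is the only thing, other than ", that begins a line. Judging by your sample text and the output shown, that is the case. If for any reason the text field sits on the same line as the header, you can use the following regex instead: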

TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)

with this code:

import pandas as pd
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

rows = []
with open("datos_titulos_inline.csv", "r") as f:
    for line in f:
        # One tuple per match; only the group of the alternative that matched is non-empty
        r = re.findall(regx, line)
        # Skip lines that do not yield all three parts (title, heading, text)
        if len(r) >= 3:
            rows.append([r[0][0], r[1][1], r[2][2]])

# Collect the rows in a list and build the DataFrame once at the end
df1 = pd.DataFrame(rows, columns=["Title", "Head", "TEXT"])

with pd.option_context('display.max_colwidth', 25):
    print(df1)

For the inline text field, see the regex demo.

Output DataFrame:

             Title                      Head                      TEXT
0           Albedo                     Trees   Because forests gene...
1           Albedo          Human activities   Human activities (e....
2  Abraham Lincoln  U.S. House of Represe...   [[File:Abraham Linco...
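To see why the inline version indexes the matches as `[0][0]`, `[1][1]` and `[2][2]`, here is a short sketch of what `re.findall` returns for the combined pattern. The `line` below is a hypothetical inline record (header and text on one line), not taken from the actual file.

```python
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

# Hypothetical inline record: heading and text share one line.
line = '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees=== Because forests generally have a low albedo'

# With three groups joined by |, findall returns one 3-tuple per match,
# and only the group belonging to the alternative that matched is filled.
matches = re.findall(regx, line)
for m in matches:
    print(m)

title = matches[0][0]
head = matches[1][1]
# strip() drops the space that follows the closing === in this sample.
text = matches[2][2].strip()
print(title, "|", head, "|", text)
```

The third alternative relies on the lookbehind `(?<===)`, so it only starts matching right after two `=` characters, which is where the heading match ends.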
