Python regular expressions to extract data from a file with varying structure

Detected 3 gas in sample. Composition :\r\n Very low Helium (1.5% total)\r\n Medium Oxygen (20% total)\r\n Low Nitrogen (6.5% total)\r\n Detected 0 gas in sample. Composition :\r\n Detected 2 gas in sample. Composition :\r\n Low Carbon monoxide (5% total)\r\n Very high Helium (80% total)\r\n Traces of Oxygen\r\n Detected 1 gas in sample. Composition :\r\n Medium Nitrogen (18.5% total)\r\n Traces of Helium, Argon\r\n

2 Answers

User

Answer 1 · edited 2024-05-16 07:24:17

An approach using regular expressions, with no static list of gases or observation texts

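Parsing the text is simpler if every line is handled on its own. There are two line structures:
1. The sample header, from which the number of detected gases is extracted
2. The sample detail, which contains the gas name, the gas text, and the percentage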
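As a rough illustration of these two line structures, here is a minimal sketch that uses only the standard `re` module. The names `HEADER`, `DETAIL`, `TRACES`, and `classify` are hypothetical and not part of the answer, and the patterns assume the exact wording shown in the question's sample (for example, that every percentage is followed by `% total`).

```python
import re

# Hypothetical patterns, assuming the wording of the question's sample text.
HEADER = re.compile(r"Detected (?P<detected>\d+) gas")
DETAIL = re.compile(
    r"(?P<txt>.+?) (?P<gas>[A-Z][a-z]+(?: [a-z]+)*) \((?P<pct>\d+(?:\.\d+)?)% total\)"
)
TRACES = re.compile(r"Traces of (?P<gases>.+)")


def classify(line):
    """Return ('header', count), ('detail', [(gas, txt, pct), ...]) or None."""
    line = line.strip()
    if not line:
        return None
    m = HEADER.match(line)
    if m:
        return ("header", int(m.group("detected")))
    m = DETAIL.fullmatch(line)
    if m:
        return ("detail", [(m.group("gas"), m.group("txt"), float(m.group("pct")))])
    m = TRACES.fullmatch(line)
    if m:
        # A traces line may list several comma-separated gases and has no percentage.
        return ("detail", [(g.strip(), "Traces", None) for g in m.group("gases").split(",")])
    return None


sample = (
    "Detected 2 gas in sample. Composition :\r\n"
    " Low Carbon monoxide (5% total)\r\n"
    " Very high Helium (80% total)\r\n"
    " Traces of Oxygen\r\n"
)
for line in sample.split("\r\n"):
    print(classify(line))
```

The pandas code in this answer follows the same idea, but applies a single pattern to a whole column of lines instead of classifying them one at a time.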

import re

import numpy as np
import pandas as pd

text = """Detected 3 gas in sample. Composition :\r\n Very low Helium (1.5% total)\r\n Medium Oxygen (20% total)\r\n Low Nitrogen (6.5% total)\r\n
Detected 0 gas in sample. Composition :\r\n
Detected 2 gas in sample. Composition :\r\n Low Carbon monoxide (5% total)\r\n Very high Helium (80% total)\r\n Traces of Oxygen\r\n
Detected 1 gas in sample. Composition :\r\n Medium Nitrogen (18.5% total)\r\n Traces of Helium, Argon\r\n"""

# keep all lines separate,  it's simpler to parse...
df = pd.DataFrame(re.split("\r\n\n?", text), columns=["text"]).replace("",np.nan).dropna()

# extract number of samples and assign a sample#
df = df.assign(main=df.text.str.contains("Detected"),
          sample=lambda dfa: dfa.main.cumsum(),
          detected=lambda dfa: np.where(dfa.main, dfa.text.str.extract(r'([0-9])', expand=False), np.nan),
         ).ffill()

# extract the gas, gas text, gas %age from each of the samples
# where gases are comma-separated generate list and explode()
df2 = (df.join(df.text.str.extract(r'(?P<txt>[VMLT][a-z ]*)(?P<gas>[A-Za-z ,]*)\(?(?P<pct>\d*\.?\d*)'))
       .assign(gas=lambda dfa: dfa.gas.str.strip().str.split(", "))
       .explode("gas")
      ).rename(columns={"pct":"%"})


# reshape structure of samples and name columns
df2 = df2.loc[~df2.main, ["sample","gas","txt","%"]].set_index(["sample","gas"]).unstack(1)
df2.columns= [f"{tup[1]} ({tup[0]})" for tup in df2.columns]

# finally pull it all together
df.loc[df.main, ["sample","detected"]].merge(df2, on="sample", how="left").replace(np.nan, "")

Output

[output table not preserved in the source]

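User
Answer 2 · edited 2024-05-16 07:24:17

Here is a non-regex solution (it relies on the line breaks in the string being stored in the file as literal characters; see Armanli's comment). A regular expression is not needed because the strings share a similar structure. This solution loops over the lines in the file, splits each one on \\r\\n, and extracts Detected, Traces, or any gas from a list. It stores the values in a list of dicts that can be loaded into pandas: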
import numpy as np
import pandas as pd

gasses = ['Helium', 'Oxygen', 'Nitrogen', 'Carbon monoxide', 'Argon']

def get_data(gas, line):
    return [line.split(f' {gas} (')[0].strip(), float(line.split(f' {gas} (')[1].split('%')[0])]

all_data = []
with open("filename.txt", "r") as f:
    d = [i.split('\\r\\n') for i in f.readlines()]
    for i in d:
        tmp_dict = {}
        for z in i[:-1]:
            if 'Detected' in z:
                tmp_dict['Detected'] = int(z.split(" ")[1])
            elif 'Traces' in z:
                tr = z[10:].split(', ')
                for t in tr:
                    tmp_dict[f'{t.strip()} (txt)'] = 'Traces'
            else:
                gas = [ele for ele in gasses if ele in z][0]
                r = get_data(gas, z)
                tmp_dict[f'{gas} (txt)'] = r[0]
                tmp_dict[f'{gas} (%)'] = r[1]
        all_data.append(tmp_dict)
df = pd.DataFrame(all_data)
Output:
[output table not preserved in the source]
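The caveat about line breaks stored as literal characters can be seen in a small sketch. The two strings below are hypothetical examples of how the same record might appear in a file, and `split_record` is an illustrative helper that is not part of the answer. Splitting on `'\\r\\n'` only works when the file contains the four characters backslash, r, backslash, n.

```python
# Assumption: two hypothetical ways the same record could be stored.
literal = r"Detected 1 gas in sample. Composition :\r\n Medium Nitrogen (18.5% total)\r\n"  # backslash characters
actual = "Detected 1 gas in sample. Composition :\r\n Medium Nitrogen (18.5% total)\r\n"  # real CR LF bytes


def split_record(record):
    """Split on literal backslash sequences first, then on real line breaks."""
    parts = []
    for chunk in record.split("\\r\\n"):
        parts.extend(chunk.splitlines())
    return [p.strip() for p in parts if p.strip()]


print(literal.split("\\r\\n"))  # splits into the individual pieces
print(actual.split("\\r\\n"))   # stays as a single element
print(split_record(literal) == split_record(actual))
```

A helper along these lines would let the same loop handle either storage format, at the cost of one extra pass over each record.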
