如何使用Python以结构化格式将半结构化文本文件加载到数据帧中?

2024-06-11 10:03:09 发布

您现在位置:Python中文网/ 问答频道 /正文

我有多个从NoSQL数据库中提取的txt文件。示例半结构化文件如下所示:

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.3.1, rUnknown, Wed Dec 13 22:58:54 UTC 2017

scan 'abcd.polardim', {TIMERANGE => [0, 1583020799000]}
ROW  COLUMN+CELL     
 abcd.polardim|1511175034223 column=i:SJ - #3, timestamp=1511175034224, value=9
 abcd.polardim|1511175034223 column=i:SJ - #4, timestamp=1511175034224, value=1
 abcd.polardim|1511175034223 column=i:SJ Best, timestamp=1511175034224, value=15
 abcd.polardim|1511175034223 column=i:TestMoment, timestamp=1511175034224, value=jan-17
 row|1518803776714 column=i:Emulate, timestamp=1518803776720, value=fa283e60-db7e-4888-80f8-2688b36c1234
 row|1518803776714 column=i:CSF - #1, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF - #2, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF - #3, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF - #4, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF Best, timestamp=1518803776720, value=0
 row|1518803776714 column=i:Categ, timestamp=1518803776720, value=M
 row|1518803776714 column=i:Cy, timestamp=1518803776720, value=192
 row|1518803776714 column=i:Comments, timestamp=1518803776720, value=0
 row|1518803776714 column=i:Date, timestamp=1518803776720, value=17-2-2009

我想把它加载到一个数据框中,在=后面的文本作为一个值加载到一个字段中。 示例输出如下所示:

column      timestamp       value
SJ - #3     1511175034224   9
SJ - #4     1511175034224   1
SJ Best     1511175034224   15
TestMoment  1511175034224   jan-17
Emulate     1518803776720   fa283e60-db7e-4888-80f8-2688b36c1234
CSF - #1    1518803776720   0

如何在python中实现这一点


Tags: 文件示例returnvaluecolumnshelltimestamprow
1条回答
网友
1楼 · 发布于 2024-06-11 10:03:09

您可以为此使用re模块re.finditer这里将返回一个迭代器,该迭代器在字符串中的RE模式的所有非重叠匹配上生成匹配对象

import re
import pandas as pd

source = """
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.3.1, rUnknown, Wed Dec 13 22:58:54 UTC 2017

scan 'abcd.polardim', {TIMERANGE => [0, 1583020799000]}
ROW  COLUMN+CELL     
 abcd.polardim|1511175034223 column=i:SJ - #3, timestamp=1511175034224, value=9
 abcd.polardim|1511175034223 column=i:SJ - #4, timestamp=1511175034224, value=1
 abcd.polardim|1511175034223 column=i:SJ Best, timestamp=1511175034224, value=15
 abcd.polardim|1511175034223 column=i:TestMoment, timestamp=1511175034224, value=jan-17
 row|1518803776714 column=i:Emulate, timestamp=1518803776720, value=fa283e60-db7e-4888-80f8-2688b36c1234
 row|1518803776714 column=i:CSF - #1, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF - #2, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF - #3, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF - #4, timestamp=1518803776720, value=0
 row|1518803776714 column=i:CSF Best, timestamp=1518803776720, value=0
 row|1518803776714 column=i:Categ, timestamp=1518803776720, value=M
 row|1518803776714 column=i:Cy, timestamp=1518803776720, value=192
 row|1518803776714 column=i:Comments, timestamp=1518803776720, value=0
 row|1518803776714 column=i:Date, timestamp=1518803776720, value=17-2-2009
 """

new_dict = {"column": [], "timestamp": [], "value": []}

for m in re.finditer(r"column=(.*?),\s*timestamp=(.*?),\s*value=(.*)", source):
    new_dict["column"].append(m.group(1))
    new_dict["timestamp"].append(m.group(2))
    new_dict["value"].append(m.group(3))
df = pd.DataFrame(new_dict)

另一个选项是从source并使用extract方法创建一个示例数据帧

>>> import pandas as pd
>>> 
>>> source = """
... HBase Shell; enter 'help<RETURN>' for list of supported commands.
... Type "exit<RETURN>" to leave the HBase Shell
... Version 1.3.1, rUnknown, Wed Dec 13 22:58:54 UTC 2017
... 
... scan 'abcd.polardim', {TIMERANGE => [0, 1583020799000]}
... ROW  COLUMN+CELL     
...  abcd.polardim|1511175034223 column=i:SJ - #3, timestamp=1511175034224, value=9
...  abcd.polardim|1511175034223 column=i:SJ - #4, timestamp=1511175034224, value=1
...  abcd.polardim|1511175034223 column=i:SJ Best, timestamp=1511175034224, value=15
...  abcd.polardim|1511175034223 column=i:TestMoment, timestamp=1511175034224, value=jan-17
...  row|1518803776714 column=i:Emulate, timestamp=1518803776720, value=fa283e60-db7e-4888-80f8-2688b36c1234
...  row|1518803776714 column=i:CSF - #1, timestamp=1518803776720, value=0
...  row|1518803776714 column=i:CSF - #2, timestamp=1518803776720, value=0
...  row|1518803776714 column=i:CSF - #3, timestamp=1518803776720, value=0
...  row|1518803776714 column=i:CSF - #4, timestamp=1518803776720, value=0
...  row|1518803776714 column=i:CSF Best, timestamp=1518803776720, value=0
...  row|1518803776714 column=i:Categ, timestamp=1518803776720, value=M
...  row|1518803776714 column=i:Cy, timestamp=1518803776720, value=192
...  row|1518803776714 column=i:Comments, timestamp=1518803776720, value=0
...  row|1518803776714 column=i:Date, timestamp=1518803776720, value=17-2-2009
... """
>>> df = pd.DataFrame(source.splitlines())
>>> new_df = (
...     df[0]
...     .str.extract(
...         r"column=(?P<column>.*?),\s*timestamp=(?P<timestamp>.*?),\s*value=(?P<value>.*)"
...     )
...     .dropna(how="all")
...     .reset_index(drop=True)
... 
... 
... )
>>> 
>>> new_df
          column      timestamp                                 value
0      i:SJ - #3  1511175034224                                     9
1      i:SJ - #4  1511175034224                                     1
2      i:SJ Best  1511175034224                                    15
3   i:TestMoment  1511175034224                                jan-17
4      i:Emulate  1518803776720  fa283e60-db7e-4888-80f8-2688b36c1234
5     i:CSF - #1  1518803776720                                     0
6     i:CSF - #2  1518803776720                                     0
7     i:CSF - #3  1518803776720                                     0
8     i:CSF - #4  1518803776720                                     0
9     i:CSF Best  1518803776720                                     0
10       i:Categ  1518803776720                                     M
11          i:Cy  1518803776720                                   192
12    i:Comments  1518803776720                                     0
13        i:Date  1518803776720                             17-2-2009

相关问题 更多 >