Retrieving a substring from text in a DataFrame

Posted 2024-06-16 13:04:49


I have a .csv file with the following contents:

[image: screenshot of the .csv file contents]

I want to generate a .csv with a column that shows giraffe<4-digit number> whenever that pattern is present in the "text" column.

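So far I have written the code below. It does not compute the slice start and end indices dynamically, that is, separately for each row (for Giraffe_numbers).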

import pandas as pd
file_path = 'test.csv'
data = pd.read_csv(file_path)
sub = "giraffe"
# column to identify if Giraffe is present
data['Giraffe_Present'] = data['text'].str.contains(sub)
# column to identify index of Giraffe in text
data["Giraffe_Index"] = data['text'].str.find(sub)
# column to identify starting position for slice
data['Giraffe_start'] = data['Giraffe_Index'].apply(lambda row: row)
# column to identify ending position for slice
data['Giraffe_end'] = data['Giraffe_Index'].apply(lambda row: row+11)
# column to store sliced Giraffe number from text
data['Giraffe_numbers'] = data['text'].apply(lambda row: row[data['Giraffe_Index'].apply(lambda row: row).max():data['Giraffe_Index'].apply(lambda row: row+11).max()])
print(data)

Here is the output. The results are off for #2, #4 and #5 because Giraffe_numbers uses the start and end indices that correspond to #1.

[image: screenshot of the printed output]


2 Answers

Instead of using multiple steps, why not do it all in one go:

data['Giraffe_numbers'] = data.apply(
    lambda row: row['text'][
        row['text'].find('giraffe') : row['text'].find('giraffe') + 11
    ]
    # str.find returns -1 when there is no match; 0 is a valid position
    if row['text'].find('giraffe') >= 0
    else '',
    axis=1
)
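As a possible alternative to slicing by position, the same column could probably be produced with a regular expression. The sketch below assumes the pattern is always the literal word giraffe followed by exactly four digits, and it uses a small made-up DataFrame in place of test.csv.

```python
import pandas as pd

# Sample rows standing in for the contents of test.csv (an assumption)
data = pd.DataFrame({"text": [
    "myname giraffe0086",
    "cat whale4321",
    "giraffe9064",
    "nothing to catch",
]})

# str.extract returns the first capture group for each row,
# and NaN for rows where the pattern does not occur
data["Giraffe_numbers"] = data["text"].str.extract(r"(giraffe\d{4})", expand=False)

print(data)
```

Rows without a match hold NaN here rather than an empty string; chaining .fillna('') would give the same shape as the apply version. The result can then be written out with data.to_csv, using whatever output file name is appropriate.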

I know this is not quite what you expected, but it might be interesting.

Input data:

>>> df
                     text
0      myname giraffe0086
1           cat whale4321
2             giraffe9064
3     poultry dolphin4356
4  fifty giraffe2345 nine
5      giraffe3434 catnap
6        nothing to catch

Find the animal and the number in each string:

import re

# https://docs.python.org/3/library/re.html#index-15
PAT = re.compile(r'(?P<animal>\w+)(?=(?P<number>\d{4}))')

sre = df['text'].apply(PAT.search)
>>> sre
0    <re.Match object; span=(7, 14), match='giraffe'>
1       <re.Match object; span=(4, 9), match='whale'>
2     <re.Match object; span=(0, 7), match='giraffe'>
3    <re.Match object; span=(8, 15), match='dolphin'>
4    <re.Match object; span=(6, 13), match='giraffe'>
5     <re.Match object; span=(0, 7), match='giraffe'>
6                                                None
Name: text, dtype: object
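To see why each match span covers only the animal name, here is a minimal sketch on a single string taken from the input data above. The lookahead (?=...) is zero-width, so the digits are captured in the number group but are not consumed by the match.

```python
import re

PAT = re.compile(r'(?P<animal>\w+)(?=(?P<number>\d{4}))')

m = PAT.search("myname giraffe0086")

# The overall match stops where the lookahead begins
print(m.group('animal'), m.span())

# The group captured inside the lookahead is still available,
# with its own start and end positions
print(m.group('number'), m.span('number'))
```

So m.end() is the end of the animal name, and the position just past the digits is m.end('number').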

Build a DataFrame with the animal, start, end and number columns:

# The lookahead is zero-width, so r.start() and r.end() delimit the animal name only
extract_data = lambda r: (r.group('animal'), r.start(), r.end(), r.group('number'))

df1 = sre[sre.notnull()].apply(extract_data).apply(pd.Series) \
                        .rename(columns={0: 'animal', 1: 'start', 2: 'end', 3: 'number'})

Merge df and df1:

df = pd.concat([df, df1], axis="columns")
>>> df
                     text   animal  start   end number
0      myname giraffe0086  giraffe    7.0  14.0   0086
1           cat whale4321    whale    4.0   9.0   4321
2             giraffe9064  giraffe    0.0   7.0   9064
3     poultry dolphin4356  dolphin    8.0  15.0   4356
4  fifty giraffe2345 nine  giraffe    6.0  13.0   2345
5      giraffe3434 catnap  giraffe    0.0   7.0   3434
6        nothing to catch      NaN    NaN   NaN    NaN
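If the start and end positions are not needed, a shorter route might be Series.str.extract with named groups, which returns one column per group. This is only a sketch: it swaps in a different pattern, assuming animal names consist of ASCII letters only, so the digits cannot be swallowed by the name.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "myname giraffe0086",
    "cat whale4321",
    "giraffe9064",
    "poultry dolphin4356",
    "fifty giraffe2345 nine",
    "giraffe3434 catnap",
    "nothing to catch",
]})

# Named groups become the column names of the extracted DataFrame;
# rows without a match are filled with NaN
parts = df["text"].str.extract(r"(?P<animal>[A-Za-z]+)(?P<number>\d{4})")

# Both frames share the same index, so join aligns them row by row
result = df.join(parts)
print(result)
```

The number column stays a string here, which preserves leading zeros such as 0086.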
