How to create an array in Python that searches a string for specific tags and returns the matches

Posted 2024-04-24 00:13:48


I currently have the following Python code:

def get(x):
    up, up1, up2, up3, up4 = "" ,"" ,"","" , ""


    x = x.split(", ")
    for i in x:
        if "Up_" in i:
            # print(i)
            up = str(i) + ', '
        if "Up1_" in i:
            # print(i)
            up1 = str(i) + ', '
        if "Up2_" in i:
            # print(i)
            up2 = str(i) + ', '
        if "Up3_" in i:
            # print(i)
            up3 = str(i) + ', '
        if "Up4_" in i:
            # print(i)
            up4 = str(i) + ', '

    return (str(up) + str(up1) + str(up2) + str(up3) + str(up4))[:-2]

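Although this function works fine for what I have at the moment, it will stop working if any of the tags to be added are numbered 5 through 10.

What I want to do is create a function that searches the "tags" column for any tag containing "Up_" or "Up*_" (in SQL terms, the * would match any value between Up and _; I am not sure whether Python has a feature for this), then puts whatever it finds into another array that contains only the Up_ and Up*_ tags, and then applies that to another column.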

+------------+-------+------------+-----------+--------------+
| product_id |  sku  | total_sold |   tags    | total_images |
+------------+-------+------------+-----------+--------------+
| geggre     | rgerg |        456 | Up1_, Up2 |            5 |
+------------+-------+------------+-----------+--------------+

I would like it to look like this:

+------------+-------+------------+-----------+--------------+-------+
| product_id |  sku  | total_sold |   tags    | total_images | Count |
+------------+-------+------------+-----------+--------------+-------+
| ggeggre    | rgerg |        456 | Up1_, Up2 |            5 |     2 |
+------------+-------+------------+-----------+--------------+-------+

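Thanks to another user, I already have the tag count:

data["total_tags"] = data["tags"].apply(lambda x: len(x.split(',')))

I just need to know how to create the array described above, so that the if statements are simplified and it can handle up to 10 tags.

Also, here is my Python that uses get and rewrites the "tags" column so that it only includes the Up tags: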

data['tags'] = data['tags'].apply(get)

Full script for context:


# importing the pandas module with an alias of pd
import pandas as pd


# get takes the comma-separated tags string x and keeps only the Up_, Up1_ ... Up4_ tags
def get(x):
    up, up1, up2, up3, up4 = "" ,"" ,"","" , ""


    x = x.split(", ")
    for i in x:
        if "Up_" in i:
            # print(i)
            up = str(i) + ', '
        if "Up1_" in i:
            # print(i)
            up1 = str(i) + ', '
        if "Up2_" in i:
            # print(i)
            up2 = str(i) + ', '
        if "Up3_" in i:
            # print(i)
            up3 = str(i) + ', '
        if "Up4_" in i:
            # print(i)
            up4 = str(i) + ', '
    # returns the matched tags joined into one string, with the trailing ", " (2 characters) removed
    return (str(up) + str(up1) + str(up2) + str(up3) + str(up4))[:-2]
# data contains the content of the data200.csv file using pandas read_csv function
data = pd.read_csv('data200.csv')

#defines the tags column to equal what up_ tags are in the tags column using the get function
data['tags'] = data['tags'].apply(get)

# drops the rows whose tags column is empty after filtering
data = data[data['tags'] != ""]

#creates a new column called total_tags and returns a count of how many elements are between commas
data["total_tags"] = data["tags"].apply(lambda x : len(x.split(',')))

# prints first 5 lines of csv
print(data.head())
# exports everything to test.csv and removes the index column
data.to_csv("test.csv", index = False)

1 Answer
User
#1 · Posted 2024-04-24 00:13:48

You can use a regular expression:

import re

def get(x):
    x = x.split(", ")
    out_str = ''
    for tag in x:
        # raw strings keep \d from being treated as an invalid string escape
        if re.search(r"^Up\d*_", tag):
            t = re.match(r"^Up\d*_", tag)
            t = t.group(0)
            out_str += t + ','
    return out_str[:-1]
print(get("Up1_, AS3_, Up2_, Up_, AS_"))

Output:

Up1_,Up2_,Up_

Is this what you are looking for? If you want at most a single digit (0-9) in the tag, change the * in the regex to ?:

if re.search(r"^Up\d?_", tag):
     t = re.match(r"^Up\d?_", tag)
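To make the difference between the two quantifiers concrete, here is a minimal sketch. The helper name `keep_up_tags` and the sample tag string are made up for illustration and do not come from the question.

```python
import re

def keep_up_tags(tags, pattern):
    """Return the comma-joined matches of `pattern` found at the start of each tag.

    `keep_up_tags` is a hypothetical helper, not part of the original script.
    """
    kept = []
    for tag in tags.split(", "):
        m = re.match(pattern, tag)  # re.match only matches at the start of the tag
        if m:
            kept.append(m.group(0))
    return ",".join(kept)

sample = "Up_, Up7_, Up10_, AS3_"  # made-up sample data

any_digits = keep_up_tags(sample, r"Up\d*_")  # zero or more digits
one_digit = keep_up_tags(sample, r"Up\d?_")   # zero or one digit

print(any_digits)  # Up_,Up7_,Up10_
print(one_digit)   # Up_,Up7_
```

Since the question asks for tags numbered up to 10, the `\d*` form is probably the one that fits, because `\d?` rejects the two-digit `Up10_`.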

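Edit:

After your edit I understand better what you mean. You can simply do:

data['tags'] = data['tags'].apply(lambda x : ",".join(re.findall(r"Up\d*_", x)))

Or:

data['tags'] = data['tags'].apply(lambda x : ",".join(re.findall(r"Up\d?_", x)))

Which one depends on whether you want at most one digit between Up and _, or allow any number of digits. Note that the ^ is dropped in the findall() call, because we are no longer searching only from the start of the string but for all occurrences throughout the whole string.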
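One side effect of dropping the `^` is that the pattern can also match in the middle of a longer tag. The sketch below shows this and one possible guard, a `\b` word boundary. The function names and the sample row (including the `SetUp2_` tag) are invented for illustration, and whether such tags occur in the real data is an assumption.

```python
import re

def extract_up_tags(tags, pattern=r"\bUp\d*_"):
    """Join every match of `pattern` in the tags string (hypothetical helper)."""
    return ",".join(re.findall(pattern, tags))

def count_up_tags(tags, pattern=r"\bUp\d*_"):
    """Count the matches directly; returns 0 when nothing matches."""
    return len(re.findall(pattern, tags))

row = "Up1_, AS3_, SetUp2_, Up10_"  # made-up sample row

unanchored = extract_up_tags(row, r"Up\d*_")  # also picks up the Up2_ inside SetUp2_
bounded = extract_up_tags(row)                # \b requires Up to start a word

print(unanchored)          # Up1_,Up2_,Up10_
print(bounded)             # Up1_,Up10_
print(count_up_tags(row))  # 2
```

Either helper can be passed to the column in the same way as before, for example `data['tags'].apply(extract_up_tags)`. Counting with `findall` also avoids the case where `len("".split(','))` returns 1 for an empty string.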

Edit 2:

OK, to sum up the comments and the additional information gathered from them, you probably want something like this:

data['tags'] = data['tags'].apply(lambda x : ",".join(re.findall(r"[Uu]p\d?_\S*(?=,)", x)))
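The comments that led to this pattern are not shown here, so the following is only a sketch of how it behaves. The lookahead `(?=,)` means a tag is kept only when a comma follows it, so the last tag in a string is not matched. If that is not intended, a split-based filter is one alternative. The function names and the sample row are made up for illustration.

```python
import re

EDIT2_PATTERN = r"[Uu]p\d?_\S*(?=,)"  # the pattern from Edit 2, as a raw string

def with_lookahead(tags):
    """Apply the Edit 2 pattern as-is (hypothetical wrapper)."""
    return ",".join(re.findall(EDIT2_PATTERN, tags))

def with_split(tags):
    """Alternative sketch: split on commas and keep whole tags that start
    with Up/up, an optional digit, and an underscore."""
    parts = [t.strip() for t in tags.split(",")]
    return ",".join(t for t in parts if re.match(r"[Uu]p\d?_", t))

row = "Up1_red, AS3_x, up2_blue, Up3_green"  # made-up sample row

print(with_lookahead(row))  # Up1_red,up2_blue
print(with_split(row))      # Up1_red,up2_blue,Up3_green
```

Both versions keep the text after the underscore, which the earlier `Up\d*_` patterns discard. Neither accepts a two-digit tag such as `Up10_`; using `\d*` in place of `\d?` would allow it.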
