寻找有序多类情形的分界点

2024-04-26 00:02:27 发布

您现在位置:Python中文网/ 问答频道 /正文

现在我有一个表数据,有50000行(例如表示不同的用户)和两列:一个是类“a”到“E”的标签,另一个是小数点为0.0到14000000.00的分数。 我的任务是根据第二列的分数,给用户添加一个新标签,从“a”到“E”。你知道吗

在这里,我希望尽可能减少旧标签(第一列)到新标签的更改。你知道吗

类“A”-“E”被认为是有序的,因此我首先为每个用户的更改分配“loss”:比如A->;B是loss=1.0,A->;C是loss=2.0,等等,然后在Python上尝试了一些scipy minimize函数来最小化整个用户的总损失,为A<;>;B,B<;>;C,C<;>;D,D<;>;E、 但效果不好(从我给出的初始点开始就没有改变),这并不奇怪,因为损失函数是一个阶跃函数,几乎所有地方的梯度都为零。你知道吗

这是我给你的参考资料。obj3是一个计算总损失的函数,输入一个列表,其中四个元素按此顺序表示D<;>;E、C<;>;D、B<;>;C、a<;>;B的截止点。因为D<;>;E的截止点总是小于C<;>;D的截止点,以此类推,所以我使用了三个约束条件:

b = (0,max(df['score']))
bounds = (b,b,b,b)

constraints = [
    {'type': 'ineq', 'fun': lambda x: x[1] - x[0]},
    {'type': 'ineq', 'fun': lambda x: x[2] - x[1]},
    {'type': 'ineq', 'fun': lambda x: x[3] - x[2]},
]

sol = minimize(obj3, thres_init, method='SLSQP', bounds=bounds, constraints=constraints) 
sol.x

旧标签和新标签可以有一些洗牌(如按分数降序排列,旧标签类似于'aabdeabd…',一次性暴力方法可能不一定能很好地工作,我想(还没有尝试)

使用排序的唯一得分列表的索引(有5000个唯一得分),我还想到了一些元启发式方法,比如GA(因为现在我们的搜索空间更小,但它是组合优化),但我不确定是否值得投入时间。你知道吗

有人知道解决这个优化问题的算法吗??你知道吗

先谢谢你。你知道吗


Tags: lambda函数用户ltgttype标签分数
1条回答
网友
1楼 · 发布于 2024-04-26 00:02:27

考虑值为0->5的类A->E。你知道吗

Y为按递增顺序排序的数组得分

例如:Y = [1,2,5,...,14*1e6]

X为赋值前的关联值类

例如:X = [0,5,2,3,...1]

假设我们想要在c_ii = 0 to 3c_j > c_i if j > i位置进行4切割

给定的c_i组合的成本是sum_i( |X[0:c_i] - i| )

我们可以编写非常愚蠢的最小化函数(不是有效的python)

#the cost of a class starting from start to end
def cost(start, end, label):
    return sum(abs(X[old: end] - label]))

# old is the starting index
# n is the number of cuts we still have to place
def rec(old, n):
    if n == 0:
        # we can't place more cuts, so it is our final class
        # the cost is the diff for that class "move"
        return cost(old, len(X), n)

    min_cost = Infinity
    # we check every position possible for our next cut
    # note that I am very likely not to be idx accurate
    for c_i in range(old, len(X)):
        rec_cost = rec(c_i+1, n-1)
        current_class_cost = cost(old, c_i+1, n)
        total_cost = rec_cost + current_class_cost

        # we update the min_cost on a better found solution
        # we can trivially backtrack the all the c_i's position as well
        min_cost = min(min_cost, total_cost)

    return total_cost

相关的动态方法如下

take the first cut as label_0
for all positions
    generate the associated cost of that cut for that position. 
    (you could reuse the cost of the previous calculated position, but it is not the point)

take the next cut (as label_1), for all generated (previous) positions as p, 
    for all positions as new_position > p
        put the new_position's cost as min(new_position's cost, p + cost from p to that new_position)

do so until first cut is placed
At that stage, you must for every position add the remaining class cost

空间复杂度:O(n\u分数) 时间复杂度:O(n\u削减*n\u得分*n\u得分)

下面的似乎起作用,但也没有真正测试过。所以要小心。。。你知道吗

X=[0,0,1,1,0,1 ,1, 2, 2, 2 ,1]
Y=[1,2,5,7,8,10,11,11,11,11,12] #just useless

def cost(left, right, label):
    s = 0
    for i in range(left, right+1):
        s += abs(X[i]-label)
    return s

def dp():
    n_scores = len(X)
    positions = [0]*n_scores
    for pos in range(n_scores):
        positions[pos] = {'cost':cost(0, pos, 0), 'cuts':[pos]}

    for label_cut in range(1, max(X)):

        next_positions = [{ 'cost':1e5, 'cuts':[]} for i in range(n_scores)]

        for pos in range(n_scores):
            p = positions[pos]
            for cut in range(pos+1, n_scores-1):
                c = cost(pos+1, cut, label_cut)
                total_cost = c + p['cost']
                next_p = next_positions[cut]
                if total_cost < next_p['cost']:
                    next_positions[cut] = {'cost': total_cost, 'cuts': p['cuts'] + [cut] }
        positions = next_positions

    #print the position for which the cost is minimal
    best = {'cost':1e5}
    for pos in range(n_scores):
        p = positions[pos]
        total_cost = p['cost'] + cost(pos+1, n_scores-1, max(X))
        if total_cost < best['cost']:
            best = {
                'cuts': p['cuts'],
                'cost': total_cost
            }
    return best

相关问题 更多 >