What is an efficient way to run a logistic regression on a large dataset (200 million observations by 2 variables)?

Asked 2025-04-18 15:24

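I am trying to run a logistic regression model. My data has two variables: one response variable and one predictor variable. The catch is that I have 200 million observations. I would like to run the model in R, Stata, or MATLAB, but I am having a lot of difficulty, even on Amazon EC2 instances. I suspect the problem lies in how the logistic regression functions are implemented in these languages. Is there another way to run a logistic regression quickly? The problem at the moment is that my data quickly fills up whatever memory is available. I have even tried a machine with 30 GB of RAM, to no avail. Any solution is welcome.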

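Tags: big data processing, data analysis, memory optimization, high-performance computing, machine learning, statistical modeling, logistic regression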

1 Answer

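If your main problem is that memory limits keep you from estimating the logit model at all, rather than the speed of the estimation, you can take advantage of the additivity of the log-likelihood in maximum likelihood estimation and write a custom program for ml. A logit model is simply maximum likelihood estimation using the logistic distribution. Having only one independent variable makes the problem much simpler. I have simulated the problem below. You should split the following code blocks into two do-files.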
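The additivity mentioned here can be written out explicitly. This is a sketch in notation that does not appear in the original answer: $y_i$ stands for depvar, $x_i$ for indepvar, $\beta_1$ and $\beta_0$ for the slope and the constant, $\Lambda$ for the logistic CDF, and $P_1,\dots,P_K$ for the pieces the dataset is split into.

$$
\ell_i(\beta_0,\beta_1) = y_i \ln \Lambda(\beta_0 + \beta_1 x_i) + (1 - y_i)\ln\bigl(1 - \Lambda(\beta_0 + \beta_1 x_i)\bigr),
\qquad
\Lambda(z) = \frac{1}{1 + e^{-z}}
$$

$$
\ell(\beta_0,\beta_1) = \sum_{i=1}^{N} \ell_i(\beta_0,\beta_1) = \sum_{k=1}^{K} \sum_{i \in P_k} \ell_i(\beta_0,\beta_1)
$$

Because the total is a plain sum over observations, each piece can be loaded, summed, and discarded in turn, and only the running total has to stay in memory.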

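If you have no problem loading the entire dataset (and you should not: my simulation used only about 2 GB of RAM for 200 million observations and 2 variables, though your mileage may vary), the first step is to break the dataset into smaller pieces. For example: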

depvar = your dependent variable (0 or 1)
indepvar = your independent variable (some numeric data type)

cd "/path/to/largelogit"

clear all
set more off

set obs 200000000

// We have two variables, an independent variable and a dependent variable.
gen indepvar = 10*runiform()
gen depvar = .

// As indepvar increases, the probability of depvar being 1 also increases.
replace depvar = 1 if indepvar > ( 5 + rnormal(0,2) )
replace depvar = 0 if depvar == .

save full, replace
clear all

// Need to split the dataset into manageable pieces

local max_opp = 20000000    // maximum observations per piece

local obs_num = `max_opp'

local i = 1
while `obs_num' == `max_opp' {

    clear

    local h = `i' - 1

    local obs_beg = (`h' * `max_opp') + 1
    local obs_end = (`i' * `max_opp')

    capture noisily use in `obs_beg'/`obs_end' using full

    if _rc == 198 {
        capture noisily use in `obs_beg'/l using full
    }
    if _rc == 198 { 
        continue,break
    }

    save piece_`i', replace

    sum
    local obs_num = `r(N)'

    local i = `i' + 1

}

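Next, to minimize memory usage, close Stata and reopen it. When you create a dataset this large, Stata keeps some memory allocated for overhead and the like even after you clear the dataset. You can type memory after save full and again after clear all to see what I mean.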

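Next, you need to define your own custom ml program. It feeds in the pieces one at a time, calculates the log-likelihood of each observation, and adds them all together. You need to use the d0 ml method rather than the lf method, because the optimization routine for lf requires all of the data to be loaded into Stata.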

clear all
set more off

cd "/path/to/largelogit"

// This local stores the names of all the pieces 
local p : dir "/path/to/largelogit" files "piece*.dta"

local i = 1
foreach j of local p {    // Loop through all the names to count the pieces

    global pieces = `i'    // This is important for the program
    local i = `i' + 1

}

// Generate our custom MLE logit program. This is using the d0 ml method 

program define llogit_d0

    args todo b lnf 

    tempvar y xb llike tot_llike it_llike

quietly {

    forvalues i=1/$pieces {

        capture drop _merge
        capture drop depvar indepvar
        capture drop `y'
        capture drop `xb'
        capture drop `llike' 
        capture scalar drop `it_llike'

        merge 1:1 _n using piece_`i'

        generate int `y' = depvar

        generate double `xb' = (indepvar * `b'[1,1]) + `b'[1,2]    // The linear combination of the coefficients and independent variable and the constant

        generate double `llike' = .

        replace `llike' = ln(invlogit( `xb')) if `y'==1    // the log of the probability should the dependent variable be 1
        replace `llike' = ln(1-invlogit(`xb')) if `y'==0   // the log of the probability should the dependent variable be 0

        sum `llike' 
        scalar `it_llike' = `r(sum)'    // The sum of the logged probabilities for this iteration

        if `i' == 1     scalar `tot_llike' = `it_llike'    // Total log likelihood for first iteration
        else            scalar `tot_llike' = `tot_llike' + `it_llike' // Total log likelihood is the sum of all the iterated log likelihoods `it_llike'

    }

    scalar `lnf' = `tot_llike'   // The total log likelihood which must be returned to ml

}

end

// Load the first piece so that ml has depvar and indepvar to reference, then fit the model

use piece_1, clear

ml model d0 llogit_d0 (beta : depvar = indepvar )
ml search
ml maximize

I just ran the two code blocks above and got the following output:

[Image: Large Logit Output]

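Pros and cons of this approach:

Pros:

The smaller max_opp is, the lower the memory usage. I never used more than about 1 GB with the simulation above.

You end up with unbiased estimators, the full log-likelihood for the entire dataset, and the correct standard errors: basically everything that matters for inference.

Cons:

What you save in memory you pay for in CPU time. I ran this on my personal laptop with Stata SE (one core) and an i5 processor, and it took all night.

The Wald Chi2 statistic is wrong, but I believe you can calculate it from the correct quantities mentioned above.

You do not get a Pseudo R2 as you would with logit.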
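The two missing statistics can probably be recovered by hand from what ml does store. The following is a minimal sketch, not part of the original answer. It assumes it is run right after ml maximize in the same session, so that e(ll), the equation name beta, the global $pieces, and the piece_*.dta files are all still available. The scalar names are made up for illustration. With a single predictor, the Wald chi-squared is the squared z statistic of the slope, and McFadden's pseudo R2 is one minus the ratio of the fitted log-likelihood to the constant-only log-likelihood.

```stata
// Save the quantities from the fitted model before the data in memory changes
scalar ll_full = e(ll)                                  // total log likelihood returned by llogit_d0
scalar wald_z  = _b[beta:indepvar] / _se[beta:indepvar] // z statistic of the slope

display "Wald chi2(1) = " wald_z^2
display "Prob > chi2  = " chi2tail(1, wald_z^2)

// Count observations and ones across all pieces, one piece at a time
scalar n_tot = 0
scalar n_one = 0
forvalues i = 1/$pieces {
    use depvar using piece_`i', clear
    quietly summarize depvar
    scalar n_tot = n_tot + r(N)
    scalar n_one = n_one + r(sum)
}

// Constant-only logit log likelihood has a closed form in the share of ones
scalar p_one = n_one / n_tot
scalar ll_0  = n_one*ln(p_one) + (n_tot - n_one)*ln(1 - p_one)

display "McFadden pseudo R2 = " 1 - ll_full/ll_0
```

The loop only ever holds one variable of one piece in memory, so it stays within the same memory budget as the estimation itself.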

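To test whether the coefficients really are the same as those from a standard logit, set obs to something relatively small, such as 100000, and set max_opp to something like 1000. Run my code and look at the output, then run logit depvar indepvar and look at the output: they are the same apart from what I mention under "Cons" above. Setting obs equal to max_opp will correct the Wald Chi2 statistic.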
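One way to line the two sets of results up side by side is sketched below. It assumes the small test dataset has already been split and that the piecewise model has just been fitted with ml maximize. The stored-estimate names piecewise and standard are arbitrary choices for this illustration.

```stata
// Right after ml maximize on the pieces: keep the piecewise results
estimates store piecewise

// Fit the ordinary logit on the full (small) test dataset
use full, clear
logit depvar indepvar
estimates store standard

// Coefficients, standard errors, and log likelihoods side by side
estimates table piecewise standard, se stats(ll)
```

The rows for the two models may appear under different equation names (beta for the custom program, depvar for logit), so compare the slope and constant by position rather than by label.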

Answered 2025-04-18 by Python大师
