sample-lines (Python package, PyPI)

Sample lines from a file.


Installation

Install it with pip.

pip install sample-lines

How to use

For documentation, see the built-in help.

$ sample-lines -h
usage: Randomly select lines from a file. [-h] [--sample-size N]
                                          [--method {simple-random,systematic}]
                                          [--repeat REPEAT]
                                          file

positional arguments:
  file

optional arguments:
  -h, --help            show this help message and exit
  --sample-size N, -n N
                        Number of lines to emit
  --method {simple-random,systematic}, -m {simple-random,systematic}
                        Sampling method
  --repeat REPEAT, -r REPEAT
                        Number of repetitions for systematic sampling

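Sampling is done with replacement and is weighted by line length: the probability of selecting a line is proportional to the length of the line before it. This makes sampling very fast, but it is only appropriate when the file has fairly consistent line lengths, or at least no periodic variation in line length.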
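The weighting described above is what one would expect from sampling by random byte offset rather than by line number. The following is a minimal sketch of that idea, not the sample-lines source code: the function name `sample_by_offset` is hypothetical, and the wrap-around to the first line is an assumption made here so that every line can be chosen.

```python
import os
import random


def sample_by_offset(path, n, rng=random):
    """Return n lines drawn with replacement by seeking to random byte offsets.

    Hypothetical helper for illustration; not the sample-lines source code.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot sample from an empty file")
    sampled = []
    with open(path, "rb") as f:
        for _ in range(n):
            # Jump to a uniformly random byte; longer lines are hit more often.
            f.seek(rng.randrange(size))
            # Discard the rest of the line we landed in.
            f.readline()
            # The following line is the sampled one, so its selection
            # probability is proportional to the length of the line before it.
            line = f.readline()
            if not line:
                # Landed in the last line: wrap around to the first line
                # (an assumption made here so that every line can be chosen).
                f.seek(0)
                line = f.readline()
            sampled.append(line)
    return sampled
```

Because each draw costs one seek and two short reads, the running time depends on the sample size rather than on the size of the file, which is consistent with the speed claim above.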

How fast

Consider this 1 GB CSV file.

$ wc big-file.csv
 2388430 27673790 1071895374 big-file.csv

Running wc on it takes almost four seconds.

$ time wc big-file.csv
 2388430 27673790 1071895374 big-file.csv

real    0m3.789s
user    0m3.560s
sys     0m0.190s

Here is how long it takes just to iterate over every line of the file in Python.

$ time python3 -c 'for line in open("big-file.csv"): pass'

real    0m2.892s
user    0m2.641s
sys     0m0.245s

sample-lines is faster. Here is a simple random sample of 40 lines,

$ time sample-lines -n 40 -m simple-random big-file.csv > /dev/null

real    0m0.136s
user    0m0.113s
sys     0m0.018s

a systematic sample of 40 lines,

$ time sample-lines -n 40 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.148s
user    0m0.122s
sys     0m0.019s

and a repeated systematic sample, 4 repetitions of 10 lines each, for a total of 40 lines.

$ time sample-lines -n 10 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.175s
user    0m0.140s
sys     0m0.025s
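To make the idea of a repeated systematic sample concrete, here is a rough sketch of how evenly spaced byte offsets with one random start per repetition could be generated. It is an illustration under stated assumptions only: the helper `systematic_offsets` is hypothetical, and how sample-lines itself combines -n and -r is not confirmed here.

```python
import random


def systematic_offsets(size, n, repeats=1, rng=random):
    """Byte offsets for `repeats` systematic samples of n positions each.

    Hypothetical helper; how sample-lines itself combines -n and -r is not
    confirmed here.
    """
    if n <= 0 or size < n:
        raise ValueError("need 0 < n <= size")
    step = size // n  # fixed spacing between consecutive offsets
    offsets = []
    for _ in range(repeats):
        start = rng.randrange(step)  # one random start per repetition
        offsets.extend(start + i * step for i in range(n))
    return offsets
```

Each offset would then be resolved to a line by seeking to it and reading forward to the next line boundary. With a fixed spacing like this, a periodic pattern in line lengths could line up with the step, which is one reason such variation would be a concern.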

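In the examples above, most of the time is spent loading Python and the various modules; printing the help takes almost as long as running the samples.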

$ time sample-lines -h > /dev/null

real    0m0.157s
user    0m0.129s
sys     0m0.021s

So even a fairly large sample still runs quickly.

$ time sample-lines -n 2000 -m systematic -r 50 big-file.csv > /dev/null

real    0m2.695s
user    0m2.435s
sys     0m0.231s

Alternatives

Use sample if you want to sample from a stream.
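A stream cannot be seeked, so sampling from one needs a different technique. As an illustration only, here is a sketch of reservoir sampling (Algorithm R), a standard approach for streams of unknown length. Whether the sample tool works this way is not stated here, and the function name `reservoir_sample` is hypothetical. Unlike sample-lines, this draws a uniform sample without replacement.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Keep a uniform random sample of up to k items from any iterable.

    Hypothetical helper illustrating Algorithm R; a single pass, O(k) memory.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = line
    return reservoir
```

For example, `reservoir_sample(sys.stdin, 40)` (after `import sys`) would keep 40 lines from standard input, at the cost of reading the entire stream once.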

sample-lines 0.0.4