nonechucks is a library that provides wrappers for PyTorch's datasets, samplers, and transforms, allowing you to drop unwanted or invalid samples dynamically.


Introduction

What if you have a dataset of 1000s of images, out of which a few dozen images are unreadable because the image files are corrupted? Or what if your dataset is a folder full of scanned PDFs that you have to OCRize, and then run a language detector on the resulting text, because you want only the ones that are in English? Or maybe you have an ^{}, and you want to be able to move to ^{} after ^{} fails while attempting to load!

PyTorch's data processing module expects you to rid your dataset of any unwanted or invalid samples before you feed them into its pipeline, and provides no easy way to define a "fallback policy" in case such samples are encountered during dataset iteration.

Why do I need it?

You might be wondering why this is such a big deal when you could simply filter out samples before sending them to your PyTorch dataset or sampler! Well, it turns out that it can be a huge deal in many cases:

  1. When you have a small fraction of undesirable samples in a large dataset, or
  2. When your sample-loading operation is expensive, or
  3. When you want to let downstream consumers know that a sample is undesirable (with nonechucks, transforms are not restricted to modifying samples; they can drop them as well), or
  4. When you want your dataset and sampler to be decoupled.

In such cases, it's either simply too expensive to have a separate step to weed out bad samples, or it's just plain impossible because you don't even know what constitutes "bad", or worse, both!

nonechucks allows you to wrap your existing datasets and samplers with "safe" versions of them, which can fix all these problems for you.

Examples

1. Dealing with bad samples

Let's start with the simplest use case, which involves wrapping an existing Dataset instance with SafeDataset.

Create a dataset (the usual way)

Using something like torchvision's ImageFolder dataset class, we can load an entire folder of labelled images for a typical supervised classification task.

    import torchvision.datasets as datasets

    fruits_dataset = datasets.ImageFolder('fruits/')

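Without nonechucks

Now, if you have a sneaky fruits/apple/143.jpg (that is corrupted) sitting in your fruits/ folder, then to keep the entire pipeline from failing unexpectedly, you would have to resort to something like this: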

    import random

    # Shuffle dataset
    indices = list(range(len(fruits_dataset)))
    random.shuffle(indices)

    batch_size = 4
    for i in range(0, len(indices), batch_size):
        try:
            batch = [fruits_dataset[idx] for idx in indices[i:i + batch_size]]
            # Do something with it
            pass
        except IOError:
            # Skip the entire batch
            continue

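Not only do you have to put your code inside an extra try-except block, but you are also forced to use a for loop, which deprives you of PyTorch's built-in DataLoader. That means you can't use features like batching, shuffling, multiprocessing, and custom samplers for your dataset.

I don't know about you, but not being able to do that kind of defeats the whole point of using a data processing module for me.

With nonechucks

You can transform your dataset into a SafeDataset with a single line of code: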

    import nonechucks as nc

    fruits_dataset = nc.SafeDataset(fruits_dataset)

That's it! Seriously.

And that's not all. You can also use a DataLoader on top of this:

    dataloader = nc.SafeDataLoader(fruits_dataset, batch_size=4, shuffle=True)
    for i_batch, sample_batched in enumerate(dataloader):
        # Do something with it
        pass

In this case, SafeDataset will skip the erroneous image and use the next one in its place (as opposed to dropping the entire batch).
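To make the "use the next one in its place" behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not nonechucks' actual implementation: SkippingDataset and FlakyDataset are hypothetical names made up for this illustration, and the sketch assumes that a bad sample either raises an exception or comes back as None.

```python
class FlakyDataset:
    """Hypothetical toy dataset: index 2 stands in for a corrupted file."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if idx == 2:
            raise IOError(f"sample {idx} is corrupted")
        return idx * 10


class SkippingDataset:
    """Illustrative wrapper (not the nonechucks API): on a bad sample,
    fall through to the next index instead of failing."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Try idx, then idx + 1, and so on, until a sample loads.
        for candidate in range(idx, len(self.dataset)):
            try:
                sample = self.dataset[candidate]
            except Exception:
                continue  # loading failed: treat it as a bad sample
            if sample is not None:
                return sample
        raise IndexError(f"no valid sample at or after index {idx}")

    def __iter__(self):
        # Yield every valid sample exactly once, skipping the bad ones.
        for idx in range(len(self.dataset)):
            try:
                sample = self.dataset[idx]
            except Exception:
                continue
            if sample is not None:
                yield sample


safe = SkippingDataset(FlakyDataset(5))
valid_samples = list(safe)  # the corrupted index 2 is silently skipped
replacement = safe[2]       # index 2 is bad, so the sample at index 3 is used
```

One consequence of this design is that indexing can return the same sample for two neighbouring indices (here, indices 2 and 3), which is why the sketch gives iteration its own path that visits each valid sample only once.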

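2. Use Transforms as Filters!

The function of transforms in PyTorch is restricted to modifying samples. With nonechucks, you can simply return None (or raise an exception) from the transform's __call__ method, and nonechucks will drop the sample from the dataset for you, allowing you to use transforms as filters!

For this example, we'll assume a PDFDocumentsDataset, which reads PDF files from a folder, a PlainTextTransform, which converts the files into raw text, and a LanguageFilter, which retains only documents in a particular language.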

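    import nonechucks as nc
    from torchvision import transforms

    # PDFDocumentsDataset, PlainTextTransform, and detect_language are
    # assumed to be defined elsewhere; they are not part of nonechucks.

    class LanguageFilter:
        def __init__(self, language):
            self.language = language

        def __call__(self, sample):
            # Do machine learning magic
            document_language = detect_language(sample)
            if document_language != self.language:
                # Returning None tells nonechucks to drop this sample
                return None
            return sample

    # Named text_transforms so that it does not shadow the transforms module
    text_transforms = transforms.Compose([
        PlainTextTransform(),
        LanguageFilter('en')
    ])
    en_documents = PDFDocumentsDataset(data_dir='pdf_files/', transform=text_transforms)
    en_documents = nc.SafeDataset(en_documents)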
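The filter pattern itself can be tried without PyTorch or PDFs. The sketch below is an illustration under stated assumptions, not the library's code: compose, keep_if_english, and filtered are hypothetical helpers, and the "language detection" is a toy stand-in that only looks for a few common English words.

```python
def compose(*transforms):
    """Chain transforms; stop early once one of them returns None."""
    def apply(sample):
        for transform in transforms:
            sample = transform(sample)
            if sample is None:
                return None  # a filter rejected the sample
        return sample
    return apply


def strip_whitespace(text):
    # An ordinary transform: it modifies the sample.
    return text.strip()


ENGLISH_HINTS = {"the", "and", "is"}  # toy heuristic, for illustration only


def keep_if_english(text):
    # A filter transform: returns None to signal "drop this sample".
    words = set(text.lower().split())
    return text if words & ENGLISH_HINTS else None


def filtered(samples, transform):
    """Yield only the transformed samples that were not dropped."""
    for sample in samples:
        result = transform(sample)
        if result is not None:
            yield result


documents = ["  the cat is here ", "le chat est ici", " sun and moon"]
pipeline = compose(strip_whitespace, keep_if_english)
english_documents = list(filtered(documents, pipeline))
```

Stopping at the first None matters: later transforms in the chain never see a rejected sample, so they do not need to handle None themselves.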

Installation

To install nonechucks, simply use pip:

    pip install nonechucks

or clone this repo, and build from source with:

^{}.

Contributing

All PRs are welcome.

Licensing

nonechucks is MIT licensed.
