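nonechucks is a library that provides wrappers for PyTorch's datasets, samplers, and transforms, allowing unwanted or invalid samples to be dropped dynamically.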



Introduction

What if you have a dataset of thousands of images, of which a few dozen are unreadable because the image files are corrupted? Or what if your dataset is a folder full of scanned PDFs that you have to run OCR on, and then run a language detector on the resulting text, because you want only the ones that are in English? Or maybe you have a custom sampler, and you want it to be able to move on after one of its sources fails while attempting to load!

PyTorch's data processing module expects you to rid your dataset of any unwanted or invalid samples before you feed them into its pipeline, and provides no easy way to define a "fallback policy" in case such samples are encountered during dataset iteration.

Why do I need it?

You might be wondering why this is such a big deal when you could simply filter out samples before sending them to your PyTorch dataset or sampler! Well, it turns out that it can be a huge deal in many cases:

  1. When you have a small fraction of undesirable samples in a large dataset, or
  2. When your sample-loading operation is expensive, or
  3. When you want to let downstream consumers know that a sample is undesirable (with nonechucks, transforms are not restricted to modifying samples; they can drop them as well), or
  4. When you want your dataset and sampler to be decoupled.

In such cases, it's either simply too expensive to have a separate step to weed out bad samples, or it's just plain impossible because you don't even know what constitutes "bad", or worse - both!

nonechucks allows you to wrap your existing datasets and samplers with "safe" versions of them, which can fix all these problems for you.

Examples

1. Dealing with bad samples

Let's start with the simplest use case, which involves wrapping an existing dataset instance with SafeDataset.

Create a dataset (the usual way)

Using something like torchvision's ImageFolder dataset class, we can load an entire folder of labelled images for a typical supervised classification task.

    import torchvision.datasets as datasets

    fruits_dataset = datasets.ImageFolder('fruits/')

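Without nonechucks

Now, if you have a sneaky fruits/apple/143.jpg (that is corrupted) sitting in your fruits/ folder, then to keep the entire pipeline from failing unexpectedly, you would have to resort to something like this: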

    import random

    # Shuffle dataset
    indices = list(range(len(fruits_dataset)))
    random.shuffle(indices)

    batch_size = 4
    for i in range(0, len(indices), batch_size):
        try:
            batch = [fruits_dataset[idx] for idx in indices[i:i + batch_size]]
            # Do something with it
            pass
        except IOError:
            # Skip the entire batch
            continue

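Not only do you have to put your code inside an extra try-except block, but you are also forced to use a for-loop, which deprives you of PyTorch's built-in DataLoader. That means you can't use features like batching, shuffling, multiprocessing, and custom samplers for your dataset.

I don't know about you, but not being able to do that kind of defeats the whole point of using a data processing module for me.

With nonechucks

You can transform your dataset into a SafeDataset with just one line of code: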

    import nonechucks as nc

    fruits_dataset = nc.SafeDataset(fruits_dataset)

That's it! Seriously.

And that's not all. You can also use a DataLoader on top of this:

    dataloader = nc.SafeDataLoader(fruits_dataset, batch_size=4, shuffle=True)
    for i_batch, sample_batched in enumerate(dataloader):
        # Do something with it
        pass

In this case, SafeDataset will skip the erroneous image and use the next one in its place (as opposed to dropping the entire batch).
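To make the "skip and use the next one" behaviour concrete, here is a minimal pure-Python sketch. It is not nonechucks' actual implementation: the class names SkipBadSamples and FlakyFruits are hypothetical, and the toy dataset stands in for an image folder containing one corrupted file.

```python
class SkipBadSamples:
    """Hypothetical wrapper (an illustration, not the nonechucks source)."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self.dataset):
            raise IndexError(idx)
        # Try idx first, then walk forward (wrapping around) until a sample loads.
        for offset in range(len(self.dataset)):
            candidate = (idx + offset) % len(self.dataset)
            try:
                sample = self.dataset[candidate]
            except Exception:
                continue  # unreadable sample: fall through to the next index
            if sample is not None:
                return sample
        raise IndexError("no valid samples in dataset")


class FlakyFruits:
    """Toy dataset in which index 2 stands in for a corrupted file."""

    def __init__(self):
        self.items = ["apple", "banana", "corrupted", "durian"]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        if self.items[idx] == "corrupted":
            raise IOError("cannot read sample %d" % idx)
        return self.items[idx]


safe_fruits = SkipBadSamples(FlakyFruits())
loaded = [safe_fruits[i] for i in range(len(safe_fruits))]
print(loaded)  # ['apple', 'banana', 'durian', 'durian']
```

Note that this naive version returns "durian" twice, because index 2 falls through to index 3. A real wrapper has to keep track of which indices are valid so that no sample is served more than once per pass.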

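2. Use Transforms as Filters!

The role of transforms in PyTorch is restricted to modifying samples. With nonechucks, you can simply return None (or raise an exception) from the transform's __call__ method, and nonechucks will drop the sample from the dataset for you, allowing you to use transforms as filters!

For this example, we'll assume a PDFDocumentsDataset, which reads PDF files from a folder, a PlainTextTransform, which converts the files into raw text, and a LanguageFilter, which retains only documents in a particular language.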

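    import torchvision.transforms as transforms
    import nonechucks as nc

    # PDFDocumentsDataset, PlainTextTransform and detect_language are the
    # assumed helpers described above; they are not defined here.

    class LanguageFilter:
        def __init__(self, language):
            self.language = language

        def __call__(self, sample):
            # Do machine learning magic
            document_language = detect_language(sample)
            if document_language != self.language:
                return None  # signals that the sample should be dropped
            return sample

    # Named `pipeline` so that it does not shadow the `transforms` module
    pipeline = transforms.Compose([
        PlainTextTransform(),
        LanguageFilter('en')
    ])
    en_documents = PDFDocumentsDataset(data_dir='pdf_files/', transform=pipeline)
    en_documents = nc.SafeDataset(en_documents)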
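The idea of "a transform that returns None drops the sample" can be tried without PyTorch at all. The sketch below is only an illustration under stated assumptions: FilteredDataset, ToyLanguageFilter and toy_detect_language are hypothetical names, and the keyword-based detector is a crude stand-in for a real language model.

```python
def toy_detect_language(text):
    """Crude stand-in for a real detector: looks for common English words."""
    english_markers = {"the", "and", "is"}
    return "en" if english_markers & set(text.lower().split()) else "unknown"


class ToyLanguageFilter:
    def __init__(self, language):
        self.language = language

    def __call__(self, sample):
        # Returning None marks the sample as unwanted.
        return sample if toy_detect_language(sample) == self.language else None


class FilteredDataset:
    """Hypothetical wrapper that hides samples whose transform yields None."""

    def __init__(self, samples, transform):
        self.samples = samples
        self.transform = transform
        # Indices of samples that survive the transform (no None, no error).
        self.valid_indices = [
            i for i, sample in enumerate(samples) if self._apply(sample) is not None
        ]

    def _apply(self, sample):
        try:
            return self.transform(sample)
        except Exception:
            return None  # an exception is treated the same as returning None

    def __len__(self):
        return len(self.valid_indices)

    def __getitem__(self, idx):
        return self.transform(self.samples[self.valid_indices[idx]])


docs = ["The cat is asleep", "Le chat dort", "Rain and wind", "Der Hund schläft"]
english_docs = FilteredDataset(docs, ToyLanguageFilter("en"))
kept = [english_docs[i] for i in range(len(english_docs))]
print(kept)  # ['The cat is asleep', 'Rain and wind']
```

This sketch builds its index eagerly, so it runs the transform over every sample up front and again on access. That is exactly the kind of separate, expensive pass the library is meant to spare you, so treat it as a demonstration of the None-means-drop convention rather than of how the filtering should be scheduled.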

Installation

To install nonechucks, simply use pip:

    pip install nonechucks

or clone this repo and build it from source.

Contributing

All PRs are welcome.

Licensing

nonechucks is MIT licensed.
