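Similar images in data
Detailed description of the simages Python project
:monkey: simages :monkey:
Find similar images in a dataset.
Useful for removing duplicate images from a dataset after scraping images with google-images-download.
The Python API returns pairs, distances, where pairs are the (ordered) closest pairs and distances are the corresponding embedding distances.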
Installation
See the installation docs for all details.
pip install simages
or install from source:
git clone https://github.com/justinshenk/simages
cd simages
pip install .
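To install the interactive interface, install mongodb and use pip install "simages[all]".
Demo
- Minimal command-line interface, simages-show:
- Interactive image deletion with simages add/find:
Usage
Two interfaces exist:
- matplotlib interface, which plots the duplicates for visual inspection
- MongoDB + Flask interface, which allows interactive deletion [optional]
Minimal interface
In your console, change to a directory with images and run simages-show: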
$ simages-show --data-dir .
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM] [-s]
  -h, --help            show this help message and exit
  --data-dir DATA_DIR, -d DATA_DIR
                        Folder containing image data
  --show-train, -t      Show training of embedding extractor every epoch
  --epochs EPOCHS, -e EPOCHS
                        Number of passes of dataset through model for
                        training. More is better but takes more time.
  --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                        Number of channels for data (1 for grayscale, 3 for
                        color)
  --pairs PAIRS, -p PAIRS
                        Number of pairs of images to show
  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                        takes more time)
  -s, --show            Show closest pairs
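To make the flag semantics concrete, here is a minimal argparse sketch that accepts the same flags as the help text above. It is an illustrative reconstruction, not the parser simages ships: `build_parser` is a hypothetical name, the integer types are assumed, and no defaults are set because the help text does not state any.

```python
import argparse


def build_parser():
    """Build a parser accepting the flags listed in the simages-show help text.

    Hypothetical reconstruction: types are assumed and defaults are left unset.
    """
    parser = argparse.ArgumentParser(prog="simages-show")
    parser.add_argument("--data-dir", "-d", help="Folder containing image data")
    parser.add_argument("--show-train", "-t", action="store_true",
                        help="Show training of embedding extractor every epoch")
    parser.add_argument("--epochs", "-e", type=int,
                        help="Number of passes of dataset through model")
    parser.add_argument("--num-channels", "-c", type=int,
                        help="1 for grayscale, 3 for color")
    parser.add_argument("--pairs", "-p", type=int,
                        help="Number of pairs of images to show")
    parser.add_argument("--zdim", "-z", type=int, help="Compression bits")
    parser.add_argument("--show", "-s", action="store_true",
                        help="Show closest pairs")
    return parser


# Parse an example command line: current directory, 10 pairs, show the result.
args = build_parser().parse_args(["--data-dir", ".", "--pairs", "10", "-s"])
print(args.data_dir, args.pairs, args.show, args.epochs)
```

Flags that are not passed stay at `None` (or `False` for the switches) in this sketch; the real tool may apply its own defaults.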
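Web interface [optional]
Note: To install the web interface API, install and run mongodb and use pip install "simages[all]" to install the optional dependencies.
Add your pictures to the database (this will take some time, depending on the number of pictures):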
simages add <images_folder_path>
A webpage will come up showing all of the similar or duplicate pictures:
simages find <images_folder_path>
Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                  Show this screen
    --db=<db_path>              The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes>  The number of parallel processes to run to hash the image
                                files (default: number of CPUs).
    find:
        --print                 Only print duplicate files rather than displaying HTML file
        --delete                Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time            Adds the extra constraint that duplicate images must have the
                                same capture times in order to be considered.
        --trash=<trash_path>    Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>       Epochs for training [default: 2]
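The --delete and --trash options describe moving duplicates into a trash folder rather than erasing them. The sketch below illustrates that idea with the standard library only. It is not the simages implementation: `move_to_trash` is a hypothetical helper, and the renaming scheme for name clashes is an assumption made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def move_to_trash(paths, trash_dir):
    """Move each file in `paths` into `trash_dir` and return the new paths.

    Hypothetical helper for illustration; files are moved, not deleted.
    """
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in map(Path, paths):
        target = trash / path.name
        # Avoid overwriting a file with the same name already in the trash.
        counter = 1
        while target.exists():
            target = trash / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved


# Demonstration in a temporary directory with placeholder files.
workdir = Path(tempfile.mkdtemp())
original = workdir / "cat.jpg"
duplicate = workdir / "cat_copy.jpg"
original.write_bytes(b"fake image bytes")
duplicate.write_bytes(b"fake image bytes")

moved = move_to_trash([duplicate], workdir / "Trash")
print(original.exists(), duplicate.exists(), moved[0].name)
```

Moving instead of deleting keeps the operation reversible, so a false positive can be restored from the trash folder.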
Python API
NumPy array
from simages import find_duplicates
import numpy as np

# np.random.random takes the shape as a single tuple
array_data = np.random.random((100, 3, 48, 48))  # N x C x H x W
pairs, distances = find_duplicates(array_data)
Folder
from simages import find_duplicates

data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)
Default options for find_duplicates are:
def find_duplicates(
    input: Union[str, np.ndarray],
    n: int = 5,
    num_epochs: int = 2,
    num_channels: int = 3,
    show: bool = False,
    show_train: bool = False,
    **kwargs
):
    """Find duplicates in dataset. Either `array` or `data_dir` must be specified.

    Args:
        input (str or np.ndarray): folder directory or N x C x H x W array
        n (int): number of closest pairs to identify
        num_epochs (int): how long to train the autoencoder (more is generally better)
        show (bool): display the closest pairs
        show_train (bool): show output every
        z_dim (int): size of compression (more is generally better, but slower)
        kwargs (dict): etc, passed to `EmbeddingExtractor`

    Returns:
        pairs (np.ndarray): indices for closest pairs of images, n x 2 array
        distances (np.ndarray): distances of each pair to each other
    """
Embeddings API
from simages import Embeddings
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)

# Access the array
array = embeddings.array  # N x z (compression size)

# Get 5 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)
In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])
In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
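Because the pairs come back ordered by distance, a common follow-up is to keep only those below some cutoff. The sketch below uses the values from the example output above as plain lists (NumPy arrays can be converted with `.tolist()`). `pairs_below` is a hypothetical helper and the threshold of 0.0016 is an arbitrary value chosen for illustration; a suitable cutoff depends on the dataset and training run.

```python
# Values copied from the example output above.
pairs = [[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]]
distances = [0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721]


def pairs_below(pairs, distances, threshold):
    """Keep only the pairs whose embedding distance is below `threshold`."""
    return [tuple(p) for p, d in zip(pairs, distances) if d < threshold]


THRESHOLD = 0.0016  # hypothetical cutoff, not a simages default
likely_duplicates = pairs_below(pairs, distances, THRESHOLD)
print(likely_duplicates)
```

With this cutoff the first three pairs are kept and the last two are dropped.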
EmbeddingExtractor API
from simages import EmbeddingExtractor
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1)  # grayscale

# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)
Class attributes and parameters:
class EmbeddingExtractor:
    """Extract embeddings from data with models and allow visualization.

    Attributes:
        trainloader (torch loader)
        evalloader (torch loader)
        model (torch.nn.Module)
        embeddings (np.ndarray)
    """

    def __init__(
        self,
        input: Union[str, np.ndarray],
        num_channels=None,
        num_epochs=2,
        batch_size=32,
        show_train=True,
        show=False,
        z_dim=8,
        **kwargs,
    ):
        """Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, performs training and validation.

        Args:
            input (np.ndarray or str): data
            num_channels (int): grayscale = 1, color = 3
            num_epochs (int): more is better (generally)
            batch_size (int): number of images per batch
            show_train (bool): show intermediate training results
            show (bool): show closest pairs
            z_dim (int): compression size
            kwargs (dict)
        """
Specify the number of pairs to identify with the parameter n.
How it works
simages uses a convolutional autoencoder with PyTorch and compares the latent representations with closely.
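The comparison step can be pictured as a closest-pair search over the latent vectors. The sketch below is a brute-force illustration in pure Python, assuming Euclidean distance; it is not the algorithm closely uses, and `closest_pairs` and the sample vectors are hypothetical.

```python
import itertools
import math


def closest_pairs(embeddings, n=5):
    """Return the n closest pairs of rows and their Euclidean distances.

    Brute-force O(N^2) illustration, not the algorithm used by `closely`.
    """
    scored = []
    for i, j in itertools.combinations(range(len(embeddings)), 2):
        scored.append((math.dist(embeddings[i], embeddings[j]), i, j))
    scored.sort()  # smallest distance first
    top = scored[:n]
    pairs = [[i, j] for _, i, j in top]
    distances = [d for d, _, _ in top]
    return pairs, distances


# Hypothetical 2-D latent vectors for five images.
latent = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.0, 0.1],
    [5.0, 5.0],
    [1.0, 1.2],
]
pairs, distances = closest_pairs(latent, n=2)
print(pairs)
print(distances)
```

Images 0 and 2, and images 1 and 4, sit closest together in this toy latent space, so they are returned first. Comparing all pairs scales quadratically with the number of images, which is why a dedicated closest-pair library is preferable for large datasets.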
Dependencies
simages depends on the following packages:
- closely
- torch
- torchvision
- scikit-learn
- matplotlib
The optional dependencies, installed with pip install "simages[all]", are:
- pymongodb
- fastcluster
- flask
- jinja
- dnspython
- python-magic
- termcolor
Cite
If you use simages, please cite it:
@misc{justin_shenk_2019_3237830,
author = {Justin Shenk},
title = {justinshenk/simages: v19.0.1},
month = jun,
year = 2019,
doi = {10.5281/zenodo.3237830},
url = {https://doi.org/10.5281/zenodo.3237830}
}