Similar images in data

Detailed description of the simages Python project


🐒 simages 🐒


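Find similar images in a dataset.

Useful for removing duplicate images from a dataset after scraping images with google-images-download.

The Python API returns pairs, distances, where pairs are the (ordered) closest pairs and distances are the corresponding embedding distances.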

Installation

See the installation docs for all details.

pip install simages

Or install from source:

git clone https://github.com/justinshenk/simages
cd simages
pip install .

To install the interactive interface, install mongodb and use pip install "simages[all]".

Demo

  1. Minimal command-line interface with simages-show
  2. Interactive image deletion with simages add/find

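Usage

Two interfaces exist:

  1. matplotlib interface, which plots the duplicates for visual inspection
  2. MongoDB + Flask interface, which allows interactive deletion [optional]

Minimal interface

In your console, change into a directory with images and run simages-show: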

$ simages-show --data-dir .
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM] [-s]

  -h, --help            show this help message and exit
  --data-dir DATA_DIR, -d DATA_DIR
                        Folder containing image data
  --show-train, -t      Show training of embedding extractor every epoch
  --epochs EPOCHS, -e EPOCHS
                        Number of passes of dataset through model for
                        training. More is better but takes more time.
  --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                        Number of channels for data (1 for grayscale, 3 for
                        color)
  --pairs PAIRS, -p PAIRS
                        Number of pairs of images to show
  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                        takes more time)
  -s, --show            Show closest pairs
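
To show how the options in the help text fit together, here is a minimal sketch that rebuilds the same option set with Python's argparse. It is not the project's actual parser: the integer types and the absence of defaults are assumptions, since the help text states neither, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the simages-show options from its help text (types are assumed)."""
    parser = argparse.ArgumentParser(prog="simages-show")
    parser.add_argument("--data-dir", "-d", help="Folder containing image data")
    parser.add_argument("--show-train", "-t", action="store_true",
                        help="Show training of embedding extractor every epoch")
    # The help text gives no types; int is an assumption for the numeric options.
    parser.add_argument("--epochs", "-e", type=int,
                        help="Number of passes of dataset through model for training")
    parser.add_argument("--num-channels", "-c", type=int,
                        help="Number of channels for data (1 for grayscale, 3 for color)")
    parser.add_argument("--pairs", "-p", type=int,
                        help="Number of pairs of images to show")
    parser.add_argument("--zdim", "-z", type=int, help="Compression bits")
    parser.add_argument("-s", "--show", action="store_true", help="Show closest pairs")
    return parser


# Parse a sample command line: grayscale images, 5 epochs, show the 10 closest pairs.
args = build_parser().parse_args(
    ["-d", "photos", "-e", "5", "-c", "1", "-p", "10", "-s"]
)
print(args.data_dir, args.epochs, args.num_channels, args.pairs, args.show)
```

The sample arguments correspond to running simages-show -d photos -e 5 -c 1 -p 10 -s in a console, where photos is a placeholder folder name.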

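Web interface [optional]

Note: To install the web interface API, install and run mongodb and use pip install "simages[all]" to install the optional dependencies.

Add your pictures to the database (this will take some time depending on the number of pictures):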

simages add <images_folder_path>

A webpage will come up showing all of the similar or duplicate pictures:

simages find <images_folder_path>
Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]

Python APIs

Numpy array

from simages import find_duplicates
import numpy as np

array_data = np.random.random((100, 3, 48, 48))  # N x C x H x W
pairs, distances = find_duplicates(array_data)

Folder

from simages import find_duplicates

data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)

Default options for find_duplicates are:

def find_duplicates(
    input: Union[str, np.ndarray],
    n: int = 5,
    num_epochs: int = 2,
    num_channels: int = 3,
    show: bool = False,
    show_train: bool = False,
    **kwargs
):
    """Find duplicates in dataset. Either `array` or `data_dir` must be specified.

    Args:
        input (str or np.ndarray): folder directory or N x C x H x W array
        n (int): number of closest pairs to identify
        num_epochs (int): how long to train the autoencoder (more is generally better)
        show (bool): display the closest pairs
        show_train (bool): show output every
        z_dim (int): size of compression (more is generally better, but slower)
        kwargs (dict): etc, passed to `EmbeddingExtractor`

    Returns:
        pairs (np.ndarray): indices for closest pairs of images, n x 2 array
        distances (np.ndarray): distances of each pair to each other
    """

Embeddings API

from simages import Embeddings
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)

# Access the array
array = embeddings.array  # N x z (compression size)

# Get 5 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)

In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])

In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
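
Because the pairs come back ordered from closest to farthest, one common next step is to keep only the pairs whose distance falls under a cutoff. The sketch below does this with plain Python on the sample values shown above. The `pairs_within` helper and the `THRESHOLD` value are hypothetical and not part of simages; a sensible cutoff depends on the dataset and the training settings.

```python
# Values copied from the sample output above.
pairs = [[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]]
distances = [0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721]


def pairs_within(pairs, distances, threshold):
    """Return (i, j, distance) tuples whose distance is at most `threshold`."""
    return [(i, j, d) for (i, j), d in zip(pairs, distances) if d <= threshold]


THRESHOLD = 0.0016  # hypothetical cutoff, chosen only for this illustration
likely_duplicates = pairs_within(pairs, distances, THRESHOLD)

for i, j, d in likely_duplicates:
    print(f"images {i} and {j}: distance {d:.6f}")
```

With this cutoff the first three pairs are kept and the last two are dropped. The same loop should also work on the NumPy arrays returned by the API, since iterating over an n x 2 array yields one index pair per row.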

EmbeddingExtractor API

from simages import EmbeddingExtractor
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1)  # grayscale

# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)

Class attributes and parameters:

class EmbeddingExtractor:
    """Extract embeddings from data with models and allow visualization.

    Attributes:
        trainloader (torch loader)
        evalloader (torch loader)
        model (torch.nn.Module)
        embeddings (np.ndarray)
    """

    def __init__(
        self,
        input: Union[str, np.ndarray],
        num_channels=None,
        num_epochs=2,
        batch_size=32,
        show_train=True,
        show=False,
        z_dim=8,
        **kwargs,
    ):
        """Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, performs training and validation.

        Args:
            input (np.ndarray or str): data
            num_channels (int): grayscale = 1, color = 3
            num_epochs (int): more is better (generally)
            batch_size (int): number of images per batch
            show_train (bool): show intermediate training results
            show (bool): show closest pairs
            z_dim (int): compression size
            kwargs (dict)
        """

Specify the number of pairs to identify with the parameter n.

How it works

simages uses a convolutional autoencoder with PyTorch and compares the latent representations with closely.
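
To make the comparison step concrete, here is a minimal sketch of finding the closest pairs among a set of embedding vectors. It is a brute-force stand-in that checks every pair, not the algorithm closely uses, which this document does not describe. The Euclidean metric, the `closest_pairs` helper, and the toy vectors are all assumptions made for illustration.

```python
import heapq
import math
from itertools import combinations


def closest_pairs(embeddings, n=5):
    """Return the n closest index pairs and their Euclidean distances, nearest first.

    Brute force: every pair is scored, so the cost grows with the square of
    the number of embeddings.
    """
    scored = (
        (math.dist(embeddings[i], embeddings[j]), (i, j))
        for i, j in combinations(range(len(embeddings)), 2)
    )
    best = heapq.nsmallest(n, scored)
    pairs = [pair for _, pair in best]
    distances = [dist for dist, _ in best]
    return pairs, distances


# Toy 2-D "embeddings"; a trained autoencoder would produce one z_dim-sized vector per image.
embeddings = [
    (0.0, 0.0),
    (1.0, 1.0),
    (0.0, 0.1),
    (5.0, 5.0),
    (1.0, 1.2),
]

pairs, distances = closest_pairs(embeddings, n=2)
print(pairs)
print(distances)
```

In the toy data, vectors 0 and 2 lie closest together and vectors 1 and 4 come next, so those two pairs are returned in that order. Images whose latent vectors nearly coincide are the candidates for duplicates.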

Dependencies

simages depends on the following packages:

The optional dependencies, installed with pip install "simages[all]", include:

  • pymongodb
  • fastcluster
  • flask
  • jinja
  • dnspython
  • python-magic
  • termcolor

Cite

If you use simages, please cite it:

    @misc{justin_shenk_2019_3237830,
      author       = {Justin Shenk},
      title        = {justinshenk/simages: v19.0.1},
      month        = jun,
      year         = 2019,
      doi          = {10.5281/zenodo.3237830},
      url          = {https://doi.org/10.5281/zenodo.3237830}
    }
