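junkdetect Python package - PyPI

Junk detector

Detailed description of the junkdetect Python project

Junk, Not-Junk Detector

This tool does one simple task: it tells junk text apart from non-junk text in a variety of languages. Think of the famous hotdog / not-hotdog classifier, but applied to natural language text. It is very useful for testing tools that extract, decompress, and/or decrypt natural language text.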

Setup

# Optionally create a brand new conda environment for this
# conda create -n junkdetect python=3.7
# conda activate junkdetect

# Install: use only one of these methods
# 1. from pypi; recommended
pip install junkdetect

# 2. latest master branch
pip install git+https://github.com/thammegowda/junkdetect

# 3. for development
git clone https://github.com/thammegowda/junkdetect \
  && cd junkdetect \
  && pip install --editable .

How to use

Once you have installed it via pip, you can invoke it from the command line using junkdetect or {}

^{pr2}$

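The output has one line per input, with two columns separated by \t. The first column holds the perplexity score: lower values (close to 0.0) indicate junk, and higher values (close to 1.0) indicate not junk. The second column echoes the input sentence. If you do not want the input sentences in the output, cut them out: simply use junkdetect | cut -f1 > scores.txt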
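The two-column output described above is easy to post-process. The following is a minimal sketch, not part of junkdetect itself: the helper names `parse_scores` and `keep_not_junk`, the 0.5 threshold, and the sample lines are all assumptions made for illustration. It only relies on the documented format of a score, a tab, then the input text.

```python
from typing import Iterable, List, Tuple


def parse_scores(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Parse 'score<TAB>text' lines into (score, text) pairs.

    Hypothetical helper: assumes the tab-separated layout described
    in the text, with the score in the first column.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text survive.
        score, _, text = line.partition("\t")
        pairs.append((float(score), text))
    return pairs


def keep_not_junk(pairs: List[Tuple[float, str]],
                  threshold: float = 0.5) -> List[str]:
    """Keep texts whose score is at or above the threshold.

    The 0.5 default is an arbitrary assumption; tune it on your own data.
    """
    return [text for score, text in pairs if score >= threshold]


if __name__ == "__main__":
    # Made-up sample lines in the documented format (not real tool output).
    sample = ["0.83\tThis is a normal sentence.\n", "0.02\txq#zz @@ 0x1f\n"]
    for text in keep_not_junk(parse_scores(sample)):
        print(text)
```

In practice the lines would come from a file such as the output of junkdetect redirected to disk, read with `open(path, encoding="utf-8")`.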

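What is going on

junkdetect looks like just a few lines of Python code, but a great deal of complexity is hidden under the hood.
It uses perplexity from neural (masked/autoregressive) language models trained on terabytes of web text in more than 100 languages.
Specifically, it uses XLM-R from facebookresearch, retrieved via torch.hub. Quoting the original developers of XLM-R and their paper (see Table 6):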

XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese (Zawgyi font), Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
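To make the idea of a perplexity-based score concrete, here is a minimal sketch of the textbook definition, computed from per-token probabilities. It is not junkdetect's actual code and does not call XLM-R: the function names and the input probabilities are assumptions for illustration. The second function shows one assumed way a score between 0 and 1 could arise, namely the inverse of perplexity, which equals the geometric mean of the token probabilities.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-probability.

    token_probs are the probabilities a language model assigned to each
    observed token; each must be in (0, 1]. A probability of 0 would
    make math.log raise ValueError.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def inverse_perplexity(token_probs: Sequence[float]) -> float:
    """1 / perplexity, i.e. the geometric mean of the token probabilities.

    Assumption for illustration: this lies in (0, 1], so text the model
    finds predictable scores near 1 and surprising text scores near 0.
    """
    return 1.0 / perplexity(token_probs)
```

With this mapping, a sequence whose tokens the model predicts well gets a higher value than one whose tokens it finds unlikely, which matches the direction of the scores described in the usage section.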

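Background and acknowledgements:

The idea came out of a discussion with Tim Allison. He said it is hard to tell whether text has been extracted correctly from PDF files using Apache Tika. Thanks to him for getting me thinking about this.
I had read Facebook's very nice XLM-R paper by Conneau et al., and it stayed in the back of my mind. Although the XLM folks didn't help me get perplexity, and I had to dig it out of their code by myself, I still thank them for making these useful pretrained models easy to use via torch.hub.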

Developers:

Thamme Gowda (wrote version 0.1)


junkdetect 0.1.2