<ol>
<li><p><a href="https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.BaseBlob.detect_language" rel="noreferrer">TextBlob</a>. Requires the NLTK package and calls Google's translation service under the hood, so a network connection is needed.</p>
<pre><code>from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
</code></pre></li>
</ol>
<p><code>pip install textblob</code></p>
<ol start="2">
<li><p><a href="https://polyglot.readthedocs.io/en/latest/Installation.html" rel="noreferrer">Polyglot</a>. Requires numpy and some arcane libraries, and is <s>unlikely to work on Windows</s> (for Windows, get the appropriate versions of <strong>PyICU</strong>, <strong>Morfessor</strong> and <strong>PyCLD2</strong> from <a href="https://www.lfd.uci.edu/~gohlke/pythonlibs/" rel="noreferrer">here</a>, then just <code>pip install downloaded_wheel.whl</code>). It can detect text containing a mix of languages.</p>
<pre><code>from polyglot.detect import Detector
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)
# name: English code: en confidence: 87.0 read bytes: 1154
# name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
# name: un code: un confidence: 0.0 read bytes: 0
</code></pre></li>
</ol>
<p><code>pip install polyglot</code></p>
<p>要安装依赖项,请运行:
<code>sudo apt-get install python-numpy libicu-dev</code></p>
<ol start="3">
<li><p><a href="https://chardet.readthedocs.io/en/latest/usage.html" rel="noreferrer">chardet</a> also has a feature of detecting the language, provided the input contains character bytes in the range (127-255):</p>
<pre><code>>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}
</code></pre></li>
</ol>
<p><code>pip install chardet</code></p>
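<p>The high-byte observation above can be illustrated with the standard library alone (no chardet needed); this is only a sketch of the signal chardet keys on, not its actual algorithm:</p>

```python
# cp1251 (windows-1251) maps Cyrillic letters into the high-byte
# range 0xC0-0xFF, while ASCII characters such as spaces stay below 128.
data = "Я люблю вкусные пампушки".encode("cp1251")
high_bytes = [b for b in data if b >= 128]
print(f"{len(high_bytes)} of {len(data)} bytes fall in the 128-255 range")
# → 21 of 24 bytes fall in the 128-255 range
```

A legacy single-byte encoding packs an entire non-Latin alphabet into that upper half, which is what makes both the encoding and the language guessable.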
<ol start="4">
<li><p><a href="https://pypi.python.org/pypi/langdetect?" rel="noreferrer">langdetect</a> requires large portions of text. It uses a non-deterministic approach under the hood, which means you can get different results for the same text sample. According to the docs, you have to use the following code to make it deterministic:</p>
<pre><code>from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
</code></pre></li>
</ol>
<p><code>pip install langdetect</code></p>
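<p>A toy standard-library sketch (not langdetect's actual algorithm) of why fixing a seed matters: the hypothetical <code>guess</code> below samples characters at random, the same kind of nondeterminism that <code>DetectorFactory.seed = 0</code> pins down:</p>

```python
import random

# Hypothetical toy detector: randomly samples characters and counts how
# many fall in the Cyrillic Unicode block, mimicking randomized sampling.
def guess(text, seed=None):
    rng = random.Random(seed)
    sample = rng.choices(text, k=20)  # random sampling -> nondeterminism
    cyrillic = sum(1 for ch in sample if "\u0400" <= ch <= "\u04ff")
    return "ru" if cyrillic > 10 else "other"

text = "Я люблю вкусные пампушки hello world"
# Unseeded calls may disagree on borderline input; with a fixed seed the
# result is reproducible, just like langdetect after DetectorFactory.seed = 0.
print(guess(text, seed=0) == guess(text, seed=0))  # True
```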
<ol start="5">
<li><a href="https://bitbucket.org/spirit/guess_language" rel="noreferrer">guess_language</a> can detect very short samples by using <a href="https://pythonhosted.org/pyenchant/" rel="noreferrer">this</a> spell checker with dictionaries.</li>
</ol>
<p><code>pip install guess_language-spirit</code></p>
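<p>The dictionary-lookup idea behind guess_language can be sketched in a few lines of plain Python; the tiny word lists here are hypothetical stand-ins for the real enchant dictionaries:</p>

```python
# Toy illustration of dictionary-based detection: score a short sample
# against per-language word lists and pick the best-matching language.
WORDS = {
    "en": {"the", "is", "hello", "world", "and"},
    "fr": {"le", "est", "bonjour", "monde", "et"},
}

def dict_guess(text):
    tokens = text.lower().split()
    scores = {lang: sum(t in vocab for t in tokens)
              for lang, vocab in WORDS.items()}
    return max(scores, key=scores.get)

print(dict_guess("bonjour le monde"))  # fr
```

Because every token is checked against a full dictionary, even a two- or three-word sample can produce a confident match, which is why this approach works on much shorter input than the statistical detectors above.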
<ol start="6">
<li><p><a href="https://github.com/saffsd/langid.py" rel="noreferrer">langid</a> provides both a Python module</p>
<pre><code>import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)
</code></pre></li>
</ol>
<p>and a command-line tool:</p>
<pre><code> $ langid < README.md
</code></pre>
<p><code>pip install langid</code></p>
<ol start="7">
<li><p><a href="https://fasttext.cc" rel="noreferrer">FastText</a> is a text classifier that can be used to recognize 176 languages with the appropriate <a href="https://fasttext.cc/docs/en/language-identification.html" rel="noreferrer">models for language classification</a>. Download <a href="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz" rel="noreferrer">this model</a>, then:</p>
<pre><code>import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('الشمس تشرق', k=2)) # top 2 matching languages
# (('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))
</code></pre></li>
</ol>
<p><code>pip install fasttext</code></p>