<ol>
<li><p><a href="https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.BaseBlob.detect_language" rel="noreferrer">TextBlob</a>. Requires the NLTK package and calls Google's translation service under the hood, so a network connection is needed.</p>
<pre><code>from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
</code></pre></li>
</ol>
<p><code>pip install textblob</code></p>
<ol start="2">
<li><p><a href="https://polyglot.readthedocs.io/en/latest/Installation.html" rel="noreferrer">Polyglot</a>. Requires numpy and some arcane libraries, and is <s>unlikely to work on Windows</s> (for Windows, get the appropriate versions of <strong>PyICU</strong>, <strong>Morfessor</strong> and <strong>PyCLD2</strong> from <a href="https://www.lfd.uci.edu/~gohlke/pythonlibs/" rel="noreferrer">here</a>, then just <code>pip install downloaded_wheel.whl</code>). It can detect text containing a mix of languages.</p>
<pre><code>from polyglot.detect import Detector
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)
# name: English code: en confidence: 87.0 read bytes: 1154
# name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
# name: un code: un confidence: 0.0 read bytes: 0
</code></pre></li>
</ol>
<p><code>pip install polyglot</code></p>
<p>要安装依赖项,请运行:
<code>sudo apt-get install python-numpy libicu-dev</code></p>
<ol start="3">
<li><p><a href="https://chardet.readthedocs.io/en/latest/usage.html" rel="noreferrer">chardet</a> also has a feature of detecting the language, provided the input contains character bytes in the range (127-255):</p>
<pre><code>>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}
</code></pre></li>
</ol>
<p><code>pip install chardet</code></p>
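<p>The high-byte observation above can be illustrated with the standard library alone (no chardet needed); this is only a sketch of the signal chardet keys on, not its actual algorithm:</p>

```python
# cp1251 (windows-1251) maps Cyrillic letters into the high-byte
# range 0xC0-0xFF, while ASCII characters such as spaces stay below 128.
data = "Я люблю вкусные пампушки".encode("cp1251")
high_bytes = [b for b in data if b >= 128]
print(f"{len(high_bytes)} of {len(data)} bytes fall in the 128-255 range")
# → 21 of 24 bytes fall in the 128-255 range
```

A legacy single-byte encoding packs an entire non-Latin alphabet into that upper half, which is what makes both the encoding and the language guessable.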
<ol start="4">
<li><p><a href="https://pypi.python.org/pypi/langdetect?" rel="noreferrer">langdetect</a> requires large portions of text. It uses a non-deterministic approach under the hood, which means you can get different results for the same text sample. According to the docs, you have to use the following code to make it deterministic:</p>
<pre><code>from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
</code></pre></li>
</ol>
<p><code>pip install langdetect</code></p>
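<p>A toy standard-library sketch (not langdetect's actual algorithm) of why fixing a seed matters: the hypothetical <code>guess</code> below samples characters at random, the same kind of nondeterminism that <code>DetectorFactory.seed = 0</code> pins down:</p>

```python
import random

# Hypothetical toy detector: randomly samples characters and counts how
# many fall in the Cyrillic Unicode block, mimicking randomized sampling.
def guess(text, seed=None):
    rng = random.Random(seed)
    sample = rng.choices(text, k=20)  # random sampling -> nondeterminism
    cyrillic = sum(1 for ch in sample if "\u0400" <= ch <= "\u04ff")
    return "ru" if cyrillic > 10 else "other"

text = "Я люблю вкусные пампушки hello world"
# Unseeded calls may disagree on borderline input; with a fixed seed the
# result is reproducible, just like langdetect after DetectorFactory.seed = 0.
print(guess(text, seed=0) == guess(text, seed=0))  # True
```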
<ol start="5">
<li><a href="https://bitbucket.org/spirit/guess_language" rel="noreferrer">guess_language</a> can detect very short samples by using <a href="https://pythonhosted.org/pyenchant/" rel="noreferrer">this</a> spell checker with dictionaries.</li>
</ol>
<p><code>pip install guess_language-spirit</code></p>
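<p>The dictionary-lookup idea behind guess_language can be sketched in a few lines of plain Python; the tiny word lists here are hypothetical stand-ins for the real enchant dictionaries:</p>

```python
# Toy illustration of dictionary-based detection: score a short sample
# against per-language word lists and pick the best-matching language.
WORDS = {
    "en": {"the", "is", "hello", "world", "and"},
    "fr": {"le", "est", "bonjour", "monde", "et"},
}

def dict_guess(text):
    tokens = text.lower().split()
    scores = {lang: sum(t in vocab for t in tokens)
              for lang, vocab in WORDS.items()}
    return max(scores, key=scores.get)

print(dict_guess("bonjour le monde"))  # fr
```

Because every token is checked against a full dictionary, even a two- or three-word sample can produce a confident match, which is why this approach works on much shorter input than the statistical detectors above.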
<ol start="6">
<li><p><a href="https://github.com/saffsd/langid.py" rel="noreferrer">langid</a> provides both a Python module</p>
<pre><code>import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)
</code></pre></li>
</ol>
<p>and a command-line tool:</p>
<pre><code> $ langid < README.md
</code></pre>
<p><code>pip install langid</code></p>
<ol start="7">
<li><p><a href="https://fasttext.cc" rel="noreferrer">FastText</a> is a text classifier that can be used to recognize 176 languages with the appropriate <a href="https://fasttext.cc/docs/en/language-identification.html" rel="noreferrer">models for language classification</a>. Download <a href="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz" rel="noreferrer">this model</a>, then:</p>
<pre><code>import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('الشمس تشرق', k=2)) # top 2 matching languages
# (('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))
</code></pre></li>
</ol>
<p><code>pip install fasttext</code></p>