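redditnlp Python package - PyPI

A tool for performing natural language processing on Reddit content.

redditnlp project description

A lightweight Python module for processing text on Reddit. It lets you analyze users, titles, and comments to understand their vocabulary. The module comes packaged with its own inverted index builder for storing vocabularies and word frequencies, so you can generate and manipulate tf-idf weighted words without worrying about the implementation. This is especially useful if you run scripts over long periods and want to save intermediate results.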
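To make the idea of an inverted index with tf-idf weighting concrete, here is a minimal sketch in plain Python. It is not the module's own implementation: the class name `TinyInvertedIndex`, its methods, and the particular tf-idf formula (relative term frequency times the natural log of the inverse document frequency) are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class TinyInvertedIndex:
    """Hypothetical stand-in for an inverted index; not redditnlp's own class."""

    def __init__(self):
        # term -> {document name -> raw count of the term in that document}
        self.postings = defaultdict(dict)
        # document name -> total number of tokens in that document
        self.doc_lengths = {}

    def add_document(self, word_counts, name):
        """Store the word counts of one document under the given name."""
        self.doc_lengths[name] = sum(word_counts.values())
        for term, count in word_counts.items():
            self.postings[term][name] = count

    def tfidf(self, term, name):
        """Assumed weighting: (count / document length) * ln(N / document frequency)."""
        docs_with_term = self.postings.get(term, {})
        if name not in docs_with_term:
            return 0.0
        tf = docs_with_term[name] / self.doc_lengths[name]
        idf = math.log(len(self.doc_lengths) / len(docs_with_term))
        return tf * idf


index = TinyInvertedIndex()
index.add_document({"cat": 3, "the": 5}, "aww")
index.add_document({"joke": 2, "the": 6}, "funny")
print(index.tfidf("cat", "aww"))  # positive: "cat" only occurs in "aww"
print(index.tfidf("the", "aww"))  # 0.0: "the" occurs in every document
```

Because the index is keyed by term rather than by document, looking up every document that contains a word is a single dictionary access, which is what makes this layout convenient for large vocabularies.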

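License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.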

Installation

Using pip or easy_install

You can install the package using pip or easy_install:

pip install redditnlp

Error: the required version of setuptools is not available

When running pip install or the setup.py script, you may get a message like this:

The required version of setuptools (>=0.7) is not available, and can't be installed while this script is running. Please install a more recent version first, using 'easy_install -U setuptools'.

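This happens because you have a very outdated version of the setuptools package. The redditnlp package normally bootstraps a newer version of setuptools during installation, but that does not work in this case. You need to update setuptools using easy_install -U setuptools (you may need to run this command with sudo).

If the above command does not work, your version of setuptools was probably installed by a package manager such as yum, apt or pip. Check your package manager for a package called python-setuptools, or try pip install setuptools --upgrade, and then re-run the installation.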
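Before re-running the installation, it can help to confirm which setuptools version Python actually sees. The following is a small sketch under stated assumptions: the helper names `version_tuple` and `is_recent_enough` are hypothetical, and only the leading numeric part of the version string is compared against the 0.7 minimum quoted in the error message.

```python
import re


def version_tuple(version):
    """Return the leading numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def is_recent_enough(version, minimum=(0, 7)):
    """True if the version is at least the minimum named in the error message."""
    return version_tuple(version) >= minimum


try:
    import setuptools
    installed = getattr(setuptools, "__version__", None)
except ImportError:
    installed = None

if installed is None:
    print("setuptools version could not be determined")
else:
    print(installed, "ok" if is_recent_enough(installed) else "too old")
```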

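Usage

A more complex sample program using the redditnlp module can be found at https://github.com/jaijuneja/reddit-nlp/blob/master/example.py. Here we outline a basic word counter application.

The module consists of three classes:

A basic word counter class, WordCounter, which performs tokenization and counting on input strings
A Reddit word counter, RedditWordCounter, which extends the WordCounter class to allow interaction with the Reddit API
A tf-idf corpus builder, TfidfCorpus, which allows large word corpora to be stored in an inverted index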
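As a rough illustration of what "tokenization and counting" means for the first class, here is a minimal sketch using only the standard library. The function `count_words` and its tokenization rule (lowercase runs of letters, with an optional apostrophe part) are assumptions for illustration; the module's WordCounter may tokenize differently.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, split it into word tokens and count each token."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


counts = count_words("The cat and the dog: the cat is asleep.")
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```

The result is a mapping from word to frequency, which is the kind of per-document data a tf-idf corpus can then be built from.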

These three classes can be instantiated as follows:

from redditnlp import WordCounter, RedditWordCounter, TfidfCorpus

word_counter = WordCounter()
reddit_counter = RedditWordCounter('your_username')
corpus = TfidfCorpus()

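To adhere to the Reddit API rules, we ask that you use your Reddit username in place of 'your_username' above.

For further information on the attributes and methods of these classes, you can run: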

help(WordCounter)
help(RedditWordCounter)
help(TfidfCorpus)

Next, we can tokenize 1000 comments from a selection of subreddits, extract the most common words and save all of the data to disk:

for subreddit in ['funny', 'aww', 'pics']:
    # Tokenize and count words for 1000 comments
    # (reddit_counter is the RedditWordCounter instance created above)
    word_counts = reddit_counter.subreddit_comments(subreddit, limit=1000)

    # Add the word counts to our corpus
    corpus.add_document(word_counts, subreddit)

# Save the corpus to a specified path (must be JSON)
corpus.save(path='word_counts.json')

# Save the top 50 words (by tf-idf score) from each subreddit to a text file
for document in corpus.get_document_list():
    top_words = corpus.get_top_terms(document, num_terms=50)
    # Append in text mode; the trailing newline keeps entries separate
    with open('top_words.txt', 'a') as f:
        f.write(document + '\n' + '\n'.join(top_words.keys()) + '\n')

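Machine learning

redditnlp now supports some of scikit-learn's machine learning capabilities. Several built-in functions allow the user to:

Convert a TfidfCorpus object into a scipy sparse feature matrix (using build_feature_matrix())
Train a classifier using the documents contained in a TfidfCorpus (using train_classifier()) and then classify new documents (with classify_document())

Below is an example of a simple machine learning application that loads a corpus of subreddit comment data, uses it to train a classifier, and determines which subreddit a user's comments most closely match: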

# Load the corpus of subreddit comment data and use it to train a classifier
corpus = TfidfCorpus('path/to/subreddit_corpus.json')
corpus.train_classifier(classifier_type='LinearSVC', tfidf=True)

# Tokenize all of your comments
counter = RedditWordCounter('your_username')
user_comments = counter.user_comments('your_username')

# Classify your comments against the documents in the corpus
print(corpus.classify_document(user_comments))
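To show what "feature matrix" and "classify" mean here without depending on scikit-learn, the sketch below builds a dense count matrix from word-count dictionaries and picks the most similar document by cosine similarity. This is a deliberately simpler stand-in, not the LinearSVC classifier the module trains; the function names `count_matrix`, `cosine` and `classify` and the sample data are hypothetical.

```python
import math


def count_matrix(documents):
    """Turn {name: {term: count}} into (names, vocabulary, dense rows)."""
    names = sorted(documents)
    vocabulary = sorted({term for counts in documents.values() for term in counts})
    rows = [[documents[name].get(term, 0) for term in vocabulary] for name in names]
    return names, vocabulary, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(word_counts, documents):
    """Return the name of the document whose count vector is most similar."""
    names, vocabulary, rows = count_matrix(documents)
    query = [word_counts.get(term, 0) for term in vocabulary]
    scores = [cosine(query, row) for row in rows]
    return names[scores.index(max(scores))]


sample_corpus = {
    "aww": {"cat": 4, "cute": 3, "dog": 2},
    "funny": {"joke": 5, "lol": 3, "cat": 1},
}
print(classify({"cute": 2, "dog": 1}, sample_corpus))  # aww
```

Each row of the matrix is one document and each column one vocabulary term. A real corpus would store this sparsely, since most terms are absent from most documents.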

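Multiprocessing

redditnlp uses the PRAW Reddit API wrapper. It supports multiprocessing, so you can run multiple instances of RedditWordCounter without exceeding Reddit's rate limit. There is more information about this in the PRAW documentation, but for completeness an example is provided below.

First, you must initialize a request handling server on your local machine. This is done from the terminal/command line: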

praw-multiprocess

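Next, you can instantiate multiple RedditWordCounter objects and set the parameter multiprocess=True so that outgoing API calls are handled by the request handling server: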

counter = RedditWordCounter('your_username', multiprocess=True)
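The reason a single request handler helps is that all clients share one view of when the next request is allowed. The sketch below illustrates that idea inside one process. It is not how praw-multiprocess is implemented: the class `SharedRateLimiter`, the `FakeClock` helper and the 2-second interval are assumptions for illustration only.

```python
import time


class SharedRateLimiter:
    """Hypothetical limiter: at most one request per `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may be sent

    def wait_for_slot(self):
        """Block until a request may be sent, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
        return now


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = SharedRateLimiter(2.0, clock=fake.time, sleep=fake.sleep)
# Three "clients" asking for a slot at the same moment are spaced 2 s apart.
slots = [limiter.wait_for_slot() for _ in range(3)]
print(slots)  # [0.0, 2.0, 4.0]
```

With the default `time.monotonic` and `time.sleep` arguments the same class would pause for real. Sharing it across separate processes would additionally need a server or another form of inter-process coordination, which is the role the request handling server plays.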

Contact

If you have any questions or have encountered an error, feel free to contact me at jai -dot- juneja -at- gmail -dot- com.


redditnlp 0.1.3
