Tools giving data scientists convenient access to prepared data.
Detailed description of the odus Python project
# %load_ext autoreload
# %autoreload 2
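Introduction
ODUS (Older Drug Users Study) contains data and tools for studying the drug use of older drug users.
Essentially, these are tools to:
- get prepared data on 119 "trajectories", one per respondent, each describing 31 variables (drug use, social, etc.)
- visualize these trajectories in various ways
- create PDFs of any selection of these trajectories and variables
- make count tables for any combination of the variables: the basic step of any Markov or Bayesian analysis
- make probability tables (joint or conditional) from any combination of the variables
- operate on these count and probability tables, thus enabling inference operations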
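Installation
You need Python 3.7+ to run this notebook.
You also need odus, which you get by installing it with pip.
(And if you don't have pip, then... how to put it... ha ha ha!)
But if you are the type, you can also get the source code from https://github.com/thorwhalen/odus
Oh, and pull requests and the like are welcome!
Stars, likes, mentions, and coffee are also welcome.
And if you want to donate: give to a charity that helps people understand and shape policy on substance use.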
A simple flow diagram of the architecture:
Getting some resources
from matplotlib.pylab import *
from numpy import *
import seaborn as sns
import os
from py2store.stores.local_store import RelativePathFormatStore
from py2store.mixins import ReadOnlyMixin
from py2store.base import Store
from io import BytesIO
from spyn.ppi.pot import Pot, ProbPot
from collections import UserDict, Counter
import numpy as np
import pandas as pd
from ut.ml.feature_extraction.sequential_var_sets import PVar, VarSet, DfData, VarSetFactory
from IPython.display import Image
from odus.analysis_utils import *
from odus.dacc import DfStore, counts_of_kps, Dacc, VarSetCountsStore, \
    mk_pvar_struct, PotStore, _commun_columns_of_dfs, Struct, mk_pvar_str_struct, VarStr
from odus.plot_utils import plot_life_course
from odus import data_dir, data_path_of
survey_dir = data_dir
data_dir
'/D/Dropbox/dev/p3/proj/odus/odus/data'
df_store = DfStore(data_dir + '/{}.xlsx')
len(df_store)
cstore = VarSetCountsStore(df_store)
v = mk_pvar_struct(df_store, only_for_cols_in_all_dfs=True)
s = mk_pvar_str_struct(v)
f, df = cstore.df_store.head()
pstore = PotStore(df_store)
Poking around
df_store
df_store is a key-value store where each key is an xlsx file and each value is the prepared dataframe.
len(df_store)
119
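To make the key-value idea concrete, here is a minimal sketch of a read-only store whose keys are file names and whose values are the loaded file contents. The `FileStore` class, its parameters, and the `read_text` loader are hypothetical names made up for illustration; this is not how `DfStore` is implemented, and it reads plain text files rather than spreadsheets so that it runs with the standard library alone.

```python
import os
import tempfile
from collections.abc import Mapping


class FileStore(Mapping):
    """Read-only mapping from file names to loaded file contents.

    Hypothetical illustration only; this is not odus's DfStore.
    """

    def __init__(self, rootdir, extension, loader):
        self.rootdir = rootdir
        self.extension = extension
        self.loader = loader  # function: full path -> value

    def __iter__(self):
        # Keys are the names of the files that carry the expected extension.
        for name in sorted(os.listdir(self.rootdir)):
            if name.endswith(self.extension):
                yield name

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, key):
        path = os.path.join(self.rootdir, key)
        if not key.endswith(self.extension) or not os.path.isfile(path):
            raise KeyError(key)
        return self.loader(path)  # the file is only read when it is asked for


def read_text(path):
    with open(path) as f:
        return f.read()


# Demo on a throwaway directory with two small text files.
demo_dir = tempfile.mkdtemp()
for name, content in [("a.txt", "first"), ("b.txt", "second")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(content)

store = FileStore(demo_dir, ".txt", read_text)
print(len(store), list(store), store["b.txt"])
```

Because the class derives from `Mapping`, it gets `keys()`, `values()`, `items()` and `in` for free, which is what makes calls such as `len(df_store)` and `df_store.values()` feel like ordinary dictionary operations.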
it = iter(df_store.values())
for i in range(5):  # skip five first
    _ = next(it)
df = next(it)  # get the one I want
df.head(3)
print(df.columns.values)
['RURAL' 'SUBURBAN' 'URBAN/CITY' 'HOMELESS' 'INCARCERATION' 'WORK'
'SON/DAUGHTER' 'SIBLING' 'FATHER/MOTHER' 'SPOUSE'
'OTHER (WHO?, FILL IN BRACKETS HERE)' 'FRIEND USER' 'FRIEND NON USER'
'MENTAL ILLNESS' 'PHYSICAL ILLNESS' 'LOSS OF LOVED ONE' 'TOBACCO'
'MARIJUANA' 'ALCOHOL' 'HAL/LSD/XTC/CLUBDRUG' 'COCAINE/CRACK'
'METHAMPHETAMINE' 'AS PRESCRIBED OPIOID' 'NOT AS PRESCRIBED OPIOID'
'HEROIN' 'OTHER OPIOID' 'INJECTED' 'IN TREATMENT' 'Selects States below'
'Georgia' 'Pennsylvania']
t = df[['ALCOHOL', 'TOBACCO']]
t.head(3)
c = Counter()
for i, r in t.iterrows():
    c.update([tuple(r.to_list())])
c
Counter({(0, 0): 6, (1, 0): 4, (1, 1): 9, (0, 1): 2})
def count_tuples(dataframe):
    c = Counter()
    for i, r in dataframe.iterrows():
        c.update([tuple(r.to_list())])
    return c

fields = ['ALCOHOL', 'TOBACCO']
# do it for every one
c = Counter()
for df in df_store.values():
    c.update(count_tuples(df[fields]))
c
Counter({(0, 1): 903, (1, 1): 1343, (0, 0): 240, (1, 0): 179})
pd.Series(c)
# Powerful! You can use that with several pairs and get some nice probabilities. Look up Naive Bayes.
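As a small illustration of how such pair counts turn into probabilities, the sketch below normalizes the pooled (ALCOHOL, TOBACCO) counts shown above into conditional probabilities of the second variable given the first. The helper name `conditional_on_first` is made up for this example and is not part of odus; conditional tables like this one are the ingredients a Naive Bayes model is built from.

```python
from collections import Counter, defaultdict

# Pooled (ALCOHOL, TOBACCO) counts, copied from the output above.
pair_counts = Counter({(0, 1): 903, (1, 1): 1343, (0, 0): 240, (1, 0): 179})


def conditional_on_first(counts):
    """Return P(second = b | first = a) for every (a, b) pair in counts."""
    totals = defaultdict(int)
    for (a, _b), n in counts.items():
        totals[a] += n  # how often the first variable took the value a
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}


p_tobacco_given_alcohol = conditional_on_first(pair_counts)
for (a, b), p in sorted(p_tobacco_given_alcohol.items()):
    print(f"P(TOBACCO={b} | ALCOHOL={a}) = {p:.6f}")
```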
Viewing trajectories
import itertools
from functools import partial
from odus.util import write_images
from odus.plot_utils import plot_life, life_plots, write_trajectories_to_file

ihead = lambda it: itertools.islice(it, 0, 5)
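The `islice`-based helper above takes the first few items of an iterator without consuming the rest, which suits generators that do expensive work per item. The following sketch shows that lazy behavior with a stand-in for a plotting call; `lazy_items` and `fake_render` are hypothetical names for this example and say nothing about how `life_plots` is written.

```python
import itertools


def lazy_items(keys, render):
    """Yield render(key) one key at a time; nothing runs until next() is called."""
    for key in keys:
        yield render(key)


rendered = []  # records which keys have actually been processed


def fake_render(key):
    # Stand-in for an expensive plotting call (hypothetical).
    rendered.append(key)
    return f"plot of {key}"


gen = lazy_items(["r1", "r2", "r3", "r4"], fake_render)
first = next(gen)                          # processes only "r1"
next_two = list(itertools.islice(gen, 2))  # processes "r2" and "r3"
print(first, next_two, rendered)
```

The fourth key is never rendered unless someone asks for it, which is the point of flipping through items one `next` at a time.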
Viewing a single trajectory
k = next(iter(df_store))  # get the first key
print(f"k: {k}")  # print it
plot_life(df_store[k])  # plot the trajectory
k: surveys/B24.xlsx
plot_life(df_store[k], fields=[s.in_treatment, s.injected])  # only want two fields
Flipping through all (or some) of the trajectories
gen = life_plots(df_store)
next(gen)  # launch to get the next trajectory
<matplotlib.axes._subplots.AxesSubplot at 0x12b21f070>
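Get several trajectories, restricted to a chosen subset of fields (the commented-out line selects two fields; the active line selects four, over the first ten keys).
# fields = [s.in_treatment, s.injected]
fields = [s.physical_illness, s.as_prescribed_opioid, s.heroin, s.other_opioid]
keys = list(df_store)[:10]
# print(f"keys={keys}")
axs = [x for x in life_plots(df_store, fields, keys=keys)];
Making a PDF of trajectories
[...]
write_trajectories_to_file(df_store, fp='all_respondents_all_fields.pdf');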
Demo s and v
print(list(filter(lambda x: not x.startswith('__'), dir(s))))
['alcohol', 'as_prescribed_opioid', 'cocaine_crack', 'father_mother', 'hal_lsd_xtc_clubdrug', 'heroin', 'homeless', 'in_treatment', 'incarceration', 'injected', 'loss_of_loved_one', 'marijuana', 'mental_illness', 'methamphetamine', 'not_as_prescribed_opioid', 'other_opioid', 'physical_illness', 'rural', 'sibling', 'son_daughter', 'suburban', 'tobacco', 'urban_city', 'work']
[...]
'HEROIN'
v.heroin
PVar('HEROIN', 0)
v.heroin-1
PVar('HEROIN', -1)
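The outputs above suggest that subtracting 1 from a variable yields the same variable shifted one time step back. A minimal sketch of that idea, assuming a variable is just a column name plus an integer offset, could look as follows. `LaggedVar` and `count_states` are hypothetical stand-ins written for this example (they are not `PVar` or the counting code in odus), and the toy trajectory is made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LaggedVar:
    """A column name plus a time offset (0 = current row, -1 = previous row).

    Hypothetical stand-in for illustration; not the PVar class itself.
    """
    name: str
    lag: int = 0

    def __sub__(self, steps):
        return LaggedVar(self.name, self.lag - steps)

    def __add__(self, steps):
        return LaggedVar(self.name, self.lag + steps)


def count_states(rows, variables):
    """Count joint states of the given lagged variables over a list of row dicts."""
    start = max(0, -min(v.lag for v in variables))
    stop = len(rows) - max(0, max(v.lag for v in variables))
    counts = Counter()
    # Only visit rows for which every lagged lookup stays inside the data.
    for t in range(start, stop):
        counts[tuple(rows[t + v.lag][v.name] for v in variables)] += 1
    return counts


heroin = LaggedVar("HEROIN")
rows = [{"HEROIN": x} for x in [0, 0, 1, 1, 0]]  # made-up toy trajectory
print(heroin - 1)
print(count_states(rows, [heroin - 1, heroin]))
```

Counting `(heroin - 1, heroin)` pairs row by row is what turns a single trajectory into transition counts of the kind a Markov analysis starts from.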
cstore
# cstore[v.alcohol, v.tobacco]
cstore[v.as_prescribed_opioid - 1, v.heroin]
Counter({(0, 0): 1026, (1, 0): 264, (0, 1): 1108, (1, 1): 148})
pd.Series(cstore[v.as_prescribed_opioid - 1, v.heroin])
0  0    1026
1  0     264
0  1    1108
1  1     148
dtype: int64
cstore[v.alcohol,v.tobacco,v.heroin]
Counter({(0, 0, 1): 427,
(1, 0, 1): 656,
(1, 1, 1): 687,
(0, 0, 0): 189,
(0, 1, 1): 476,
(0, 1, 0): 51,
(1, 0, 0): 133,
(1, 1, 0): 46})
cstore[v.alcohol-1,v.alcohol]
Counter({(0, 0): 994, (1, 1): 1375, (1, 0): 90, (0, 1): 87})
cstore[v.alcohol-1,v.alcohol,v.tobacco]
Counter({(0, 0, 1): 807,
(1, 1, 1): 1220,
(1, 0, 0): 26,
(0, 1, 1): 76,
(0, 0, 0): 187,
(1, 1, 0): 155,
(0, 1, 0): 11,
(1, 0, 1): 64})
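The two- and three-variable counts above are consistent with each other: summing the (ALCOHOL-1, ALCOHOL, TOBACCO) counts over TOBACCO gives back the (ALCOHOL-1, ALCOHOL) counts. The sketch below checks this with a small helper, `marginalize`, which is a name introduced here for illustration and not an odus function.

```python
from collections import Counter

# Counts for (ALCOHOL-1, ALCOHOL, TOBACCO), copied from the output above.
triple = Counter({
    (0, 0, 1): 807, (1, 1, 1): 1220, (1, 0, 0): 26, (0, 1, 1): 76,
    (0, 0, 0): 187, (1, 1, 0): 155, (0, 1, 0): 11, (1, 0, 1): 64,
})
# Counts for (ALCOHOL-1, ALCOHOL), copied from the earlier output.
pair = Counter({(0, 0): 994, (1, 1): 1375, (1, 0): 90, (0, 1): 87})


def marginalize(counts, keep):
    """Sum counts over every position not listed in keep."""
    out = Counter()
    for states, n in counts.items():
        out[tuple(states[i] for i in keep)] += n
    return out


# Keep positions 0 and 1 (ALCOHOL-1, ALCOHOL), summing TOBACCO out.
print(marginalize(triple, keep=(0, 1)) == pair)
```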
[...]
<pandas.core.indexing._LocIndexer at 0x130955db0>
pstore
[...]
t / []
                       pval
ALCOHOL-1 ALCOHOL
0         0        0.390416
          1        0.034171
1         0        0.035350
          1        0.540063
t[s.alcohol-1]
           pval
ALCOHOL-1
0          1081
1          1465
[...]
                       pval
ALCOHOL-1 ALCOHOL
0         0        0.919519
          1        0.080481
1         0        0.061433
          1        0.938567
tt = pstore[s.alcohol, s.tobacco]
tt
                 pval
ALCOHOL TOBACCO
0       0         240
        1         903
1       0         179
        1        1343
tt / tt[s.alcohol]
                     pval
ALCOHOL TOBACCO
0       0        0.209974
        1        0.790026
1       0        0.117608
        1        0.882392
tt / tt[s.tobacco]
                     pval
ALCOHOL TOBACCO
0       0        0.572792
1       0        0.427208
0       1        0.402048
1       1        0.597952
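The tables above read as follows: indexing a table by a variable sums out the others, dividing by `[]` normalizes to a joint distribution, and dividing by a marginal gives a conditional distribution. Here is a minimal sketch of how such operators could be built with `__getitem__` and `__truediv__`. The `Table` class is a toy written for this explanation, under the assumption that this is the intended semantics; it is not the `Pot` class from spyn.

```python
from collections import defaultdict


class Table:
    """Toy table over named binary variables (hypothetical; not spyn's Pot)."""

    def __init__(self, variables, values):
        self.variables = tuple(variables)
        self.values = dict(values)  # tuple of states -> number

    def __getitem__(self, keep):
        """t[var] or t[[var1, var2]]: sum out every other variable."""
        keep = [keep] if isinstance(keep, str) else list(keep)
        idx = [self.variables.index(v) for v in keep]
        out = defaultdict(float)
        for states, val in self.values.items():
            out[tuple(states[i] for i in idx)] += val
        return Table(keep, out)

    def __truediv__(self, other):
        """t / [] normalizes to a joint table; t / t[var] conditions on var."""
        if isinstance(other, list) and not other:
            other = self[[]]  # the table over no variables holds the grand total
        idx = [self.variables.index(v) for v in other.variables]
        return Table(self.variables, {
            states: val / other.values[tuple(states[i] for i in idx)]
            for states, val in self.values.items()
        })


# (ALCOHOL, TOBACCO) counts copied from the output above.
tt = Table(["ALCOHOL", "TOBACCO"],
           {(0, 0): 240, (0, 1): 903, (1, 0): 179, (1, 1): 1343})
joint = tt / []
given_tobacco = tt / tt["TOBACCO"]
print(round(joint.values[(1, 1)], 6), round(given_tobacco.values[(0, 0)], 6))
```

Treating "normalize" as "divide by the marginal over no variables" is a design choice that lets one division rule cover both the joint and the conditional case.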
Scrap place
t = pstore[s.as_prescribed_opioid - 1, s.heroin - 1, s.heroin]
t
                                        pval
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN
0                      0        0        927
                                1        172
                       1        0         99
                                1        936
1                      0        0        249
                                1         33
                       1        0         15
                                1        115
[...]
                                            pval
AS PRESCRIBED OPIOID-1 HEROIN-1 HEROIN
0                      0        0       0.843494
                                1       0.156506
                       1        0       0.095652
                                1       0.904348
1                      0        0       0.882979
                                1       0.117021
                       1        0       0.115385
                                1       0.884615
tt.tb
AS PRESCRIBED OPIOID-1  HEROIN-1  HEROIN      pval
                     0         0       0  0.843494
                     0         0       1  0.156506
                     1         0       0  0.882979
                     1         0       1  0.117021
0.117021 / 0.156506
[...]
0.6918605658949217
prob_of_heroin_given_not_presc_op/prob_of_heroin_given_presc_op
1.4453779407220584
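The cells defining `prob_of_heroin_given_presc_op` and `prob_of_heroin_given_not_presc_op` are not shown, so the following is only a plausible reconstruction. It assumes they stand for P(HEROIN = 1 | AS PRESCRIBED OPIOID-1 = 1) and P(HEROIN = 1 | AS PRESCRIBED OPIOID-1 = 0), and estimates both from the (AS PRESCRIBED OPIOID-1, HEROIN) counts printed in the cstore section. The function name is introduced here for illustration.

```python
from collections import Counter

# Counts for (AS PRESCRIBED OPIOID-1, HEROIN), copied from the cstore output earlier.
counts = Counter({(0, 0): 1026, (1, 0): 264, (0, 1): 1108, (1, 1): 148})


def prob_heroin_given_prior_presc(counts, presc):
    """P(HEROIN = 1 | AS PRESCRIBED OPIOID-1 = presc), estimated from counts."""
    return counts[(presc, 1)] / (counts[(presc, 0)] + counts[(presc, 1)])


p_not_presc = prob_heroin_given_prior_presc(counts, 0)  # 1108 / 2134
p_presc = prob_heroin_given_prior_presc(counts, 1)      # 148 / 412
print(p_not_presc / p_presc, p_presc / p_not_presc)
```

Under that assumption the two ratios agree with the values shown above to about five decimal places; the small remaining gap would be consistent with the original having divided rounded probabilities, though that is a guess.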
Calculus experiments
# survey_dir = '/D/Dropbox/others/Miriam/python/ProcessedSurveys'
df_store = DfStore(survey_dir + '/{}.xlsx')
len(df_store)
119
[...]
                              pval
HOMELESS-1 INCARCERATION
0          0              0.663786
           1              0.226630
1          0              0.075412
           1              0.034171
pstore[v.incarceration]
[...]
                             pval
ALCOHOL-1 LOSS OF LOVED ONE
0         0                   990
          1                    91
1         0                  1321
          1                   144
[...]
w / []
[...]
(evid_m * mw) / []
                    pval
MARIJUANA WORK
1         0     0.350603
          1     0.649397
(evid_t * tw) / []
                  pval
TOBACCO WORK
1       0     0.313001
        1     0.686999
(evid_a * aw) / []
                 pval
ALCOHOL WORK
1       0     0.29435
        1     0.70565
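The definitions of the `evid_*` and `*w` tables are not shown, but the outputs above keep only the rows where the first variable equals 1 and sum to one, which is what multiplying a joint table by an indicator "evidence" table and then normalizing would produce. The sketch below illustrates that reading with the (ALCOHOL, TOBACCO) counts from the pstore section, since the WORK counts are not available here. `multiply`, `normalize`, and the form of `evid_alcohol` are assumptions made for this example.

```python
def multiply(table, evidence):
    """Pointwise product of a joint table with a one-variable evidence table.

    table: dict mapping (first_state, second_state) -> number
    evidence: dict mapping first_state -> weight (a missing state counts as 0)
    """
    return {(a, b): val * evidence.get(a, 0) for (a, b), val in table.items()}


def normalize(table):
    """Drop zero entries and rescale the rest so they sum to 1."""
    total = sum(table.values())
    return {k: v / total for k, v in table.items() if v}


# (ALCOHOL, TOBACCO) counts copied from the pstore output earlier.
tt = {(0, 0): 240, (0, 1): 903, (1, 0): 179, (1, 1): 1343}
evid_alcohol = {1: 1.0}  # assumed form of evidence: ALCOHOL is observed to be 1

posterior = normalize(multiply(tt, evid_alcohol))
print({k: round(v, 6) for k, v in posterior.items()})
```

The result is the distribution of TOBACCO given ALCOHOL = 1, and it matches the ALCOHOL = 1 rows of the conditional table computed earlier from the same counts.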
Extra scrap
# from graphviz import Digraph
# Digraph(body="""
# raw -> data -> count -> prob
# raw [label="excel files (one per respondent)" shape=folder]
# data [label="dataframes" shape=folder]
# count [label="counts for any combinations of the variables in the data" shape=box3d]
# prob [label="probabilities for any combinations of the variables in the data" shape=box3d]
# """.split('\n'))