Python encode-dataframe package - PyPI

Convert UCSC's ENCODE metadata into a pandas DataFrame

encode-dataframe project description

I wanted a better way to explore and download the raw data from the ENCODE project.

For example, I wanted to get all the BAM files for uninduced MEL cells (from the mm9 assembly).

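One strategy is to go through each track hub individually (for example, histone mods from LICR, http://genome.cit.nih.gov/cgi-bin/hgFileUi?db=mm9&g=wgEncodeLicrHistone), filtering the data and downloading the files hub by hub.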

Another strategy is to go directly to the downloads page (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeLicrHistone/) and pull out the files ending in .bam.

This small package takes advantage of the files.txt file (here is an example), which describes all the metadata for the files on a downloads page.

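A files.txt file is downloaded from each ENCODE track hub in the assembly of interest. These files are then parsed and concatenated into one big pandas.DataFrame that can be used to find the data you care about.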
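To make the parsing step concrete, here is a minimal sketch using only the standard library. It assumes each files.txt line has the form `filename<TAB>key=value; key=value; ...`, which is an assumption about the file layout rather than something stated above. The function names and the sample lines are hypothetical and are not part of the encode-dataframe API.

```python
def parse_files_txt_line(line):
    """Turn one files.txt line into a dict of metadata fields.

    Assumed layout: "<filename>\t<key>=<value>; <key>=<value>; ..."
    """
    filename, _, metadata = line.rstrip("\n").partition("\t")
    record = {"filename": filename}
    for field in metadata.split(";"):
        key, sep, value = field.strip().partition("=")
        if sep:  # skip empty or malformed fields that lack "="
            record[key] = value
    return record


def parse_files_txt(lines, base_url):
    """Parse all lines and attach a download URL to each record."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = parse_files_txt_line(line)
        record["url"] = base_url.rstrip("/") + "/" + record["filename"]
        records.append(record)
    return records


# Made-up sample lines, for illustration only.
example_lines = [
    "exampleRep1.bam\tcell=MEL; dataType=ChipSeq; type=bam; replicate=1\n",
    "exampleRep2.bam\tcell=MEL; dataType=ChipSeq; type=bam; replicate=2\n",
]
base = "http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeLicrHistone/"
records = parse_files_txt(example_lines, base)
```

A list of dicts like `records` can be handed to `pandas.DataFrame(records)`. Keys missing from some records become NaN in the resulting columns, which is why null checks such as `isnull()` are useful when filtering.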

Installation

pip install encode-dataframe

Usage

Mirror the metadata files. This may take a minute or so. If you cloned the git repo, you already have a copy of the mm9 files.

>>> import encode_dataframe as edf
>>> edf.mirror_metadata_files('mm9')

Create the big dataframe:

>>> df = edf.encode_dataframe('mm9')

>>> len(df)
5865

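With the dataframe in hand, we can now slice and dice to get the data we care about. Eventually I want to run a chromHMM segmentation on MEL cells, but I need to get the data first...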

Select the cell type:

>>> interesting = df.cell == 'MEL'

Only BAM files:

>>> interesting &= df.type == 'bam'

Only ChIP-seq or DNase-seq data:

>>> interesting &= df.dataType.isin(['ChipSeq', 'DnaseSeq'])

Only untreated (in this case, uninduced) cells:

>>> interesting &= df.treatment != 'DMSO_2.0pct'

Only replicate 1 (some have 2 or 3):

>>> interesting &= df.replicate == '1'

Only those without issues (it looks like older versions have some text in the objStatus field):

>>> interesting &= df.objStatus.isnull()

How many files are we working with?

>>> m = df[interesting]
>>> len(m)
60
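The filtering above builds one boolean mask and narrows it step by step with `&=`. The sketch below repeats the pattern on a tiny made-up table (the rows are invented for illustration and are not taken from the ENCODE metadata). It also shows a detail worth knowing: a `!=` comparison evaluates to True for missing values, so rows with no treatment recorded are kept by the treatment filter.

```python
import numpy as np
import pandas as pd

# Toy metadata table; values are invented for illustration.
toy = pd.DataFrame({
    "cell": ["MEL", "MEL", "MEL", "CH12"],
    "type": ["bam", "bam", "bigWig", "bam"],
    "treatment": [np.nan, "DMSO_2.0pct", np.nan, np.nan],
})

# Start with one condition, then narrow the mask with each further one.
keep = toy["cell"] == "MEL"
keep &= toy["type"] == "bam"
# NaN != "DMSO_2.0pct" is True, so untreated rows (NaN) stay selected.
keep &= toy["treatment"] != "DMSO_2.0pct"

selected = toy[keep]
```

Bracket access (`toy["type"]`) is used here instead of attribute access (`toy.type`) only as a defensive habit: it keeps working even when a column name happens to collide with a DataFrame attribute.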

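Some of these are controls (input or IgG), and some are duplicates (it looks like the H3K4me3 ChIP-seq was run with two different controls, and CTCF was done by different groups). How many unique antibodies are there?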

>>> len(m.antibody.unique())
46

Here are the files I should download:

>>> urls = m.url.values
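Once the URLs are in hand, fetching them can be sketched as follows. This is one possible approach using only the standard library, not a feature of encode-dataframe. The names `local_path_for` and `download_all`, and the `fetch` parameter, are hypothetical helpers introduced for this example.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def local_path_for(url, dest_dir):
    """Map a URL to a file in dest_dir, named after the URL's last path part."""
    return os.path.join(dest_dir, os.path.basename(urlsplit(url).path))


def download_all(urls, dest_dir, fetch=urlretrieve):
    """Download each URL into dest_dir, skipping files that already exist.

    `fetch` is any callable taking (url, path); it defaults to urlretrieve
    and can be swapped out, for example to try the logic without a network.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for url in urls:
        target = local_path_for(url, dest_dir)
        if os.path.exists(target):
            continue  # already downloaded on a previous run
        partial = target + ".part"
        fetch(url, partial)          # write to a temporary name first
        os.replace(partial, target)  # rename only after the fetch finished
        downloaded.append(target)
    return downloaded


# Example call (not executed here; BAM files are large):
# download_all(m.url.values, "bam_files")
```

Writing to a `.part` file and renaming it afterwards means an interrupted download is not mistaken for a finished one on the next run.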

encode-dataframe 0.1