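A convenience package for accessing data from the UCSC ENCODE (Encyclopedia of DNA Elements) project
PyENCODE: detailed description of the Python project
This is a convenience package for accessing the raw data of the ENCODE (Encyclopedia of DNA Elements) project.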
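The raw ENCODE files are organized in a fairly simple structure under this URL (ROOT_URL below). The files are divided into collections ("composites"), each in its own subdirectory. Each collection's subdirectory keeps the metadata for all of its files in a text file named files.txt. For example, the genome segmentation data is stored under ROOT_URL/wgEncodeAwgSegmentation. In particular, the segmentation obtained with the Combined method for the K562 cell line is stored as a compressed BED file named ROOT_URL/wgEncodeAwgSegmentation/wgEncodeAwgSegmentationCombinedK562.bed.gz.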
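The layout described above can be sketched as a few URL helpers. This is only an illustration: the root address below is a hypothetical placeholder (the real ROOT_URL is not spelled out here), the helper names are made up, and the rule "file name = wgEncode + collection + file" is an assumption inferred from the single example above.

```python
# Hypothetical root; the real ROOT_URL is not given in this text.
ROOT_URL = "http://example.org/encode"

# Prefix shared by the collection and file names in the example above.
PREFIX = "wgEncode"


def collection_url(collection, root_url=ROOT_URL):
    """URL of a collection's subdirectory, e.g. .../wgEncodeAwgSegmentation."""
    return "%s/%s%s" % (root_url.rstrip("/"), PREFIX, collection)


def metadata_url(collection, root_url=ROOT_URL):
    """URL of the files.txt metadata file of a collection."""
    return collection_url(collection, root_url) + "/files.txt"


def file_url(collection, name, extension, root_url=ROOT_URL):
    """URL of one file (assumed naming: prefix + collection + name)."""
    file_name = "%s%s%s.%s" % (PREFIX, collection, name, extension)
    return "%s/%s" % (collection_url(collection, root_url), file_name)


print(metadata_url("AwgSegmentation"))
print(file_url("AwgSegmentation", "CombinedK562", "bed.gz"))
```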
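In principle, downloading and reading the files is fairly straightforward. Nonetheless, this package provides a somewhat more streamlined way to list files, cache them, and read file metadata. For example, the following code downloads the file mentioned above into the cache, then opens it and indexes it as an interval tree: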
>> from pyencode import Encode
>> e = Encode(cache_dir = 'wgEncode')
>> gtree = e.AwgSegmentation.CombinedK562.fetch().read_as_intervaltree()
As another example, here is how to list all files in the AwgSegmentation collection and pre-download them into the cache:
>> for f in e.AwgSegmentation:
>>     print("%s-%s" % (f['cell'], f['dataType']))
>>     f.fetch()
Installation
The simplest way to install most Python packages is via easy_install or pip:
$ pip install PyENCODE
Usage
The main object provided by the package is pyencode.Encode. Create an instance, specifying the root of the cache directory:
>> from pyencode import Encode
>> e = Encode(cache_dir = 'wgEncode')
The default value of cache_dir is ~/.pyencode. The resulting object acts as a dictionary whose keys are ENCODE collection names:
>> e['AwgSegmentation']
Alternatively, attribute access can be used instead of dictionary keys, i.e. e['AwgSegmentation'] is the same as e.AwgSegmentation. To iterate over all collections, simply do:
>> for c in e:
>>     print(c.name)
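The access pattern above (key lookup, attribute lookup, and iteration that yields the elements themselves) can be mimicked with a few special methods. The class below is a hypothetical stand-in written for illustration; it is not PyENCODE's actual implementation, and the names Namespace and Item are made up.

```python
class Namespace(object):
    """Hypothetical container: items reachable as obj['key'] or obj.key."""

    def __init__(self, items):
        self._items = dict(items)

    def __getitem__(self, key):
        return self._items[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes such as _items are never shadowed.
        try:
            return self.__dict__["_items"][name]
        except KeyError:
            raise AttributeError(name)

    def __iter__(self):
        # Unlike a plain dict, iteration yields the stored objects,
        # which matches the "for c in e: print(c.name)" usage.
        return iter(self._items.values())


class Item(object):
    """Made-up element type with a name attribute."""

    def __init__(self, name):
        self.name = name


e = Namespace({"AwgSegmentation": Item("AwgSegmentation")})
assert e["AwgSegmentation"] is e.AwgSegmentation
for c in e:
    print(c.name)
```

Routing attribute access through __getattr__ rather than setting real attributes keeps the stored names from clashing with the container's own methods.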
Each element of the Encode object is an EncodeCollection object, which acts as a collection of EncodeFile elements:
>> for f in e.AwgSegmentation:
>>     print(f.name)
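Similarly, dictionary-style or attribute access can be used to retrieve a file from a collection: e.AwgSegmentation['CombinedK562'] or e.AwgSegmentation.CombinedK562.
Each EncodeFile is a dictionary of the file's metadata fields: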
>> print(e.AwgSegmentation.CombinedK562['cell'])
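To illustrate where such metadata fields could come from, here is a rough sketch that turns one line of a files.txt-style listing into a dictionary. The line format (file name, a tab, then "key=value" pairs separated by semicolons) is an assumption, the sample line is made up, and parse_files_txt_line is a hypothetical helper rather than part of the package.

```python
def parse_files_txt_line(line):
    """Split one assumed-format line into (file name, metadata dict)."""
    name, _, fields = line.rstrip("\n").partition("\t")
    metadata = {}
    for field in fields.split(";"):
        key, sep, value = field.strip().partition("=")
        if sep:  # skip empty or malformed fragments
            metadata[key] = value
    return name, metadata


# Made-up sample line, not copied from a real files.txt.
sample = ("wgEncodeAwgSegmentationCombinedK562.bed.gz\t"
          "cell=K562; dataType=Combined; type=bed\n")

name, meta = parse_files_txt_line(sample)
print(name)
print(meta["cell"])
```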
In addition, EncodeFile provides a set of convenient fields and methods:
- fetch() - Download the file into the cache. Returns the EncodeFile object for convenient chaining of calls. When force is False, the file is not redownloaded if it is already in the cache.
- ^{tt25}$ - Set of all file attributes that can be accessed via ^{tt26}$.
- ^{tt27}$ - Return the URL of the file online.
- ^{tt28}$ - The URL of the cached copy. It is not guaranteed that the file exists, so it is often more practical to do ^{tt29}$.
- ^{tt30}$ - Return the path of the locally cached copy. It is not guaranteed that the file exists.
- ^{tt31}$ - Open the file in binary mode for reading. If the file is not in the cache, it is not downloaded to the cache but opened directly from the web (so it is often more practical to do ^{tt32}$).
- ^{tt33}$ - Open the file in text mode for reading. If the file is not in the cache, it is not downloaded to the cache but opened directly from the web. If the file is a .gz file, it is automatically unpacked (i.e. the returned file instance is an opened GzipFile).
- read_as_intervaltree() - Read a BED file into an interval tree data structure. Similarly, if the file is not in the cache, it is not automatically downloaded.
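The last two entries in the list above (text-mode reading of .gz files and loading BED intervals) can be approximated with the standard library alone. The sketch below is not the package's code: read_bed and overlapping are hypothetical helpers, the BED content is made up, and a plain linear scan is used as a simple stand-in for a real interval tree.

```python
import gzip
import os
import tempfile


def read_bed(path):
    """Parse the first four BED columns of a plain or gzipped file."""
    opener = gzip.open if path.endswith(".gz") else open
    intervals = []
    with opener(path, "rt") as handle:
        for line in handle:
            cols = line.rstrip("\n").split("\t")
            # Skip header/comment lines and lines that are too short.
            if len(cols) < 3 or line.startswith(("#", "track", "browser")):
                continue
            label = cols[3] if len(cols) > 3 else None
            intervals.append((cols[0], int(cols[1]), int(cols[2]), label))
    return intervals


def overlapping(intervals, chrom, start, end):
    """Linear scan for half-open [start, end) overlaps on one chromosome."""
    return [iv for iv in intervals
            if iv[0] == chrom and iv[1] < end and start < iv[2]]


# Tiny made-up BED file written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
bed_path = os.path.join(tmp_dir, "example.bed.gz")
with gzip.open(bed_path, "wt") as out:
    out.write("chr1\t0\t100\tA\nchr1\t100\t250\tB\nchr2\t0\t50\tC\n")

intervals = read_bed(bed_path)
hits = overlapping(intervals, "chr1", 90, 120)
print(hits)
```

A linear scan costs time proportional to the number of intervals per query, which is why an indexed structure such as an interval tree is preferable for large segmentation files.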
Note that Encode is not safe to use from multiple threads or multiple processes unless all required files have already been cached.
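One way to respect this restriction is to fill the cache from a single thread first and only then start parallel workers that read the cached copies. The sketch below shows that pattern with stand-ins: fetch, fake_download, and read_size are hypothetical helpers (no network access, no PyENCODE calls), and the force flag merely imitates the caching behavior described for fetch above.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch(name, cache_dir, download, force=False):
    """Return the cached path, calling download(name) only when needed."""
    path = os.path.join(cache_dir, name)
    if force or not os.path.exists(path):
        with open(path, "wb") as out:
            out.write(download(name))
    return path


def fake_download(name):
    # Stand-in for a real network request.
    return ("contents of %s" % name).encode("ascii")


def read_size(path):
    """Worker task: only reads a file that is already in the cache."""
    with open(path, "rb") as handle:
        return len(handle.read())


cache_dir = tempfile.mkdtemp()
names = ["a.bed", "b.bed", "c.bed"]

# Step 1: fill the cache serially, so no two workers ever download
# (and write) the same file at the same time.
paths = [fetch(n, cache_dir, fake_download) for n in names]

# Step 2: parallel workers touch the cache in read-only fashion.
with ThreadPoolExecutor(max_workers=3) as pool:
    sizes = list(pool.map(read_size, paths))
print(sizes)
```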