A versatile tool to perform pile-up analysis on Hi-C data in .cool format.
coolpup.py
.cool file pile-ups with python.
Introduction
A versatile tool to perform pile-up analysis on Hi-C data in .cool format (https://github.com/mirnylab/cooler). And who doesn't like cool puppies?
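.cool is a modern and flexible (and, in my opinion, the best) format for storing Hi-C data. It uses HDF5 to store a sparse representation of the Hi-C data, which keeps memory requirements low when working with high-resolution datasets. Another popular format for Hi-C data, .hic, can be converted into .cool files using hic2cool (https://github.com/4dn-dcic/hic2cool). For details, see: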
Abdennur, N., and Mirny, L. (2019). Cooler: scalable storage for Hi-C data and other genomically-labeled arrays. bioRxiv, 557660. doi:10.1101/557660
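To illustrate why a sparse representation saves memory, here is a minimal sketch in plain Python. It stores only the nonzero pixels of the upper triangle of a symmetric contact map as (bin1, bin2, count) triplets. The function names and the toy matrix are hypothetical, and the real .cool layout contains more than this (for example a bin table and indexes), so treat it as an illustration of the idea rather than of the file format itself.

```python
def dense_to_pixels(matrix):
    """Return (bin1, bin2, count) triplets for the nonzero upper triangle."""
    pixels = []
    for i, row in enumerate(matrix):
        for j in range(i, len(row)):
            if row[j] != 0:
                pixels.append((i, j, row[j]))
    return pixels


def pixels_to_dense(pixels, n_bins):
    """Rebuild the full symmetric matrix from upper-triangle triplets."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for i, j, count in pixels:
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix


# Toy symmetric contact map: 5 bins, empty far away from the diagonal.
dense = [
    [4, 2, 0, 0, 0],
    [2, 5, 1, 0, 0],
    [0, 1, 3, 0, 0],
    [0, 0, 0, 6, 2],
    [0, 0, 0, 2, 7],
]
pixels = dense_to_pixels(dense)
restored = pixels_to_dense(pixels, n_bins=len(dense))
```

In this toy case 8 triplets are enough to describe all 25 cells, and the saving grows as maps become larger and sparser at high resolution.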
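Pile-up analysis
Pile-up analysis checks whether certain regions tend to interact with each other: the Hi-C submatrices around the regions of interest are extracted and averaged into a single map.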
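The averaging step can be sketched in a few lines of plain Python. This is a simplified illustration, not the code used by coolpup.py: the helper names, the toy matrix and the choice of bin coordinates are all assumptions, and the sketch does no bounds checking, so every centre must lie at least `pad` bins away from the matrix edges.

```python
def extract_window(matrix, row, col, pad):
    """Return the (2*pad+1) x (2*pad+1) submatrix centred on (row, col)."""
    return [
        matrix[i][col - pad : col + pad + 1]
        for i in range(row - pad, row + pad + 1)
    ]


def pileup(matrix, centres, pad):
    """Average the windows centred on each (row, col) pair in `centres`."""
    size = 2 * pad + 1
    total = [[0.0] * size for _ in range(size)]
    for row, col in centres:
        window = extract_window(matrix, row, col, pad)
        for i in range(size):
            for j in range(size):
                total[i][j] += window[i][j]
    return [[value / len(centres) for value in line] for line in total]


# Toy 10x10 map with background 1 and two enriched pixels ("loops").
toy = [[1.0] * 10 for _ in range(10)]
toy[1][5] = 5.0
toy[3][8] = 3.0
average = pileup(toy, centres=[(1, 5), (3, 8)], pad=1)
```

The central pixel of `average` holds the mean of the two loop strengths, while the surrounding pixels stay at the background level.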
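A further step is normalization to the expected values. This can be done in two ways: either with a provided file containing the expected interaction values at different distances (the output of cooltools compute-expected), or directly from the Hi-C data, by dividing the pile-ups by pile-ups over randomly shifted control regions. If neither normalization approach is used (just set --nshifts 0), this is essentially identical to the APA approach (Rao et al., 2014), which can be used for averaging strongly interacting regions, e.g. annotated loops. For weaker interactors, the decay of contact probability with distance can hide any focal enrichment that could otherwise be observed.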
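The control-based normalization described above can be sketched as follows. This is a hedged illustration under several assumptions: the function names are hypothetical, shifts are expressed in bins (the command-line options take base pairs), each control is made by moving both anchors by the same offset so that the genomic separation is preserved, and the sketch assumes at least one control window fits inside the matrix. How coolpup.py draws its shifts internally may differ.

```python
import random


def window_sum(matrix, centres, pad):
    """Sum the (2*pad+1) x (2*pad+1) windows centred on each (row, col)."""
    size = 2 * pad + 1
    total = [[0.0] * size for _ in range(size)]
    for row, col in centres:
        for i in range(size):
            for j in range(size):
                total[i][j] += matrix[row - pad + i][col - pad + j]
    return total


def shifted_controls(centres, n_bins, pad, nshifts, minshift, maxshift, rng):
    """Pick nshifts control positions per centre by moving it along the diagonal."""
    controls = []
    for row, col in centres:
        # All offsets (either direction) whose window still fits in the matrix.
        valid = [
            s
            for d in range(minshift, maxshift + 1)
            for s in (-d, d)
            if pad <= row + s < n_bins - pad and pad <= col + s < n_bins - pad
        ]
        if not valid:
            continue  # no room to shift this pair; it gets no controls
        for _ in range(nshifts):
            s = rng.choice(valid)
            controls.append((row + s, col + s))
    return controls


def normalised_pileup(matrix, centres, pad, nshifts, minshift, maxshift, seed=0):
    """Divide the average observed window by the average control window."""
    rng = random.Random(seed)
    controls = shifted_controls(
        centres, len(matrix), pad, nshifts, minshift, maxshift, rng
    )
    observed = window_sum(matrix, centres, pad)
    control = window_sum(matrix, controls, pad)
    # Ratio of means: (observed / n_centres) / (control / n_controls).
    factor = len(controls) / len(centres)
    return [
        [factor * o / c for o, c in zip(obs_row, ctl_row)]
        for obs_row, ctl_row in zip(observed, control)
    ]


# Toy map: contacts decay with distance, plus one 3-fold enriched pixel.
n = 30
toy = [[1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
toy[10][18] *= 3
toy[18][10] *= 3
result = normalised_pileup(toy, [(10, 18)], pad=1, nshifts=5, minshift=3, maxshift=8)
```

Because the controls sit at the same distance from the diagonal as the original pair, the distance decay cancels out: the central pixel of `result` is close to 3 and the pixels around it are close to 1.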
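coolpup.py is particularly well suited, performance-wise, to analysing huge numbers of potential interactions, since it loads whole chromosomes into memory one by one (or in parallel, to speed things up) and then extracts small submatrices quickly. Having to read everything into memory makes it relatively slow for small numbers of loops, but performance does not degrade until the number of interactions becomes very large.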
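The per-chromosome strategy can be illustrated with a small sketch. The names here (`group_by_chromosome`, `pileup_by_chromosome`, `toy_loader`) are hypothetical, and the loader stands in for whatever reads one chromosome's contact matrix from disk. The point is only that the expensive load happens once per chromosome, no matter how many windows are cut from it.

```python
from collections import defaultdict


def group_by_chromosome(pairs):
    """Group (chrom, bin1, bin2) pairs so each chromosome is handled once."""
    grouped = defaultdict(list)
    for chrom, bin1, bin2 in pairs:
        grouped[chrom].append((bin1, bin2))
    return dict(grouped)


def pileup_by_chromosome(pairs, load_matrix, pad):
    """Sum windows over all pairs, loading each chromosome's matrix once."""
    size = 2 * pad + 1
    total = [[0.0] * size for _ in range(size)]
    n_windows = 0
    for chrom, positions in group_by_chromosome(pairs).items():
        matrix = load_matrix(chrom)  # the expensive step, once per chromosome
        for row, col in positions:
            for i in range(size):
                for j in range(size):
                    total[i][j] += matrix[row - pad + i][col - pad + j]
            n_windows += 1
    return total, n_windows


load_calls = []


def toy_loader(chrom):
    """Stand-in for reading a whole-chromosome matrix; records each call."""
    load_calls.append(chrom)
    return [[1.0] * 12 for _ in range(12)]


pairs = [("chr1", 2, 6), ("chr2", 3, 8), ("chr1", 4, 9), ("chr2", 2, 5)]
total, n_windows = pileup_by_chromosome(pairs, toy_loader, pad=1)
```

Four windows are summed here, but the loader is called only twice. With only a handful of loops the load dominates the run time, which matches the trade-off described above.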
Getting started
Installation
All requirements apart from cooltools are available from PyPI or conda. For cooltools, do
pip install https://github.com/mirnylab/cooltools/archive/master.zip
For coolpuppy (and the other dependencies), simply do:
pip install coolpuppy
or
pip install https://github.com/Phlya/coolpuppy/archive/master.zip
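to get the latest version from GitHub. This makes coolpup.py callable in your terminal and importable in Python as coolpuppy.
Usage
The help message should get you started with the tool. It is a single command with a lot of options, and it can do a lot of things!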
Usage: coolpup.py [-h] [--pad PAD] [--minshift MINSHIFT] [--maxshift MAXSHIFT]
                  [--nshifts NSHIFTS] [--expected EXPECTED]
                  [--mindist MINDIST] [--maxdist MAXDIST] [--minsize MINSIZE]
                  [--maxsize MAXSIZE] [--excl_chrs EXCL_CHRS]
                  [--incl_chrs INCL_CHRS] [--subset SUBSET] [--anchor ANCHOR]
                  [--by_window] [--save_all] [--local] [--unbalanced]
                  [--coverage_norm] [--rescale] [--rescale_pad RESCALE_PAD]
                  [--rescale_size RESCALE_SIZE] [--weight_name WEIGHT_NAME]
                  [--n_proc N_PROC] [--outdir OUTDIR] [--outname OUTNAME]
                  [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                  coolfile baselist

positional arguments:
  coolfile              Cooler file with your Hi-C data
  baselist              A 3-column bed file or a 6-column double-bed file
                        (i.e. chr1,start1,end1,chr2,start2,end2). Should be
                        tab-delimited. With a bed file, will consider all cis
                        combinations of intervals. To pileup features along
                        the diagonal instead, use the --local argument. Can be
                        piped in via stdin, then use "-".

optional arguments:
  -h, --help            show this help message and exit
  --pad PAD             Padding of the windows around the centres of specified
                        features (i.e. final size of the matrix is 2×pad+res),
                        in kb. Ignored with --rescale, use --rescale_pad
                        instead. (default: 100)
  --minshift MINSHIFT   Shortest distance for randomly shifting coordinates
                        when creating controls (default: 100000)
  --maxshift MAXSHIFT   Longest distance for randomly shifting coordinates
                        when creating controls (default: 1000000)
  --nshifts NSHIFTS     Number of control regions per averaged window
                        (default: 10)
  --expected EXPECTED   File with expected (output of cooltools compute-
                        expected). If None, don't use expected and use
                        randomly shifted controls (default: None)
  --mindist MINDIST     Minimal distance of intersections to use. If not
                        specified, uses 2*pad+2 (in bins) as mindist (default:
                        None)
  --maxdist MAXDIST     Maximal distance of intersections to use (default:
                        None)
  --minsize MINSIZE     Minimal length of features to use for local analysis
                        (default: None)
  --maxsize MAXSIZE     Maximal length of features to use for local analysis
                        (default: None)
  --excl_chrs EXCL_CHRS
                        Exclude these chromosomes from analysis (default:
                        chrY,chrM)
  --incl_chrs INCL_CHRS
                        Include these chromosomes; default is all. excl_chrs
                        overrides this. (default: all)
  --subset SUBSET       Take a random sample of the bed file - useful for
                        files with too many features to run as is, i.e. some
                        repetitive elements. Set to 0 or lower to keep all
                        data. (default: 0)
  --anchor ANCHOR       A UCSC-style coordinate to use as an anchor to create
                        intersections with coordinates in the baselist
                        (default: None)
  --by_window           Create a pile-up for each coordinate in the baselist.
                        Will save a master-table with coordinates, their
                        enrichments and cornerCV, which is reflective of
                        noisiness (default: False)
  --save_all            If by-window, save all individual pile-ups in a
                        separate json file (default: False)
  --local               Create local pileups, i.e. along the diagonal
                        (default: False)
  --unbalanced          Do not use balanced data. Useful for single-cell Hi-C
                        data together with --coverage_norm, not recommended
                        otherwise. (default: False)
  --coverage_norm       If --unbalanced, also add coverage normalization based
                        on chromosome marginals (default: False)
  --rescale             Do not use centres of features and pad, and rather use
                        the actual feature sizes and rescale pileups to the
                        same shape and size (default: False)
  --rescale_pad RESCALE_PAD
                        If --rescale, padding in fraction of feature length
                        (default: 1.0)
  --rescale_size RESCALE_SIZE
                        If --rescale, this is used to determine the final size
                        of the pileup, i.e. it will be size×size. Due to
                        technical limitation in the current implementation,
                        has to be an odd number (default: 99)
  --weight_name WEIGHT_NAME
                        Name of the norm to use for getting balanced data
                        (default: weight)
  --n_proc N_PROC       Number of processes to use. Each process works on a
                        separate chromosome, so might require quite a bit more
                        memory, although the data are always stored as sparse
                        matrices (default: 1)
  --outdir OUTDIR       Directory to save the data in (default: .)
  --outname OUTNAME     Name of the output file. If not set, is generated
                        automatically to include important information.
                        (default: auto)
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level. (default: INFO)
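For scripted runs, it can help to assemble the command line from Python. The sketch below only uses flags that appear in the help message above. The helper name, the file names and the output name are hypothetical, and omitting --nshifts when an expected file is passed is a choice made in this sketch, not documented behaviour of the tool.

```python
import shlex


def build_coolpup_command(coolfile, baselist, pad_kb=100, nshifts=10,
                          expected=None, local=False, n_proc=1, outname=None):
    """Assemble a coolpup.py argument list from options in the help text."""
    cmd = ["coolpup.py", "--pad", str(pad_kb), "--n_proc", str(n_proc)]
    if expected is not None:
        # Normalize with a precomputed expected file (assumed path).
        cmd += ["--expected", expected]
    else:
        # Otherwise rely on randomly shifted controls.
        cmd += ["--nshifts", str(nshifts)]
    if local:
        cmd.append("--local")
    if outname is not None:
        cmd += ["--outname", outname]
    # Positional arguments go last: coolfile, then baselist.
    return cmd + [coolfile, baselist]


# Hypothetical input files, used only for illustration.
cmd = build_coolpup_command(
    "sample.cool", "loops.bed", pad_kb=50, outname="loops_pileup.txt"
)
printable = " ".join(shlex.quote(part) for part in cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once coolpup.py is installed, and `printable` gives the equivalent shell command for logging.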
Currently, coolpup.py does not support inter-chromosomal pile-ups, but this is planned for the future.
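As a rough sketch of what "all cis combinations of intervals" means for a 3-column bed file, the snippet below pairs interval centres within each chromosome and applies the distance filters. Several things are assumptions here: the function names, the conversion of --pad from kb to bins by integer division, the use of interval midpoints, and whether the mindist and maxdist bounds are inclusive. Only the default of 2*pad+2 bins for mindist comes from the help message.

```python
from itertools import combinations


def pad_in_bins(pad_kb, resolution_bp):
    """Convert a padding given in kb into a number of bins (assumed rounding)."""
    return pad_kb * 1000 // resolution_bp


def cis_pairs(intervals, resolution_bp, pad_kb=100,
              mindist_bins=None, maxdist_bins=None):
    """Return (chrom, bin1, bin2) for same-chromosome pairs of interval centres."""
    if mindist_bins is None:
        # Default stated in the help message: 2*pad+2, in bins.
        mindist_bins = 2 * pad_in_bins(pad_kb, resolution_bp) + 2
    by_chrom = {}
    for chrom, start, end in intervals:
        centre_bin = (start + end) // 2 // resolution_bp
        by_chrom.setdefault(chrom, []).append(centre_bin)
    pairs = []
    for chrom, bins in by_chrom.items():
        for bin1, bin2 in combinations(sorted(bins), 2):
            distance = bin2 - bin1
            if distance < mindist_bins:
                continue  # windows would be too close to each other
            if maxdist_bins is not None and distance > maxdist_bins:
                continue
            pairs.append((chrom, bin1, bin2))
    return pairs


# Toy bed intervals (chrom, start, end) at 10 kb resolution.
bed = [
    ("chr1", 100000, 110000),
    ("chr1", 150000, 160000),
    ("chr1", 300000, 310000),
    ("chr2", 50000, 60000),
    ("chr2", 400000, 410000),
]
pairs = cis_pairs(bed, resolution_bp=10000, pad_kb=20)
```

With a 20 kb pad at 10 kb resolution the default minimal distance is 6 bins, so the two chr1 intervals that are only 5 bins apart are not paired, and no pair ever spans two chromosomes.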
Plotting results
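For flexible plotting, it is best to use a plotting library directly. Simple plotting capabilities are nevertheless included in this package: just run plotpup.py with the desired options and list all the coolpup.py output files you would like to plot.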
Usage: plotpup.py [-h] [--cmap CMAP] [--symmetric SYMMETRIC] [--vmin VMIN]
                  [--vmax VMAX] [--scale {linear,log}]
                  [--cbar_mode {edge,each,single}] [--n_cols N_COLS]
                  [--col_names COL_NAMES] [--row_names ROW_NAMES]
                  [--norm_corners NORM_CORNERS] [--enrichment ENRICHMENT]
                  [--output OUTPUT]
                  [pileup_files [pileup_files ...]]

positional arguments:
  pileup_files          All files to plot (default: None)

optional arguments:
  -h, --help            show this help message and exit
  --cmap CMAP           Colourmap to use (see
                        https://matplotlib.org/users/colormaps.html) (default:
                        coolwarm)
  --symmetric SYMMETRIC
                        Whether to make colormap symmetric around 1, if log
                        scale (default: True)
  --vmin VMIN           Value for the lowest colour (default: None)
  --vmax VMAX           Value for the highest colour (default: None)
  --scale {linear,log}  Whether to use linear or log scaling for mapping
                        colours (default: log)
  --cbar_mode {edge,each,single}
                        Whether to show a single colorbar, one per row or one
                        for each subplot (default: single)
  --n_cols N_COLS       How many columns to use for plotting the data. If 0,
                        automatically make the figure as square as possible
                        (default: 0)
  --col_names COL_NAMES
                        A comma separated list of column names (default: None)
  --row_names ROW_NAMES
                        A comma separated list of row names (default: None)
  --norm_corners NORM_CORNERS
                        Whether to normalize pileups by their top left and
                        bottom right corners. 0 for no normalization, positive
                        number to define the size of the corner squares whose
                        values are averaged (default: 0)
  --enrichment ENRICHMENT
                        Whether to show the level of enrichment in the central
                        pixels. 0 to not show, odd positive number to define
                        the size of the central square whose values are
                        averaged (default: 1)
  --output OUTPUT       Where to save the plot (default: pup.pdf)
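The --norm_corners and --enrichment options describe two simple summaries of a pile-up matrix, which can be sketched as below. The function names and the toy matrix are hypothetical, and the details are assumptions based only on the help text: the background is taken as the mean of the two corner averages, and the enrichment as the plain mean of the central square. plotpup.py may compute these differently.

```python
def square_mean(matrix, row0, col0, size):
    """Mean of the size x size square whose top-left cell is (row0, col0)."""
    values = [
        matrix[i][j]
        for i in range(row0, row0 + size)
        for j in range(col0, col0 + size)
    ]
    return sum(values) / len(values)


def normalise_by_corners(pileup, corner_size):
    """Divide the pile-up by the mean of its top-left and bottom-right corners."""
    n = len(pileup)
    top_left = square_mean(pileup, 0, 0, corner_size)
    bottom_right = square_mean(pileup, n - corner_size, n - corner_size, corner_size)
    background = (top_left + bottom_right) / 2
    return [[value / background for value in row] for row in pileup]


def central_enrichment(pileup, size=1):
    """Mean of the central size x size square (size should be odd)."""
    start = len(pileup) // 2 - size // 2
    return square_mean(pileup, start, start, size)


# Toy 5x5 pile-up: uniform background of 2 with an enriched central pixel.
toy = [[2.0] * 5 for _ in range(5)]
toy[2][2] = 8.0
normalised = normalise_by_corners(toy, corner_size=2)
centre_1 = central_enrichment(normalised, size=1)
centre_3 = central_enrichment(normalised, size=3)
```

After corner normalization the background sits at 1, so the single-pixel enrichment reads directly as a fold change, while averaging over a 3x3 square dilutes a sharp peak.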
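Citing coolpup.py
Until it has been published in a peer-reviewed journal, please cite our preprint
Coolpup.py - a versatile tool to perform pile-up analysis of Hi-C data
Ilya M. Flyamer, Robert S. Illingworth, Wendy A. Bickmore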
https://www.biorxiv.org/content/10.1101/586537v1
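This tool has been used in the following publications
DNA methylation directs polycomb-dependent 3D genome re-organisation in naive pluripotency
Katy A McLaughlin, Ilya M Flyamer, John P Thomson, Heidi K Mjoseng, Ruchi Shukla, Iain Williamson, Graeme R Grimes, Robert S Illingworth, Ian R Adams, Sari Pennings, Richard R Meehan, Wendy A Bickmore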
https://www.biorxiv.org/content/10.1101/527309v1