HIFI-SE

HIFI-SE: Python project description


Hifi-Barcode-SE400

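The BGISEQ-500 platform has launched a new test sequencing kit capable of single-end 400 bp sequencing (SE400), which offers a simple and reliable way to produce DNA barcodes efficiently. This study explores the potential of BGISEQ-500 SE400 sequencing for building DNA barcode references, and provides an updated HIFI-Barcode software package that can generate COI barcode assemblies from HTS reads of 400 bp in length.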

Manual

manual book

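Versions

Version 1.0.5 (Python)
  • v1.0.5 2019-04-09 Added support for compressed fastq files; fixed a taxonomy-related bug.
  • v1.0.4 2019-04-02 Fixed a bug in "polish" and updated the bold_identification module.
  • v1.0.3 2018-12-14 Fixed a bug in "trim".
  • v1.0.2 2018-12-10 Added the "-trim" option to filter; mismatches in tag or primer sequences are now accepted when demultiplexing; reads of uneven length are now accepted for assembly; added "-ds" to drop short reads before assembly.
  • v1.0.1 2018-12-02 Added the "polish" function.
  • v1.0.0 HIFI-SE v1.0.0, 2018-11-22. Changes from the previous version:
    • Reformatted the Python code to follow PEP 8.
    • Fixed several minor bugs.
  • v0.0.3 HIFI-SE v0.0.3, 2018-11-15. Changes from the previous version:
    • Revised the descriptions of some parameters to make them easier to understand.
  • v0.0.1 HIFI-SE v0.0.1, 2018-11-03. Beta version, establishing the framework and implementing almost all functions.

Original Perl and Python versions (original source code)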

0.expected_error.pl
1.split_extract.pl
2.hificonnect.pl

0.expected_error.py
1.split_extract.py
2.hificonnect.py

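Installation

System requirements and dependencies

Operating system: HIFI-SE is designed to run on most platforms, including UNIX, Linux, macOS/X and Microsoft Windows. We have tested it on Linux and macOS/X, since those are the machines we develop on. HIFI-SE is written in Python and requires Python 3.5 or later.

Dependencies: Biopython and bold_identification.

Installation steps

  1. I only deploy the latest version on GitHub, so you can clone this repository to your local machine. However, cloning does not resolve package dependencies, so you need to install Biopython and bold_identification before using HIFI-SE. (Note: pip here is a link to pip3.)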

    git clone https://github.com/comery/HIFI-barcode-SE400.git
    pip install biopython
    pip install bold_identification  
    
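  2. Installing with pip is recommended, because it resolves package dependencies automatically, including the biopython and bold_identification packages.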

    pip install HIFI-SE
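
After either installation route, a quick environment check can confirm that the requirements listed above are met. The sketch below is only an illustration: check_environment is a hypothetical helper, not part of HIFI-SE. It assumes that Biopython imports as Bio, that the bold_identification package imports under the same name, and that vsearch should be on PATH because the assembly step refers to it.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("Bio", "bold_identification"), tools=("vsearch",)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # HIFI-SE states that it needs Python 3.5 or later.
    if sys.version_info < (3, 5):
        problems.append("Python 3.5 or later is required")
    # Import names are assumptions (see the paragraph above).
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append("missing Python module: " + name)
    # External executables are looked up on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append("executable not found on PATH: " + tool)
    return problems
```

Calling check_environment() with no arguments returns one message per missing requirement, so an empty list suggests the environment is ready.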

Usage (latest)

python3 HIFI-SE.py

./HIFI-SE.py
usage: HIFI-SE [-h] [-v]
               {all,filter,assign,assembly,polish,bold_identification} ...

Description

    An automatic pipeline for HIFI-SE400 project, including filtering
    raw reads, assigning reads to samples, assembly HIFI barcodes
    (COI sequences), polished assemblies, and do tax identification.
    See more: https://github.com/comery/HIFI-barcode-SE400

Versions

    1.0.4 (20190402)

Authors

    yangchentao at genomics.cn, BGI.
    mengguanliang at genomics.cn, BGI.

positional arguments:
  {all,filter,assign,assembly,polish,bold_identification}
    all                 run filter, assign and assembly.
    filter              remove or trim reads with low quality.
    assign              assign reads to samples by tags.
    assembly            do assembly from assigned reads,
                        output raw HIFI barcodes.
    polish              polish COI barcode assemblies,
                        output confident barcodes.
    bold_identification
                        do taxa identification on BOLD system

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Run step by step [filter -> assign -> assembly]

  • python3 HIFI-SE.py filter
usage: HIFI-SE filter [-h] -outpre <STR> -raw <STR> [-phred <INT>] [-e <INT>]
                      [-q <INT> <INT>] [-trim] [-n <INT>]

optional arguments:
  -h, --help      show this help message and exit

common arguments:
  -outpre <STR>   prefix for output files

filter arguments:
  -raw <STR>      input raw Single-End fastq file, and only
                  adapters should be removed; supposed on
                  Phred33 score system (BGISEQ-500)
  -phred <INT>    Phred score system, 33 or 64, default=33
  -e <INT>        expected error threshod, default=10
                  see more: http://drive5.com/usearch/manual/exp_errs.html
  -q <INT> <INT>  filter by base quality; for example: '20 5' means
                  dropping read which contains more than 5 percent of
                  quality score < 20 bases.
  -trim           whether to trim 5' end of read, it adapts to -e mode
                  or -q mode
  -n <INT>        remove reads containing [INT] Ns, default=1
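
The filter criteria described in the help text above can be sketched in a few lines of Python. This is not the HIFI-SE source code: the function names are hypothetical, and the exact comparison operators (for example, whether a read with exactly the threshold value is kept) are assumptions. The expected-error definition follows the USEARCH page linked in the help text: the sum of the per-base error probabilities 10^(-Q/10).

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities, 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores(quality, offset))


def low_quality_percent(quality, cutoff=20, offset=33):
    """Percentage of bases whose Phred score is below the cutoff (-q mode)."""
    scores = phred_scores(quality, offset)
    if not scores:
        return 0.0
    low = sum(1 for q in scores if q < cutoff)
    return 100.0 * low / len(scores)


def passes_filter(seq, quality, max_ee=10, max_n=1, offset=33):
    """Keep a read if it has fewer than max_n Ns and at most max_ee expected errors.

    The boundary behaviour is an assumption made for this sketch.
    """
    if seq.upper().count("N") >= max_n:
        return False
    return expected_errors(quality, offset) <= max_ee
```

For instance, a read of ten bases that all have quality "I" (Phred 40 with offset 33) has 10 × 0.0001 = 0.001 expected errors and passes easily.
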
  • python3 HIFI-SE.py assign
usage: HIFI-SE assign [-h] -outpre <STR> -index INT -fq <STR> -primer <STR>
                      [-outdir <STR>] [-tmis <INT>] [-pmis <INT>]

optional arguments:
  -h, --help     show this help message and exit

common arguments:
  -outpre <STR>  prefix for output files

index arguments:
  -index INT     the length of tag sequence in the ends of primers

when only run assign arguments:
  -fq <STR>      cleaned fastq file

assign arguments:
  -primer <STR>  taged-primer list, on following format:
                 Rev001   AAGCTAAACTTCAGGGTGACCAAAAAATCA
                 For001   AAGCGGTCAACAAATCATAAAGATATTGG
                 ...
                 this format is necessary!
  -outdir <STR>  output directory for assignment,default="assigned"
  -tmis <INT>    mismatch number in tag when demultiplexing, default=0
  -pmis <INT>    mismatch number in primer when demultiplexing, default=1
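
To make the primer list format and the -index, -tmis and -pmis options more concrete, here is a minimal sketch of tag-based matching. It is not the HIFI-SE implementation: the function names are hypothetical, it assumes the tag is the first -index bases of each tagged primer, and it only checks the start of a read in one orientation.

```python
def parse_primer_list(text, index):
    """Split each tagged primer into (tag, primer), keyed by the primer name."""
    primers = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, seq = fields
        primers[name] = (seq[:index].upper(), seq[index:].upper())
    return primers


def mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def match_read(read, primers, index, tmis=0, pmis=1):
    """Return the name of the first tagged primer the read starts with, or None."""
    read = read.upper()
    for name, (tag, primer) in primers.items():
        end = index + len(primer)
        if len(read) < end:
            continue  # read too short to contain tag plus primer
        if (mismatches(read[:index], tag) <= tmis
                and mismatches(read[index:end], primer) <= pmis):
            return name
    return None
```

With the two example primers from the help text and a tag length of 4, a read that begins with the For001 sequence is assigned to For001, and a single mismatch inside the primer part is still tolerated under the default -pmis of 1.
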
  • python3 HIFI-SE.py assembly
usage: HIFI-SE assembly [-h] -outpre <STR> -index INT -list FILE
                        [-vsearch <STR>] [-threads <INT>] [-cid FLOAT]
                        [-min INT] [-max INT] [-oid FLOAT] [-tp INT] [-ab INT]
                        [-seqs_lim INT] [-len INT] [-ds] [-mode INT] [-rc]
                        [-codon INT] [-frame INT]

optional arguments:
  -h, --help      show this help message and exit

common arguments:
  -outpre <STR>   prefix for output files

index arguments:
  -index INT      the length of tag sequence in the ends of primers

only run assembly arguments(not all):
  -list FILE      input file, fastq file list. [required]

software path:
  -vsearch <STR>  vsearch path(only needed if vsearch is not in $PATH)
  -threads <INT>  threads for vsearch, default=2
  -cid FLOAT      identity for clustering, default=0.98

assembly arguments:
  -min INT        minimun length of overlap, default=80
  -max INT        maximum length of overlap, default=90
  -oid FLOAT      minimun similarity of overlap region, default=0.95
  -tp INT         how many clusters will be used inassembly, recommend 2
  -ab INT         keep clusters to assembly if its abundance >=INT
  -seqs_lim INT   reads number limitation. by default,
                  no limitation for input reads
  -len INT        standard read length, default=400
  -ds             drop short reads away before assembly
  -mode INT       1 or 2; modle 1 is to cluster and keep
                  most [-tp] abundance clusters, or clusters
                  abundance more than [-ab], and then make a
                  consensus sequence for each cluster.
                  modle 2 is directly to make only one consensus
                  sequence without clustering. default=1
  -rc             whether to check amino acid
                  translation for reads, default not

translation arguments(when set -rc or -cc):
  -codon INT      codon usage table used to checktranslation, default=5
  -frame INT      start codon shift for amino acidtranslation, default=1
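
The -min, -max and -oid options describe an overlap between two sequences. The sketch below shows one simple way such an overlap could be located and used to join a forward sequence with an already reverse-complemented reverse sequence. It is an assumption-laden illustration, not the algorithm HIFI-SE uses: the function names are hypothetical, only ungapped suffix/prefix overlaps are considered, and ties are resolved in favour of the shorter overlap.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def best_overlap(forward, reverse, min_len=80, max_len=90, min_identity=0.95):
    """Return (overlap_length, identity) for the best suffix/prefix overlap, or None."""
    best = None
    upper = min(max_len, len(forward), len(reverse))
    for length in range(min_len, upper + 1):
        score = identity(forward[-length:], reverse[:length])
        if score >= min_identity and (best is None or score > best[1]):
            best = (length, score)
    return best


def join(forward, reverse, overlap_length):
    """Concatenate the two sequences, keeping the overlapping region only once."""
    return forward + reverse[overlap_length:]
```

With two 400 bp reads and an overlap of 80 to 90 bp, the joined sequence would be roughly 710 to 720 bp long before any primer or tag trimming.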

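Quick start

Files used in the tutorial

All related files can be found here. The important files for the tutorial are:

  • raw.fastq.gz, the raw output fastq file generated by the BGISEQ-500 SE400 module.
  • index_primer.list, the list of tagged primers.

Run "all"

Example: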

python3 HIFI-SE.py all -outpre hifi -trim -e 5 -raw test.raw.fastq -index 5 -primer index_primer.list -mode 1 -cid 0.98 -oid 0.95 -seqs_lim 50000 -threads 4 -tp 2
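
When the pipeline is launched from another Python script, the same command can be assembled as an argument list. The helper below is hypothetical and only reuses the flags and values of the example command above; it assumes HIFI-SE.py sits in the current working directory.

```python
import shlex


def build_all_command(raw, primer_list, outpre="hifi", index=5, threads=4):
    """Assemble the argument list for the "all" subcommand of the example."""
    return [
        "python3", "HIFI-SE.py", "all",
        "-outpre", outpre,
        "-trim", "-e", "5",
        "-raw", raw,
        "-index", str(index),
        "-primer", primer_list,
        "-mode", "1", "-cid", "0.98", "-oid", "0.95",
        "-seqs_lim", "50000",
        "-threads", str(threads),
        "-tp", "2",
    ]


command = build_all_command("test.raw.fastq", "index_primer.list")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The list can then be passed to subprocess.run(command, check=True) to execute the pipeline.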

Citation

This work has not been published yet, but it will be soon! I will update this section once it is published.
