Tools for reading, writing, merging, and remapping SNPs
Single nucleotide polymorphisms (SNPs)
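Capabilities
- Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources
- Read and write VCF files for Builds 36, 37, and 38 (e.g., convert 23andMe to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
- Remap SNPs between assemblies / builds (e.g., convert SNPs from Build 36 to Build 37, etc.)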
Examples
Download Example Data
Let's download some example data from openSNP:
>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz
Load Raw Data
Load a 23andMe raw data file:
>>> from snps import SNPs
>>> s = SNPs('resources/662.23andme.340.txt.gz')
The loaded SNPs are available via a pandas.DataFrame:
>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> len(df)
991786
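Because the SNPs are exposed as an ordinary pandas.DataFrame, standard pandas operations apply. The sketch below builds a small stand-in table with the same layout (index named rsid; columns chrom, pos, genotype) so that it runs without downloading anything. The rows and rsids are hypothetical and are not taken from the example dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the layout mirrors the DataFrame shown above.
df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "chrom": ["1", "1", "X", "MT"],
        "pos": [1000, 2000, 3000, 4000],
        "genotype": ["AA", None, "CT", "G"],
    }
).set_index("rsid")

# Select the SNPs located on chromosome 1.
chrom1 = df[df["chrom"] == "1"]

# Count the SNPs that have a genotype call (non-null genotype).
called = int(df["genotype"].notna().sum())

# Look up the position of a single SNP through the rsid index.
pos_rs3 = int(df.loc["rs3", "pos"])
```

The same expressions should work on a real table of this shape, since they rely only on the column names and the rsid index.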
snps also attempts to detect the build / assembly of the data:
>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'
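One way to think about build detection is to compare the observed position of a well-known SNP against its position in each build. The sketch below illustrates that idea only; it is not necessarily how snps detects the build. The function detect_build and the table KNOWN_POSITIONS are hypothetical names, and the two positions for rs3094315 are the ones printed in the remapping session in this document.

```python
# Positions of rs3094315 as printed in this document's remapping session
# (752566 on Build 37, 817186 on Build 38). A practical detector would
# likely consult several reference SNPs; this table is illustrative only.
KNOWN_POSITIONS = {
    "rs3094315": {37: 752566, 38: 817186},
}


def detect_build(positions):
    """Return the build whose known position matches, or None if undetermined.

    `positions` maps rsid -> observed position in the loaded file.
    """
    for rsid, by_build in KNOWN_POSITIONS.items():
        observed = positions.get(rsid)
        if observed is None:
            continue  # this reference SNP is absent from the file
        for build, pos in by_build.items():
            if observed == pos:
                return build
    return None


build = detect_build({"rs3094315": 752566})
```

Returning None rather than guessing keeps the undetermined case explicit for the caller.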
Remap SNPs
Let's remap the SNPs to change the assembly / build:
>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
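Conceptually, remapping translates a position from one assembly's coordinates to another's using aligned segments. The following is a minimal sketch of that idea, not the library's implementation. Segment, SEGMENTS, and remap_position are hypothetical names, and the segment boundaries are invented so that the rs3094315 position from the session above (752566) maps to 817186. Real assembly mappings may also involve strand changes, which this sketch ignores.

```python
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    """A hypothetical aligned region shared by two assemblies."""
    source_start: int  # first position in the source assembly (inclusive)
    source_end: int    # last position in the source assembly (inclusive)
    target_start: int  # position in the target assembly matching source_start


# Invented boundaries; only the resulting offset matches the example above.
SEGMENTS = [Segment(700_000, 800_000, 764_620)]


def remap_position(pos: int, segments: Sequence[Segment] = SEGMENTS) -> Optional[int]:
    """Translate `pos` to the target assembly, or return None if unmapped."""
    for seg in segments:
        if seg.source_start <= pos <= seg.source_end:
            return seg.target_start + (pos - seg.source_start)
    return None  # position falls outside every aligned segment


new_pos = remap_position(752_566)
```

Positions that fall outside every segment come back as None, loosely analogous to SNPs that cannot be remapped.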
Merge Raw Data Files
The dataset consists of raw data files from two different DNA testing sources. Let's combine these files using a SNPsCollection.
>>> from snps import SNPsCollection
>>> sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
Loading resources/662.ftdna-illumina.341.csv.gz
>>> sc.build
36
>>> chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
Downloading resources/NCBI36_GRCh37.tar.gz
>>> sc.snp_count
708092
As data is added, it is compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)
>>> sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
169
>>> sc.snp_count
1006960
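The merge behavior reported above (keep original positions, null out conflicting genotypes, add new SNPs) can be sketched with plain dictionaries standing in for the DataFrame. This is an illustration under stated assumptions, not the library's code: merge_snps and normalize are hypothetical helpers, the sample rows are invented, treating "CT" and "TC" as equal is an assumption, and the discrepancy threshold is not modelled.

```python
def normalize(genotype):
    """Sort alleles so 'CT' and 'TC' compare equal; None means no call."""
    return None if genotype is None else "".join(sorted(genotype))


def merge_snps(existing, incoming):
    """Merge `incoming` into a copy of `existing`.

    Both arguments map rsid -> (chrom, pos, genotype).
    Returns (merged, discrepant_positions, discrepant_genotypes).
    """
    merged = dict(existing)
    discrepant_positions = []
    discrepant_genotypes = []
    for rsid, (chrom, pos, genotype) in incoming.items():
        if rsid not in merged:
            merged[rsid] = (chrom, pos, genotype)  # SNP seen for the first time
            continue
        old_chrom, old_pos, old_genotype = merged[rsid]
        if pos != old_pos:
            discrepant_positions.append(rsid)  # the original position is kept
        old_norm, new_norm = normalize(old_genotype), normalize(genotype)
        if old_norm is None:
            # No existing call: take whatever the incoming file provides.
            merged[rsid] = (old_chrom, old_pos, genotype)
        elif new_norm is not None and old_norm != new_norm:
            discrepant_genotypes.append(rsid)
            merged[rsid] = (old_chrom, old_pos, None)  # mark genotype as null
    return merged, discrepant_positions, discrepant_genotypes


existing = {
    "rs1": ("1", 100, "AA"),
    "rs2": ("1", 200, "CT"),
    "rs3": ("2", 300, "GG"),
    "rs5": ("4", 500, "AA"),
}
incoming = {
    "rs2": ("1", 200, "TC"),  # same call, alleles in a different order
    "rs3": ("2", 305, "GG"),  # position differs
    "rs5": ("4", 500, "AG"),  # genotype differs
    "rs4": ("3", 400, "TT"),  # new SNP
}
merged, bad_pos, bad_geno = merge_snps(existing, incoming)
```

In this toy run the merged table holds five SNPs, rs3 keeps position 300, and the genotype of rs5 becomes null.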
Save SNPs
Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset, consisting of more than 1 million SNPs, to a CSV file:
>>> saved_snps = sc.save_snps()
Saving output/User662_GRCh37.csv
Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
>>> saved_snps = sc.save_snps(vcf=True)
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf
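The reference sequences matter because a VCF record states each genotype relative to the reference base at that position. The sketch below shows how a single data line could be derived once the reference base is known. vcf_line is a hypothetical helper, not part of snps, and the sample values (including the reference base) are invented for illustration.

```python
def vcf_line(chrom, pos, rsid, ref, genotype):
    """Format one VCF data line for a genotype of single-base alleles.

    Columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample.
    """
    # Collect alternate alleles in order of first appearance.
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alleles = [ref] + alts  # index 0 is the reference, 1.. are alternates
    gt = "/".join(str(alleles.index(a)) for a in genotype)
    alt_field = ",".join(alts) if alts else "."
    return "\t".join(
        [chrom, str(pos), rsid, ref, alt_field, ".", ".", ".", "GT", gt]
    )


# Hypothetical SNP with reference base "A" and a heterozygous A/G call.
line = vcf_line("1", 1000, "rs1", "A", "AG")
```

A homozygous reference call produces "." in the ALT column and a GT of 0/0, while a call with two different non-reference alleles yields two comma-separated ALT entries.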
All output files are saved to the output directory.
Documentation
Documentation is available here.
Acknowledgements
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP, and Open Humans.