snps tutorial.ipynb
A tutorial describing how to use the snps Python package to work with genotype files.
! pip install snps
import bz2
import gzip
import os
import tempfile

import ipywidgets as widgets
import requests
from ipywidgets import Layout
from ohapi import api
from snps import SNPs
Let's download some example data from openSNP:
from snps.resources import Resources
r = Resources()
paths = r.download_example_datasets()
Load a 23andMe raw data file:
The loaded SNPs are available via a pandas.DataFrame:
s = SNPs('resources/662.23andme.340.txt.gz')
df = s.snps
df.columns.values
df.index.name
len(df)
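For readers without the example file at hand, a tiny stand-in DataFrame with the same layout (hypothetical rsids, positions, and genotypes — not data from the actual file) illustrates the typical inspections:

```python
import pandas as pd

# Hypothetical stand-in for s.snps: indexed by rsid, with
# chrom, pos, and genotype columns like a loaded 23andMe file.
df = pd.DataFrame(
    {
        "chrom": ["1", "1", "X"],
        "pos": [752566, 798959, 2700157],
        "genotype": ["AA", "AG", "TT"],
    },
    index=pd.Index(["rs3094315", "rs11240777", "rs6594028"], name="rsid"),
)

print(df.columns.values)  # ['chrom' 'pos' 'genotype']
print(df.index.name)      # rsid
print(len(df))            # 3
```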
snps also attempts to detect the build / assembly of the data:
s.build
s.build_detected
s.assembly
Let's remap the SNPs to change the assembly / build:
# look up the position of a SNP by its rsid
s.snps.loc["rs3094315"].pos
chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
s.build
s.assembly
s.snps.loc["rs3094315"].pos
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
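As a plain-Python aside (these constants are a convenience for this tutorial, not part of the snps API), the build numbers correspond to assembly names as follows:

```python
# Build number to assembly name, per the builds listed above.
ASSEMBLY_NAMES = {36: "NCBI36", 37: "GRCh37", 38: "GRCh38"}

for build, assembly in ASSEMBLY_NAMES.items():
    print(build, assembly)
```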
The dataset consists of raw data files from two different DNA testing companies. Let's combine these files using a SNPsCollection.
from snps import SNPsCollection
sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
sc.build
chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
sc.snp_count
As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)
sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
len(sc.discrepant_snps) # SNPs with discrepant positions and genotypes, dropping dups
sc.snp_count
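Conceptually, the discrepancy check is a join on rsid followed by position and genotype comparisons. A minimal pandas sketch with toy data (hypothetical rsids and positions; not the library's actual implementation):

```python
import pandas as pd

# Toy stand-ins for the SNPs from two files, indexed by rsid.
a = pd.DataFrame(
    {"pos": [100, 200, 300], "genotype": ["AA", "AG", "TT"]},
    index=pd.Index(["rs1", "rs2", "rs3"], name="rsid"),
)
b = pd.DataFrame(
    {"pos": [100, 201, 300], "genotype": ["AA", "AG", "GG"]},
    index=pd.Index(["rs1", "rs2", "rs3"], name="rsid"),
)

# Join on rsid, then flag rows where the files disagree.
merged = a.join(b, lsuffix="_a", rsuffix="_b")
discrepant_positions = merged[merged["pos_a"] != merged["pos_b"]]
discrepant_genotypes = merged[merged["genotype_a"] != merged["genotype_b"]]
print(list(discrepant_positions.index))  # ['rs2']
print(list(discrepant_genotypes.index))  # ['rs3']
```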
Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset, consisting of over 1 million SNPs, to a CSV file:
saved_snps = sc.save_snps()
Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
# the reference sequences for this assembly are downloaded as needed
saved_snps = sc.save_snps(vcf=True)
All output files are saved to the output directory.
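To see what was written, the output directory can be listed with the standard library (the path is assumed to be relative to the notebook's working directory):

```python
from pathlib import Path

# Collect the names of any files written to the output directory.
output_files = sorted(p.name for p in Path("output").glob("*"))
print(output_files)
```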
user = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
# 128 - 23andMe Upload
# 129 - AncestryDNA Upload
# 120 - FamilyTreeDNA Upload
# 40 - Gencove Upload
# 131 - Genome/Exome Upload
# 92 - Imputer
sources = [
    'direct-sharing-128',
    'direct-sharing-129',
    'direct-sharing-120',
    'direct-sharing-40',
    'direct-sharing-131',
    'direct-sharing-92',
]
available_sources = []
for record in user['data']:
    if record['source'] in sources:
        description = f"{record['basename']}--{record['metadata']['description']}--{record['id']}"
        download_url = record['download_url']
        available_sources.append((description, download_url))
selected_file = widgets.Dropdown(
    options=available_sources,
    description='Select a Genotype File:',
    disabled=False,
    layout=Layout(width='80%', height='50px'),
    style={'description_width': 'initial'},
)
display(selected_file)
def process_selected_file(selected_file_url):
    """Download a genotype file, decompressing it if necessary."""
    datafile = requests.get(selected_file_url)
    try:
        # try bzip2 first
        return bz2.decompress(datafile.content)
    except OSError:
        pass
    try:
        # then gzip
        return gzip.decompress(datafile.content)
    except OSError:
        # neither bzip2 nor gzip; assume the file is uncompressed
        return datafile.content
processed_file = process_selected_file(selected_file.value)
tmp = tempfile.NamedTemporaryFile()
tmp.write(processed_file);
s = SNPs(tmp.name)
# close the temporary file (in Open Humans notebook only)
tmp.close()