Assay and Pipeline Information

The Assay and Pipeline Information page provides sample processing and pipeline analysis information. Below you can find the sample processing protocols for each omics assay and the series of computational and bioinformatics analyses involved in generating the Answer ALS datasets.

Assay Methods (ATAC-seq)

ATAC-seq library preparation was conducted by Diagenode Diagnostics (https://www.diagenodediagnostics.com) using their ATAC-seq Profiling service (Diagenode Cat# G02060000). For each sample, library preparation was performed on 50,000 cells. Nuclei were isolated using optimized lysis conditions: RSB-ATAC buffer containing 0.1% Igepal, 0.1% Tween 20, and 0.01% digitonin. Nuclei were lysed and chromatin was tagmented using the Tn5 enzyme (Illumina, ref# 15027865). Tagmented DNA was then purified using DiaPure columns (Diagenode, C03040001). Libraries were prepared from the purified tagmented DNA (starting amount 10 µl) using NEBNext High-Fidelity 2X Master Mix (NEB Cat# M0541). Libraries were purified and double size-selected using Agencourt AMPure XP beads (Beckman Coulter) and quantified using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Q32854). Finally, fragment size was analyzed with the High Sensitivity NGS Fragment Analysis Kit (DNF-474) on a Fragment Analyzer (Advanced Analytical Technologies, Inc.). Libraries were pooled and sequenced on an Illumina NovaSeq 6000 with 50 bp paired-end reads.

Data Pipeline (ATAC-seq)

Sequencing quality was assessed using FastQC, and reads were aligned to the GRCh38 genome build using Bowtie2. Open chromatin regions were identified separately for each sample using the peak-calling software MACS2. ATAC-seq data quality was evaluated according to ENCODE guidelines. The fraction of properly mapped and paired reads was >95% in all samples. The fragment size distribution revealed a clear nucleosome-free region and regularly spaced peaks corresponding to mono- and multi-nucleosomal fragments for each sample. All samples also showed promoter and enhancer enrichment. Samples with high (>2 SD) mitochondrial DNA contamination were excluded (average mtDNA fraction: 0.07 ± 0.01).
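
The mitochondrial-contamination filter described above (excluding samples more than 2 SD above the cohort mean) can be sketched as follows. The sample IDs and chrM read fractions are hypothetical; this is an illustration of the exclusion rule, not the pipeline's actual code.

```python
import statistics

def flag_high_mtdna(mtdna_fractions, n_sd=2.0):
    """Return sample IDs whose mitochondrial read fraction exceeds
    the cohort mean by more than n_sd standard deviations."""
    values = list(mtdna_fractions.values())
    cutoff = statistics.mean(values) + n_sd * statistics.stdev(values)
    return {sample for sample, frac in mtdna_fractions.items() if frac > cutoff}

# Hypothetical per-sample chrM read fractions.
fractions = {f"S{i}": 0.07 for i in range(1, 9)}
fractions["S9"] = 0.06
fractions["S10"] = 0.50   # heavily contaminated sample
excluded = flag_high_mtdna(fractions)
```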

Assay Methods (Proteomics)

Upon receipt, the iPSC pellets were lyophilized and stored at -80 °C. A small portion of each pellet was transferred to a 1.5 mL microfuge tube and urea lysis buffer (8 M urea, 75 mM NaCl, 50 mM Tris-HCl, pH 8.3) was added at a ratio of 1:2 (pellet to buffer). The resuspended pellets were sonicated at 40% amplitude until completely homogenized. Following homogenization, the samples were centrifuged for 5 minutes at 5000 g at 4 °C and the supernatant was transferred to a new tube. A protein BCA assay was performed per the manufacturer's instructions (Pierce BCA Protein Assay Kit, Catalog number: 23225). The volume corresponding to 100 µg of protein was taken from the cell lysate. Protein was reduced with 10 mM (final concentration) TCEP (tris(2-carboxyethyl)phosphine) at 56 °C for 30 minutes, followed by alkylation with 12 mM (final concentration) iodoacetamide (IAA) at room temperature for 50 minutes in the dark. The urea concentration was brought down to 1.6 M by the addition of 50 mM Tris-HCl, pH 8.3, and trypsin (1:50 enzyme-to-substrate ratio) was then used to digest the proteins. The enzyme-substrate mixture was incubated overnight at 37 °C on a shaker at 1400 rpm. The following day, the reaction was quenched by the addition of 10% formic acid (1% final concentration) and the pH was confirmed to be between 2 and 3. The sample was then centrifuged at 5000 g at room temperature and the supernatant was transferred to a new tube. Desalting was performed using an Oasis HLB 96-well plate following the manufacturer's instructions (Waters, SKU: 186000309). Desalted samples were dried in a vacuum concentrator and the dried peptides were reconstituted in 0.1% formic acid in water. A peptide BCA assay was performed to determine the peptide concentration of each sample.
Profile-mode .wiff files from the shotgun data acquisition were converted to mzML format using the AB Sciex Data Converter (in ProteinPilot mode) and then re-converted to mzXML format using ProteoWizard for peak list generation. The MS2 spectra were queried against the reviewed canonical Swiss-Prot human proteome database appended with the iRT protein sequence and shuffled sequence decoys. All data were searched using X!Tandem and Comet. The search parameters included a static modification of carbamidomethyl (C) and a variable modification of oxidation (M). The parent mass tolerance was set to 50 ppm and the monoisotopic fragment mass tolerance to 100 ppm (further filtered to <0.05 Da when building the spectral library); tryptic peptides with up to three missed cleavages were allowed. The identified peptides were processed and analyzed through the Trans-Proteomic Pipeline, validated using PeptideProphet scoring, and the PeptideProphet results were statistically refined using iProphet. All peptides were filtered at a false discovery rate (FDR) of 1% with a peptide probability cutoff of >=0.99. Raw spectral libraries were generated from all valid peptide-spectrum matches and then refined into non-redundant consensus libraries using SpectraST. For each peptide, the retention time was mapped into iRT space using a linear calibration constructed for each shotgun run. The MS assays, constructed from the six most intense transitions (b- and y-ion series; charge states 1 and 2) with a Q1 range of 400 to 1,200 m/z excluding the precursor SWATH window, were used for targeted data analysis of the SWATH maps.
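
The decoy-based FDR filtering described above can be illustrated with a simplified sketch: walk the peptide-spectrum matches from best to worst score and keep the lowest score at which the decoy-estimated FDR stays within the 1% target. This is a toy target-decoy model, not the actual PeptideProphet/iProphet statistics; the scores below are made up.

```python
def score_threshold_at_fdr(matches, fdr_target=0.01):
    """Walk matches from best to worst score and return the lowest score
    at which the decoy-estimated FDR (decoys/targets) stays within the
    target. Each match is a (score, is_decoy) pair."""
    threshold = None
    targets = decoys = 0
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_target:
            threshold = score
    return threshold

# 200 high-scoring target PSMs followed by a single decoy hit:
# one decoy among 200 targets gives an estimated FDR of 0.5%.
matches = [(float(s), False) for s in range(201, 1, -1)] + [(1.5, True)]
cutoff = score_threshold_at_fdr(matches)
```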

Data Pipeline (Proteomics)

SWATH-MS .wiff files from the data-independent acquisition were first converted to profile mzML using ProteoWizard v3.0.6002. SWATH-targeted data analysis was carried out using OpenSWATH running on an internal computing cluster. OpenSWATH uses a target-decoy scoring system, such as mProphet, to estimate the identification FDR; the best-scoring classifier, built from the sample with the most protein identifications, was used in this study. Based on the final spectral library, OpenSWATH first identified peak groups in each individual SWATH map at a global peptide FDR of 1% and then aligned them between SWATH maps, based on the clustering behavior of retention times in each run, using a non-linear alignment algorithm. For this analysis, the MS runs were realigned to each other using the LOWESS (locally weighted scatterplot smoothing) method and peak group clustering was performed using the 'LocalMST' method. Specifically, only peptide peak groups that deviated by less than 3 standard deviations from the expected retention time were reported and considered for alignment, with a maximum FDR quality of 5% (the quality cutoff at which a feature is still considered for alignment). Next, to obtain high-quality quantitative data at the protein level, proteins whose peptides were shared among multiple proteins (non-proteotypic peptides) were discarded. Data pre-processing and statistical analysis of the MS runs into quantitative data were performed using mapDIA. Fragment-level intensities were normalized to the total ion current (TIC) of each DIA MS run to remove systematic bias between runs. This was followed by outlier removal and peptide/fragment selection that preserves the major quantitative patterns across all samples for each protein. The selected fragments and peptides were used in the final model-based statistical significance analysis of protein-level differential expression between specified groups of samples.
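
The TIC normalization step can be sketched as follows. This illustrates only the idea (rescaling each run so all runs share the same total ion current), not mapDIA's implementation; the run names and intensities are made up.

```python
def tic_normalize(runs, target=None):
    """Rescale fragment intensities so every DIA run has the same total
    ion current (TIC). By default the target is the mean TIC across runs."""
    tics = {run: sum(intensities) for run, intensities in runs.items()}
    if target is None:
        target = sum(tics.values()) / len(tics)
    return {run: [x * target / tics[run] for x in intensities]
            for run, intensities in runs.items()}

# Two hypothetical runs with a 2x systematic intensity difference.
runs = {"run_A": [100.0, 300.0], "run_B": [400.0, 400.0]}
normalized = tic_normalize(runs)
```

After normalization every run sums to the same TIC, so between-run intensity comparisons are no longer dominated by overall signal differences.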

Assay Methods (RNA-seq)

RNA-seq library preparation and sequencing were performed at the Genomics High Throughput Facility at the University of California, Irvine. For each sample, total RNA was isolated using the Qiagen RNeasy mini kit. RNA samples for each Answer ALS subject (control or ALS) were entered into an electronic tracking system and processed. RNA QC was conducted using an Agilent Bioanalyzer and a NanoDrop. The primary QC metric for RNA quality is the RIN (RNA Integrity Number), which ranges from 0 to 10, with 10 being the highest quality RNA. QC data were also collected on total RNA concentration and on the 260/280 and 260/230 ratios to evaluate potential contamination. Only samples with RIN >8 were used for library preparation and sequencing. rRNA was depleted and libraries were generated using the TruSeq Stranded Total RNA library prep kit with Ribo-Zero (Illumina). RNA-seq libraries were titrated by qPCR (Kapa) and normalized according to size (Agilent Bioanalyzer 2100 High Sensitivity chip). Each cDNA library was then subjected to 100 cycles of paired-end (PE) sequencing on an Illumina NovaSeq 6000 to obtain over 50 million PE reads per sample.

Data Pipeline (RNA-seq)

Sequencing quality was first assessed using FastQC (v0.11.9). Raw reads were then adapter- and quality-trimmed, and reads shorter than 20 bases were discarded. Trimmed reads were mapped to the GRCh38 reference genome using HISAT2 (v2.2.1), the resulting BAM files were indexed using samtools (v1.10), and gene expression was quantified with featureCounts (subread v2.0.1) to produce raw read count files.
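
The featureCounts raw count files mentioned above are tab-separated tables whose first six columns are gene annotation (Geneid, Chr, Start, End, Strand, Length) and whose remaining columns hold per-sample raw counts. A minimal parser, assuming that standard layout (the gene IDs and counts below are illustrative):

```python
def read_featurecounts(lines):
    """Parse featureCounts tab-separated output into sample names and a
    {gene_id: [raw counts]} dict, skipping the leading '#' command line."""
    rows = [line.rstrip("\n").split("\t")
            for line in lines
            if line.strip() and not line.startswith("#")]
    header, data = rows[0], rows[1:]
    samples = header[6:]                       # columns 7+ are samples
    counts = {row[0]: [int(c) for c in row[6:]] for row in data}
    return samples, counts

example = [
    "# Program:featureCounts v2.0.1",
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcase1.bam\tctrl1.bam",
    "ENSG00000000003\tX\t100627109\t100639991\t-\t4535\t120\t95",
]
samples, counts = read_featurecounts(example)
```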

Answer ALS genome sequencing is performed by the New York Genome Center (NYGC). Their methods are described below.

Assay Methods (Whole Genome Sequencing)

All samples undergo rigorous quality assessment using a comprehensive set of quality measures upon completion of each step of sample processing: 1) sample receipt, 2) library preparation, 3) sequencing, 4) data analysis. Quality control measures are scrutinized by the combined efforts of our Project Management, Laboratory, Sequencing Analytics, and Bioinformatics teams. Samples that do not meet our expected quality criteria are flagged and reviewed in consultation with the investigator prior to initiation of the next step of the sample processing pipeline.

DNA Preparation

  • Extraction
    DNA is extracted from whole blood or flash-frozen post-mortem tissue using the QIAamp DNA Mini Kit (QIAamp #51104 and #51306, respectively) according to the manufacturer’s recommendations.

Sample QC

    • An automated volume check on investigator samples submitted in our 2D Matrix rack tubes is performed as part of our initial QC for each sample upon arrival in the laboratory. This information is matched with the sample volume information provided by the investigator to confirm that sample integrity was not compromised during shipment.
    • DNA quantification
      Incoming nucleic acid samples are quantified using fluorescence-based assays (PicoGreen or Qubit) to accurately determine whether sufficient material is available for library preparation and sequencing.
    • DNA integrity
      DNA sample size distributions are profiled by a Fragment Analyzer (Advanced Analytics) or BioAnalyzer (Agilent Technologies), to assess sample quality and integrity. Samples that contain degraded material and/or RNA contaminants, which could affect library preparation performance, are flagged.
      Samples that do not meet our initial criteria for QC undergo further review with our Project Management team in consultation with the investigator. Investigators are provided the opportunity to re-submit new or additional material for samples that do not pass our initial QC criteria.
    • DNA fingerprinting and quality assessment using SNP genotyping arrays
      For WGS projects, an aliquot of DNA is taken separately from the submission tube upon receipt for SNP array genotyping to determine DNA integrity and identity ahead of sequencing. The genotyping results are also checked for gross sample contamination and can reveal other forms of poor sample quality prior to sequencing.

Library Preparation

  • TruSeq PCR-Free
    Whole genome sequencing (WGS) libraries are prepared using the Illumina TruSeq DNA PCR-Free Library Preparation Kit in accordance with the manufacturer’s instructions. Briefly, 1 µg of DNA is sheared using a Covaris LE220 sonicator (adaptive focused acoustics). DNA fragments undergo end-repair, bead-based size selection, adenylation, and Illumina sequencing adapter ligation.
    We have 622 participant WGS that have been processed using the TruSeq PCR-Free method. For a list of those samples, please email terri@onpointsci.com. You must be approved for data access and have a DUA on file for access.
  • TruSeq Nano
    Whole genome sequencing (WGS) libraries are prepared using the Illumina TruSeq Nano DNA Library Preparation Kit in accordance with the manufacturer’s instructions. Briefly, 100 ng of DNA is sheared using the Covaris LE220 sonicator (adaptive focused acoustics). DNA fragments undergo end-repair, bead-based size selection, adenylation, and Illumina sequencing adapter ligation.
    Ligated DNA libraries are enriched with PCR amplification (using 8 cycles).
    We have 244 participant WGS that have been processed using the TruSeq Nano method. For a list of those samples, please email terri@onpointsci.com. You must be approved for data access and have a DUA on file for access.

Library QC

  • Library Quantification
    PicoGreen is used to measure the total amount of DNA in the prepared library. Quantitative PCR (qPCR), using oligos complementary to Illumina’s TruSeq adapters, measures the amount of adapter-ligated DNA (ligation efficiency) that is compatible with sequencing.
  • Library Size distribution
    Size distribution profiles of the final libraries are assessed using the Fragment analyzer/Bioanalyzer. Libraries that fall outside of the expected size range and/or contain adapter dimer contaminants are flagged.

HiSeq X

Sequencing is performed on an Illumina HiSeq X sequencer (v2.5 chemistry) with paired-end 2 x 150 bp reads.

NovaSeq

Sequencing is performed on an Illumina NovaSeq 6000 sequencer with paired-end 2 x 150 bp reads.

Sequencing QC
All sequencing runs are reviewed for quality by our Sequencing Analytics team. Sequencing runs that do not pass our quality criteria for each of the metrics below are flagged and reviewed in consultation with Illumina Technical Support.

  • % Pass Filter (PF) clusters
    Library cluster efficiency should fall within the optimal range expected for instrument and flowcell type. PF percentages outside the expected range indicate either incorrect loading concentrations or problems with a particular sequencing run.
  • % sample de-multiplexed
    All PF reads within a single lane of a flowcell are assigned to a specific barcoded library based on the indexed read. The percentage of reads within a lane that are assigned to each sample after de-multiplexing is assessed to confirm expected sample distribution within the sample pool.
  • # of PF reads/sample
    The total number of PF reads per sample must meet the expected number of reads for a given sequencing application and analysis type, as discussed upfront with the investigator. Samples that do not meet the expected number of reads are queued for additional sequencing.
  • % bases >Q30
    To ensure the highest quality sequencing data, FASTQ data in which at least 75% (HiSeq X) or 80% (HiSeq 2500) of bases have an Illumina Quality score >30 (a Phred like score indicating an expected 99.9% base call accuracy) are selected and used in downstream analysis.
  • Quality by cycle
    Assessment of the quality score by cycle is used to verify that the accuracy of called bases is maintained across the entire length of the sequencing read.
  • GC content
    GC content is reflective of both sample and library type. GC content can vary between organisms, and can be an indicator of poor sequence quality attributable to biases introduced during library preparation.
  • K-mer content/adapter contamination
    FastQC output is examined to identify over-represented sequences in the sequencing data, including k-mers and reads that align to the Illumina adapter sequences, both of which could indicate poor library quality and result in uneven base composition.
  • Alignment metrics – PF reads, PF aligned, PF unique aligned
    The number of PF reads and the fraction of these that are aligned to the reference genome (PF aligned) and map to unique sequences (PF unique aligned) are reviewed. The number of PF uniquely aligned reads has a direct impact on the mean coverage obtained for a sequenced library. Libraries that yield a lower proportion of uniquely aligned PF reads could indicate poor library or sequencing run quality.
  • Insert size metrics – mean, median and distribution
    The mean, median, and distribution of the insert size of paired-end libraries (number of bases between the 5’ and 3’ adapters) are examined. Smaller insert sizes lead to overlapping reads and/or sequencing into the adapter sequences, limiting the number of usable bases for mapping and downstream analysis.
  • % Duplication
    PCR amplification during library preparation (and flow cell clustering) can give rise to the sequencing of duplicate reads. Libraries that produce a higher than expected number of duplicate reads yield reduced coverage (DNA) or higher potential bias (RNA).
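
As an illustration of the "% bases >Q30" metric above, the following sketch computes the percentage of base calls exceeding Q30 from FASTQ quality strings. It assumes Phred+33 (Illumina 1.8+) quality encoding and is an illustration, not NYGC's actual QC tooling.

```python
def pct_bases_over_q30(quality_strings):
    """Percentage of base calls with Phred quality > 30, assuming
    Phred+33 encoded FASTQ quality strings."""
    total = passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - 33 > 30:   # decode ASCII character to Phred score
                passing += 1
    return 100.0 * passing / total

# 'I' encodes Q40 (passes); '#' encodes Q2 (fails).
pct = pct_bases_over_q30(["IIII", "II##"])
```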

WGS coverage metrics
To ensure complete, unbiased coverage of the genome, a comprehensive panel of coverage metrics is assessed, including mean genome coverage, % of genome covered (at coverage levels 1x, 10x, 20x, etc), and coverage by chromosome. Coverage calculations are based on numbers of uniquely mapped PF reads (excluding duplicate reads). The mean coverage per sample must meet the expected target coverage for a given sequencing application and analysis type as per the project definition. Samples that do not meet the expected mean coverage are queued for additional sequencing.
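
The coverage metrics above can be computed from a per-base depth vector, such as the depth column reported by `samtools depth -a`. A minimal sketch (not NYGC's pipeline code), with an illustrative depth vector:

```python
def coverage_metrics(depths, thresholds=(1, 10, 20)):
    """Mean coverage and percent of positions covered at >= each depth
    threshold, from a per-base depth vector (duplicates assumed to have
    been excluded upstream)."""
    n = len(depths)
    mean_cov = sum(depths) / n
    pct_at = {t: 100.0 * sum(1 for d in depths if d >= t) / n
              for t in thresholds}
    return mean_cov, pct_at

# Four hypothetical genome positions with depths 0, 10, 20, and 30.
mean_cov, pct_at = coverage_metrics([0, 10, 20, 30])
```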

Recalibrated base quality
Base Quality Score Recalibration (BQSR) is part of the Genome Analysis Toolkit (GATK) Best Practices for preprocessing aligned sequencing reads. It adjusts the quality scores of all reads to more accurately reflect the probability of a mismatch to the reference genome, which improves variant calling. BQSR quality scores are graphed alongside the original Illumina quality scores for each sample and project to assess the effective quality of the sequencing data generated.

Sample contamination
Contamination checks are performed on the sequencing data to verify that the sequenced library originates from a single individual/sample. Samples in which a detectable level of contamination is identified are flagged for review. This information is used to identify sample swaps and to confirm sample identity (in combination with genotyping data).

Gender concordance
Gender is confirmed by comparing the gender information obtained from the sequencing and SNP array genotyping data to the gender specified in the sample submission form.

Concordance to SNP array
The identity of the sequenced sample is confirmed by evaluating the concordance between sequencing and genotyping data.

Variant evaluation metrics
For SNVs and indels, we use GATK VariantEval metrics to assess the variants called per sample by: 1) the ratio of novel to known variants, 2) the Ti/Tv ratio (for SNVs), and 3) the heterozygous-to-homozygous ratio.
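
The Ti/Tv metric above can be computed directly from the ref/alt alleles of the called SNVs; GATK VariantEval does this internally, but the calculation is simple enough to sketch (the variant list below is illustrative):

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions;
# all other ref/alt combinations are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snvs):
    """Transition/transversion ratio from (ref, alt) allele pairs.
    Whole-genome callsets typically fall near ~2.0-2.1."""
    ti = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv

snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
ratio = titv_ratio(snvs)
```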

Data Pipeline (Whole Genome Sequencing)
Sequence data were processed on an NYGC automated pipeline. Paired-end 150 bp reads were aligned to the GRCh38 human reference using the Burrows-Wheeler Aligner (BWA-MEM v0.7.8) and processed using the GATK best-practices workflow, which includes marking of duplicate reads with Picard tools (v1.83, http://picard.sourceforge.net), local realignment around indels, and base quality score recalibration (BQSR) via the Genome Analysis Toolkit (GATK v3.8.0). Joint genotype calling was run using Sentieon v201911 (https://www.sentieon.com), and variant quality score recalibration (VQSR) was done using GATK v3.8.0 (truth sensitivity level = 99.0).
The annotation pipeline was customized to incorporate elements from ANNOVAR (v2018Apr16), from which a report was generated that included the genotypes of all samples. To annotate genes and exonic variants of clinical significance, we incorporated the Clinical Genomic Database (CGD), Online Mendelian Inheritance in Man (OMIM), ClinVar, and the genes listed in the American College of Medical Genetics and Genomics (ACMG) database. InterVar, which is based on the ACMG/AMP standards and guidelines for the interpretation of variants, was also incorporated; this tool uses 18 criteria to ascribe clinical significance and classifies variants using a five-tier system. For each variant, we also incorporated functional in silico predictions from nine programs, including databases such as SIFT, PolyPhen2, and MutationTaster, and those described in Li et al., 2013. Additional databases were included that assess the variant tolerance of each gene using the Residual Variation Intolerance Score (RVIS) and the Gene Damage Index (GDI). For variants in genes that are highly expressed in the brain, we incorporated data from the Human Protein Atlas (http://www.proteinatlas.org) and expression data for the cortex and spinal cord from the GTEx portal (https://gtexportal.org/home/). Frequency information on all known variants was incorporated from three databases: ExAC (now gnomAD), the NHLBI Exome Sequencing Project (ESP), and the 1000 Genomes Project.

Expansion Hunter Tool
The C9orf72 and ataxin-2 (ATXN2) repeat lengths were estimated using Expansion Hunter (v2.5.5), a tool for estimating repeat sizes within the genome. The tool estimates repeat sizes by performing a targeted search through a BAM/CRAM file for reads that span, flank, or are fully contained in each repeat. For more information on this tool, please refer to the Expansion Hunter article.

More Information
Expansion Hunter version used: EH v2.5.5.
The reference file: “GRCh38_full_analysis_set_plus_decoy_hla.fa”, released on the 1000 Genomes EBI FTP site.

The EH variant catalog is available upon request from NYGC.