Assay and Pipeline Information

The Assay and Pipeline Information page provides sample processing and pipeline analysis information. Below you can find the sample processing protocols for each omics assay and the series of computational and bioinformatics analyses involved in generating the Answer ALS datasets.

Assay Methods (ATAC-seq)

ATAC-seq library preparation was conducted by Diagenode Diagnostics (https://www.diagenodediagnostics.com) using their ATAC-seq Profiling service (Diagenode Cat# G02060000). For each sample, library preparation was performed on 50,000 cells. Nuclei were isolated using optimized lysis conditions: RSB-ATAC buffer containing 0.1% Igepal, 0.1% Tween 20, and 0.01% digitonin. Nuclei were lysed and chromatin was tagmented using the Tn5 enzyme (Illumina, ref# 15027865). Tagmented DNA was then purified using DiaPure columns (Diagenode, C03040001). Libraries were prepared from the purified tagmented DNA (starting amount 10 µl) using NEBNext High-Fidelity 2X Master Mix (NEB Cat# M0541). Libraries were purified and double size-selected using Agencourt AMPure XP beads (Beckman Coulter) and quantified using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Q32854). Finally, fragment size was analyzed with the High Sensitivity NGS Fragment Analysis Kit (DNF-474) on a Fragment Analyzer (Advanced Analytical Technologies, Inc.). Libraries were pooled and sequenced on an Illumina NovaSeq 6000 with 50 bp paired-end reads.

Data Pipeline (ATAC-seq)

Sequencing quality was assessed using FastQC, and reads were aligned to the GRCh38 genome build using Bowtie2. Open chromatin regions were identified separately for each sample using the peak-calling software MACS2. ATAC-seq data quality was evaluated according to ENCODE guidelines. The fraction of properly mapped and paired reads was >95% in all samples. The fragment size distribution revealed a clear nucleosome-free region and regularly spaced peaks corresponding to mono- and multi-nucleosomal fragments for each sample. All samples also showed promoter and enhancer enrichment. Samples with high (>2 SD) mitochondrial DNA contamination were excluded (average mtDNA fraction: 0.07 ± 0.01).
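
The mitochondrial-contamination filter described above (excluding samples more than 2 SD above the cohort mean) can be sketched as follows. The sample IDs and chrM read fractions are hypothetical; this is an illustration of the exclusion rule, not the pipeline's actual code.

```python
import statistics

def flag_high_mtdna(mtdna_fractions, n_sd=2.0):
    """Return sample IDs whose mitochondrial read fraction exceeds
    the cohort mean by more than n_sd standard deviations."""
    values = list(mtdna_fractions.values())
    cutoff = statistics.mean(values) + n_sd * statistics.stdev(values)
    return {sample for sample, frac in mtdna_fractions.items() if frac > cutoff}

# Hypothetical per-sample chrM read fractions.
fractions = {f"S{i}": 0.07 for i in range(1, 9)}
fractions["S9"] = 0.06
fractions["S10"] = 0.50   # heavily contaminated sample
excluded = flag_high_mtdna(fractions)
```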

Assay Methods (Proteomics)

Upon receipt, the iPSC pellets were lyophilized and stored at -80 °C. A small portion of each pellet was transferred to a 1.5 mL microfuge tube and urea lysis buffer (8 M urea, 75 mM NaCl, 50 mM Tris-HCl, pH 8.3) was added at a ratio of 1:2 (pellet to buffer). The resuspended pellets were sonicated at 40% amplitude until completely homogenized. Following homogenization, the samples were centrifuged for 5 minutes at 5000 g at 4 °C and the supernatant was transferred to a new tube. A protein BCA assay was performed per the manufacturer's instructions (Pierce BCA Protein Assay Kit, Catalog number: 23225). The volume corresponding to 100 µg of protein was taken from the cell lysate. Protein was reduced with 10 mM (final concentration) TCEP (tris(2-carboxyethyl)phosphine) at 56 °C for 30 minutes, followed by alkylation with 12 mM (final concentration) iodoacetamide (IAA) at room temperature for 50 minutes in the dark. The urea concentration was brought down to 1.6 M by the addition of 50 mM Tris-HCl, pH 8.3, and trypsin (1:50 enzyme-to-substrate ratio) was then used to digest the proteins. The enzyme-substrate mixture was incubated overnight at 37 °C on a shaker at 1400 rpm. The following day, the reaction was quenched by the addition of 10% formic acid (1% final concentration) and the pH was confirmed to be between 2 and 3. The sample was then centrifuged at 5000 g at room temperature and the supernatant was transferred to a new tube. Desalting was performed using an Oasis HLB 96-well plate following the manufacturer's instructions (Waters, SKU: 186000309). Desalted samples were dried in a vacuum concentrator and the dried peptides were reconstituted in 0.1% formic acid in water. A peptide BCA assay was performed to determine the peptide concentration of each sample.
Profile-mode .wiff files from the shotgun data acquisition were converted to mzML format using the AB Sciex Data Converter (in ProteinPilot mode) and then re-converted to mzXML format using ProteoWizard for peak list generation. The MS2 spectra were queried against the reviewed canonical Swiss-Prot human proteome database appended with the iRT protein sequence and shuffled sequence decoys. All data were searched using X!Tandem and Comet. The search parameters included a static modification of carbamidomethyl (C) and a variable modification of oxidation (M). The parent mass tolerance was set to 50 ppm and the monoisotopic fragment mass tolerance to 100 ppm (further filtered to <0.05 Da when building the spectral library); tryptic peptides with up to three missed cleavages were allowed. The identified peptides were processed and analyzed through the Trans-Proteomic Pipeline, validated using PeptideProphet scoring, and the PeptideProphet results were statistically refined using iProphet. All peptides were filtered at a false discovery rate (FDR) of 1% with a peptide probability cutoff of >=0.99. Raw spectral libraries were generated from all valid peptide-spectrum matches and then refined into non-redundant consensus libraries using SpectraST. For each peptide, the retention time was mapped into iRT space using a linear calibration constructed for each shotgun run. The MS assays, constructed from the six most intense transitions (b- and y-ion series; charge states 1 and 2) with a Q1 range of 400 to 1,200 m/z excluding the precursor SWATH window, were used for targeted data analysis of the SWATH maps.
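
The decoy-based FDR filtering described above can be illustrated with a simplified sketch: walk the peptide-spectrum matches from best to worst score and keep the lowest score at which the decoy-estimated FDR stays within the 1% target. This is a toy target-decoy model, not the actual PeptideProphet/iProphet statistics; the scores below are made up.

```python
def score_threshold_at_fdr(matches, fdr_target=0.01):
    """Walk matches from best to worst score and return the lowest score
    at which the decoy-estimated FDR (decoys/targets) stays within the
    target. Each match is a (score, is_decoy) pair."""
    threshold = None
    targets = decoys = 0
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_target:
            threshold = score
    return threshold

# 200 high-scoring target PSMs followed by a single decoy hit:
# one decoy among 200 targets gives an estimated FDR of 0.5%.
matches = [(float(s), False) for s in range(201, 1, -1)] + [(1.5, True)]
cutoff = score_threshold_at_fdr(matches)
```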

Data Pipeline (Proteomics)

SWATH-MS .wiff files from the data-independent acquisition were first converted to profile mzML using ProteoWizard v3.0.6002. SWATH-targeted data analysis was carried out using OpenSWATH running on an internal computing cluster. OpenSWATH uses a target-decoy scoring system, such as mProphet, to estimate the identification FDR; the best-scoring classifier, built from the sample with the most protein identifications, was used in this study. Based on the final spectral library, OpenSWATH first identified peak groups in each individual SWATH map at a global peptide FDR of 1% and then aligned them between SWATH maps, based on the clustering behavior of retention times in each run, using a non-linear alignment algorithm. For this analysis, the MS runs were realigned to each other using the LOWESS (locally weighted scatterplot smoothing) method and peak group clustering was performed using the 'LocalMST' method. Specifically, only peptide peak groups that deviated by less than 3 standard deviations from the expected retention time were reported and considered for alignment, with a maximum FDR quality of 5% (the quality cutoff at which a feature is still considered for alignment). Next, to obtain high-quality quantitative data at the protein level, proteins whose peptides were shared among multiple proteins (non-proteotypic peptides) were discarded. Data pre-processing and statistical analysis of the MS runs into quantitative data were performed using mapDIA. Fragment-level intensities were normalized to the total ion current (TIC) of each DIA MS run to remove systematic bias between runs. This was followed by outlier removal and peptide/fragment selection that preserves the major quantitative patterns across all samples for each protein. The selected fragments and peptides were used in the final model-based statistical significance analysis of protein-level differential expression between specified groups of samples.
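
The TIC normalization step can be sketched as follows. This illustrates only the idea (rescaling each run so all runs share the same total ion current), not mapDIA's implementation; the run names and intensities are made up.

```python
def tic_normalize(runs, target=None):
    """Rescale fragment intensities so every DIA run has the same total
    ion current (TIC). By default the target is the mean TIC across runs."""
    tics = {run: sum(intensities) for run, intensities in runs.items()}
    if target is None:
        target = sum(tics.values()) / len(tics)
    return {run: [x * target / tics[run] for x in intensities]
            for run, intensities in runs.items()}

# Two hypothetical runs with a 2x systematic intensity difference.
runs = {"run_A": [100.0, 300.0], "run_B": [400.0, 400.0]}
normalized = tic_normalize(runs)
```

After normalization every run sums to the same TIC, so between-run intensity comparisons are no longer dominated by overall signal differences.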

Assay Methods (RNA-seq)

RNA-seq library preparation and sequencing were performed at the Genomics High Throughput Facility at the University of California, Irvine. For each sample, total RNA was isolated using the Qiagen RNeasy mini kit. RNA samples for each Answer ALS subject (control or ALS) were entered into an electronic tracking system and processed. RNA QC was conducted using an Agilent Bioanalyzer and a NanoDrop. The primary QC metric for RNA quality is the RIN (RNA Integrity Number), which ranges from 0 to 10, with 10 being the highest quality RNA. QC data were also collected on total RNA concentration and on the 260/280 and 260/230 ratios to evaluate potential contamination. Only samples with RIN >8 were used for library preparation and sequencing. rRNA was depleted and libraries were generated using the TruSeq Stranded Total RNA library prep kit with Ribo-Zero (Illumina). RNA-seq libraries were titrated by qPCR (Kapa) and normalized according to size (Agilent Bioanalyzer 2100 High Sensitivity chip). Each cDNA library was then subjected to 100 cycles of paired-end (PE) sequencing on an Illumina NovaSeq 6000 to obtain over 50 million PE reads per sample.

Data Pipeline (RNA-seq)

Sequencing quality was first assessed using FastQC (v0.11.9). Raw reads were then adapter- and quality-trimmed, and reads shorter than 20 bases were discarded. Trimmed reads were mapped to the GRCh38 reference genome using HISAT2 (v2.2.1), the resulting BAM files were indexed using samtools (v1.10), and gene expression was quantified with featureCounts (subread v2.0.1) to produce raw read count files.
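
The featureCounts raw count files mentioned above are tab-separated tables whose first six columns are gene annotation (Geneid, Chr, Start, End, Strand, Length) and whose remaining columns hold per-sample raw counts. A minimal parser, assuming that standard layout (the gene IDs and counts below are illustrative):

```python
def read_featurecounts(lines):
    """Parse featureCounts tab-separated output into sample names and a
    {gene_id: [raw counts]} dict, skipping the leading '#' command line."""
    rows = [line.rstrip("\n").split("\t")
            for line in lines
            if line.strip() and not line.startswith("#")]
    header, data = rows[0], rows[1:]
    samples = header[6:]                       # columns 7+ are samples
    counts = {row[0]: [int(c) for c in row[6:]] for row in data}
    return samples, counts

example = [
    "# Program:featureCounts v2.0.1",
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcase1.bam\tctrl1.bam",
    "ENSG00000000003\tX\t100627109\t100639991\t-\t4535\t120\t95",
]
samples, counts = read_featurecounts(example)
```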

Answer ALS genome sequencing is performed by the New York Genome Center (NYGC). Their methods are described below.

Assay Methods (Whole Genome Sequencing)

All samples undergo rigorous quality assessment using a comprehensive set of quality measures upon completion of each step of sample processing: 1) sample receipt, 2) library preparation, 3) sequencing, 4) data analysis. Quality control measures are scrutinized by the combined efforts of our Project Management, Laboratory, Sequencing Analytics, and Bioinformatics teams. Samples that do not meet our expected quality criteria are flagged and reviewed in consultation with the investigator prior to initiation of the next step of the sample processing pipeline.

DNA Preparation

  • Extraction
    DNA is extracted from whole blood or flash-frozen post-mortem tissue using the QIAamp DNA Mini Kit (QIAamp #51104 and #51306, respectively) according to the manufacturer’s recommendations.

Sample QC

    • An automated volume check on investigator samples submitted in our 2D Matrix rack tubes is performed as part of our initial QC for each sample upon arrival in the laboratory. This information is matched with the sample volume information provided by the investigator to confirm that sample integrity was not compromised during shipment.
    • DNA quantification
      Incoming nucleic acid samples are quantified using fluorescence-based assays (PicoGreen or Qubit) to accurately determine whether sufficient material is available for library preparation and sequencing.
    • DNA integrity
      DNA sample size distributions are profiled by a Fragment Analyzer (Advanced Analytics) or BioAnalyzer (Agilent Technologies), to assess sample quality and integrity. Samples that contain degraded material and/or RNA contaminants, which could affect library preparation performance, are flagged.
      Samples that do not meet our initial criteria for QC undergo further review with our Project Management team in consultation with the investigator. Investigators are provided the opportunity to re-submit new or additional material for samples that do not pass our initial QC criteria.
    • DNA fingerprinting and quality assessment using SNP genotyping arrays
      For WGS projects, an aliquot of DNA is taken separately from the submission tube upon receipt for SNP array genotyping to determine DNA integrity and identity ahead of sequencing. The genotyping results are also checked for gross sample contamination and can reveal other forms of poor sample quality prior to sequencing.

Library Preparation

  • TruSeq PCR-Free
    Whole genome sequencing (WGS) libraries are prepared using the Illumina TruSeq DNA PCR-Free Library Preparation Kit in accordance with the manufacturer’s instructions. Briefly, 1 µg of DNA is sheared using a Covaris LE220 sonicator (adaptive focused acoustics). DNA fragments undergo end-repair, bead-based size selection, adenylation, and Illumina sequencing adapter ligation.
    We have 622 participant WGS that have been processed using the TruSeq PCR-Free method. For a list of those samples, please email terri@onpointsci.com. You must be approved for data access and have a DUA on file for access.
  • TruSeq Nano
    Whole genome sequencing (WGS) libraries are prepared using the Illumina TruSeq Nano DNA Library Preparation Kit in accordance with the manufacturer’s instructions. Briefly, 100 ng of DNA is sheared using the Covaris LE220 sonicator (adaptive focused acoustics). DNA fragments undergo end-repair, bead-based size selection, adenylation, and Illumina sequencing adapter ligation.
    Ligated DNA libraries are enriched with PCR amplification (using 8 cycles).
    We have 244 participant WGS that have been processed using the TruSeq Nano method. For a list of those samples, please email terri@onpointsci.com. You must be approved for data access and have a DUA on file for access.

Library QC

  • Library Quantification
    PicoGreen is used to measure the total amount of DNA in the prepared library. Quantitative PCR (qPCR), using oligos complementary to Illumina’s TruSeq adapters, measures the amount of adapter-ligated DNA (ligation efficiency) that is compatible with sequencing.
  • Library Size distribution
    Size distribution profiles of the final libraries are assessed using the Fragment analyzer/Bioanalyzer. Libraries that fall outside of the expected size range and/or contain adapter dimer contaminants are flagged.

HiSeq X

Sequencing is performed on an Illumina HiSeq X sequencer (v2.5 chemistry) with paired-end 2 x 150 bp reads.

NovaSeq

Sequencing is performed on an Illumina NovaSeq 6000 sequencer with paired-end 2 x 150 bp reads.

Sequencing QC
All sequencing runs are reviewed for quality by our Sequencing Analytics team. Sequencing runs that do not pass our quality criteria for each of the metrics below are flagged and reviewed in consultation with Illumina Technical Support.

  • % Pass Filter (PF) clusters
    Library cluster efficiency should fall within the optimal range expected for instrument and flowcell type. PF percentages outside the expected range indicate either incorrect loading concentrations or problems with a particular sequencing run.
  • % sample de-multiplexed
    All PF reads within a single lane of a flowcell are assigned to a specific barcoded library based on the indexed read. The percentage of reads within a lane that are assigned to each sample after de-multiplexing is assessed to confirm expected sample distribution within the sample pool.
  • # of PF reads/sample
    The total number of PF reads per sample must meet the expected number of reads for a given sequencing application and analysis type, as discussed upfront with the investigator. Samples that do not meet the expected number of reads are queued for additional sequencing.
  • % bases >Q30
    To ensure the highest quality sequencing data, FASTQ data in which at least 75% (HiSeq X) or 80% (HiSeq 2500) of bases have an Illumina Quality score >30 (a Phred like score indicating an expected 99.9% base call accuracy) are selected and used in downstream analysis.
  • Quality by cycle
    Assessment of the quality score by cycle is used to verify that the accuracy of called bases is maintained across the entire length of the sequencing read.
  • GC content
    GC content is reflective of both sample and library type. GC content can vary between organisms, and can be an indicator of poor sequence quality attributable to biases introduced during library preparation.
  • K-mer content/adapter contamination
    FastQC output is examined to identify over-represented sequences in the sequencing data, including k-mers and reads that align to the Illumina adapter sequences, both of which could indicate poor library quality and result in uneven base composition.
  • Alignment metrics – PF reads, PF aligned, PF unique aligned
    The number of PF reads and the fraction of these that are aligned to the reference genome (PF aligned) and map to unique sequences (PF unique aligned) are reviewed. The number of PF uniquely aligned reads has a direct impact on the mean coverage obtained for a sequenced library. Libraries that yield a lower proportion of uniquely aligned PF reads could indicate poor library or sequencing run quality.
  • Insert size metrics – mean, median and distribution
    The mean, median, and distribution of the insert size of paired-end libraries (number of bases between the 5’ and 3’ adapters) are examined. Smaller insert sizes lead to overlapping reads and/or sequencing into the adapter sequences, limiting the number of usable bases for mapping and downstream analysis.
  • % Duplication
    PCR amplification during library preparation (and flow cell clustering) can give rise to the sequencing of duplicate reads. Libraries that produce a higher than expected number of duplicate reads yield reduced coverage (DNA) or higher potential bias (RNA).
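
As an illustration of the "% bases >Q30" metric above, the following sketch computes the percentage of base calls exceeding Q30 from FASTQ quality strings. It assumes Phred+33 (Illumina 1.8+) quality encoding and is an illustration, not NYGC's actual QC tooling.

```python
def pct_bases_over_q30(quality_strings):
    """Percentage of base calls with Phred quality > 30, assuming
    Phred+33 encoded FASTQ quality strings."""
    total = passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - 33 > 30:   # decode ASCII character to Phred score
                passing += 1
    return 100.0 * passing / total

# 'I' encodes Q40 (passes); '#' encodes Q2 (fails).
pct = pct_bases_over_q30(["IIII", "II##"])
```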

WGS coverage metrics
To ensure complete, unbiased coverage of the genome, a comprehensive panel of coverage metrics is assessed, including mean genome coverage, % of genome covered (at coverage levels 1x, 10x, 20x, etc), and coverage by chromosome. Coverage calculations are based on numbers of uniquely mapped PF reads (excluding duplicate reads). The mean coverage per sample must meet the expected target coverage for a given sequencing application and analysis type as per the project definition. Samples that do not meet the expected mean coverage are queued for additional sequencing.
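
The coverage metrics above can be computed from a per-base depth vector, such as the depth column reported by `samtools depth -a`. A minimal sketch (not NYGC's pipeline code), with an illustrative depth vector:

```python
def coverage_metrics(depths, thresholds=(1, 10, 20)):
    """Mean coverage and percent of positions covered at >= each depth
    threshold, from a per-base depth vector (duplicates assumed to have
    been excluded upstream)."""
    n = len(depths)
    mean_cov = sum(depths) / n
    pct_at = {t: 100.0 * sum(1 for d in depths if d >= t) / n
              for t in thresholds}
    return mean_cov, pct_at

# Four hypothetical genome positions with depths 0, 10, 20, and 30.
mean_cov, pct_at = coverage_metrics([0, 10, 20, 30])
```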

Recalibrated base quality
Base Quality Score Recalibration (BQSR) is part of the Genome Analysis Toolkit (GATK) Best Practices for preprocessing aligned sequencing reads. It adjusts the quality scores of all reads to more accurately reflect the probability of a mismatch to the reference genome, which improves variant calling. BQSR quality scores are graphed alongside the original Illumina quality scores for each sample and project to assess the effective quality of the sequencing data generated.

Sample contamination
Contamination checks are performed on the sequencing data to verify that the sequenced library originates from a single individual/sample. Samples in which a detectable level of contamination is identified are flagged for review. This information is used to identify sample swaps and to confirm sample identity (in combination with genotyping data).

Gender concordance
Gender is confirmed by comparing the gender information obtained from the sequencing and SNP array genotyping data to the gender specified in the sample submission form.

Concordance to SNP array
The identity of the sequenced sample is confirmed by evaluating the concordance between sequencing and genotyping data.

Variant evaluation metrics
For SNVs and indels, we use GATK VariantEval metrics to assess the variants called per sample by: 1) the ratio of novel to known variants, 2) the Ti/Tv ratio (for SNVs), and 3) the heterozygous-to-homozygous ratio.
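
The Ti/Tv metric above can be computed directly from the ref/alt alleles of the called SNVs; GATK VariantEval does this internally, but the calculation is simple enough to sketch (the variant list below is illustrative):

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions;
# all other ref/alt combinations are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snvs):
    """Transition/transversion ratio from (ref, alt) allele pairs.
    Whole-genome callsets typically fall near ~2.0-2.1."""
    ti = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv

snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
ratio = titv_ratio(snvs)
```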

Data Pipeline (Whole Genome Sequencing)
Sequence data were processed on an NYGC automated pipeline. Paired-end 150 bp reads were aligned to the GRCh38 human reference using the Burrows-Wheeler Aligner (BWA-MEM v0.7.8) and processed using the GATK best-practices workflow, which includes marking of duplicate reads with Picard tools (v1.83, http://picard.sourceforge.net), local realignment around indels, and base quality score recalibration (BQSR) via the Genome Analysis Toolkit (GATK v3.8.0). Joint genotype calling was run using Sentieon v201911 (https://www.sentieon.com), and variant quality score recalibration (VQSR) was done using GATK v3.8.0 (truth sensitivity level = 99.0).
The annotation pipeline was customized to incorporate elements from ANNOVAR (v2018Apr16), from which a report was generated that included the genotypes of all samples. To annotate genes and exonic variants of clinical significance, we incorporated the Clinical Genomic Database (CGD), Online Mendelian Inheritance in Man (OMIM), ClinVar, and the genes listed in the American College of Medical Genetics and Genomics (ACMG) database. InterVar, which is based on the ACMG/AMP standards and guidelines for the interpretation of variants, was also incorporated; this tool uses 18 criteria to ascribe clinical significance and classifies variants using a five-tier system. For each variant, we also incorporated functional in silico predictions from nine programs, including databases such as SIFT, PolyPhen2, and MutationTaster, and those described in Li et al., 2013. Additional databases were included that assess the variant tolerance of each gene using the Residual Variation Intolerance Score (RVIS) and the Gene Damage Index (GDI). For variants in genes that are highly expressed in the brain, we incorporated data from the Human Protein Atlas (http://www.proteinatlas.org) and expression data for the cortex and spinal cord from the GTEx portal (https://gtexportal.org/home/). Frequency information on all known variants was incorporated from three databases: ExAC (now gnomAD), the NHLBI Exome Sequencing Project (ESP), and the 1000 Genomes Project.

Expansion Hunter Tool
The C9orf72 and ataxin-2 (ATXN2) repeat lengths were estimated using Expansion Hunter (v2.5.5), a tool for estimating repeat sizes within the genome. The tool estimates repeat sizes by performing a targeted search through a BAM/CRAM file for reads that span, flank, or are fully contained in each repeat. For more information on this tool, please refer to the Expansion Hunter article.

More Information
Expansion Hunter version used: EH v2.5.5.
The reference file: “GRCh38_full_analysis_set_plus_decoy_hla.fa”, released on the 1000 Genomes EBI FTP site.

The EH variant catalog is available upon request from NYGC.