Our groups have been at the forefront of designing integrated learning strategies for scRNA-seq, including published and impactful methods to infer spatial localization based on gene expression, and to quantify and disentangle confounding sources of variation. Most recently, the Satija and Marioni labs published the first analytical strategies to learn biological manifolds jointly across scRNA-seq datasets generated from different individuals, technologies, and species. While developed independently, these methods leverage distinct yet complementary machine learning approaches based on 'manifold alignment' (including canonical correlation analysis (CCA), non-linear warping algorithms, cosine normalization, factor analysis and mutual nearest neighbor identification) to align subpopulations that are shared across datasets (Haghverdi et al.; Butler and Satija). Notably, combining datasets significantly boosts statistical power compared to independent analyses, and can lead to the discovery of new subpopulations and differentially expressed genes. The Stegle lab has also recently introduced a novel statistical framework based on scalable latent-factor variable models to decompose technical sources of variation from biological heterogeneity, a crucial step for data integration (Buettner et al.).
While promising, these methods represent only initial solutions to key challenges. Improved performance will require an exchange of ideas, collaborative optimization, and rigorous benchmarking on both simulated and diverse real datasets. Rather than pursuing these goals independently, we see this RFA as an opportunity to work towards them together, and we welcome additional collaboration with other groups across the HCA community interested in these challenges.
4. Proposed Work and Deliverables
Aim 1 : Collaborative development of methods, metrics, and benchmarks for scRNA-seq integration.
We will aim to collaboratively develop best-in-class methods for aligning distinct datasets, building upon our previous work. First, we will combine innovative aspects of both approaches to improve performance, for example, utilizing CCA and factor models in combination with cosine normalization to help define mutual nearest neighbors across datasets. Second, we will extend the methods to perform true multiple alignment across all input datasets, instead of pairwise alignment to a reference. Here, we will exploit ideas originating from multiple sequence alignment, designing and optimizing an objective function that aligns shared cell phenotypes across all datasets by identifying different aspects of common variation, aided by principles borrowed from multi-CCA and agglomerative clustering. Third, existing metrics designed to evaluate batch effect correction are poorly suited to heterogeneous scRNA-seq data. Consequently, we will develop entropy-based metrics that reward a shared manifold across all datasets, without sacrificing structure (blurring cell types) during the alignment. These metrics will provide the necessary means for objectively comparing and assessing alternative methods in different contexts.
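As an illustration of one building block named above, mutual nearest neighbor identification after cosine normalization can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the published implementation: the function names, the brute-force similarity search, and the default `k` are all illustrative choices of ours.

```python
import numpy as np


def cosine_normalize(X):
    """Scale each cell (row of X) to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def mutual_nearest_neighbors(A, B, k=3):
    """Return (i, j) pairs where cell i of batch A and cell j of batch B
    are each among the other's k nearest neighbors (cosine similarity).

    A and B are cells x genes arrays measured on the same genes.
    Brute force, so only suitable for small illustrative inputs.
    """
    sim = cosine_normalize(A) @ cosine_normalize(B).T  # cells_A x cells_B
    n_a, n_b = sim.shape
    k_ab = min(k, n_b)  # neighbors in B for each cell of A
    k_ba = min(k, n_a)  # neighbors in A for each cell of B

    nn_ab = np.argsort(-sim, axis=1)[:, :k_ab]
    nn_ba = np.argsort(-sim.T, axis=1)[:, :k_ba]

    a_to_b = np.zeros((n_a, n_b), dtype=bool)
    a_to_b[np.arange(n_a)[:, None], nn_ab] = True
    b_to_a = np.zeros((n_a, n_b), dtype=bool)
    b_to_a[nn_ba, np.arange(n_b)[:, None]] = True

    # Keep only pairs that are neighbors in both directions.
    return [(int(i), int(j)) for i, j in np.argwhere(a_to_b & b_to_a)]
```

Because each cell is scaled to unit norm before the search, a batch that differs only by a global scaling of expression values yields the same pairs as the unscaled batch.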
To assess performance, we will use both synthetic and real datasets. Synthetic data will be generated using Splatter (Zappia et al.), and will be designed to focus especially on the setting where a rare sub-population of cells is present in only one batch. This situation is potentially problematic for any data integration strategy: we have noticed that our existing methods tend to collapse rare populations. Given this, we will ensure that the identifiability of these populations is properly reflected in the new metrics we define, thus motivating the development of integration methods that do not obscure such signals. Additionally, five recently published scRNA-seq datasets of human pancreatic islets encompass multiple technologies (SMART-Seq2/CelSeq/CelSeq-2/inDrop), and feature abundant and rare populations with canonical markers, thus forming an ideal human dataset for benchmarking.
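One possible starting point for the entropy-based metrics proposed in this Aim is a per-cell entropy of batch labels within each cell's local neighborhood. The sketch below is only an assumption about what such a metric could look like: the function name, the brute-force neighbor search, and the default `k` are ours, and the metrics we ultimately define may differ.

```python
import numpy as np


def batch_mixing_entropy(X, batches, k=10):
    """Normalized Shannon entropy of batch labels among each cell's
    k nearest neighbors (Euclidean distance, brute force).

    X is a cells x features array (e.g. an aligned low-dimensional
    embedding); batches holds one batch label per cell. Returns values
    in [0, 1]: 0 means the neighborhood comes from a single batch,
    1 means all batches are equally represented.
    """
    X = np.asarray(X, dtype=float)
    labels, codes = np.unique(np.asarray(batches), return_inverse=True)
    if len(labels) < 2:
        raise ValueError("need at least two batches")
    n = X.shape[0]
    k = min(k, n - 1)

    # Pairwise squared distances; exclude each cell from its own neighbors.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]

    # Fraction of neighbors from each batch, per cell.
    p = np.stack([(codes[nn] == b).mean(axis=1)
                  for b in range(len(labels))], axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p), 0.0)
    return -terms.sum(axis=1) / np.log(len(labels))
```

Note that this quantity measures mixing only. A rare population present in a single batch should legitimately score near zero, so a mixing score of this kind would need to be paired with a measure of preserved cell-type structure before it could serve as a benchmark.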
Deliverables from this Aim will be a GitHub repository containing code for the newly devised integration strategies, alongside clear and reproducible examples. This repository will also contain simulated data generated to reflect a variety of real-life settings, and the newly defined metrics for assessing the performance of integration strategies. Benchmark datasets and test cases will be distributed using the HCA data infrastructure and using accessible web portals (see Dissemination).
Aim 2 : Construction of an integrated atlas of the human brain from diverse, community-generated datasets
The human brain consists of billions of cells exhibiting extraordinary heterogeneity in molecular composition, localization, electrophysiology, and function. This remarkable diversity, coupled with challenges in sample dissociation and processing, has led to a rich variety of scRNA-seq datasets of the human brain. These include human neurons profiled from fresh tissue (Darmanis et al., 2015), by laser-capture microdissection (Nichterwitz et al., 2016), by Patch-seq (Lin et al., 2016), and by nuclear profiling with plate-based (Lacar et al., 2016), microfluidic (Lake et al., 2016), and droplet-based (Habib et al., 2017; Lake et al., 2017) technologies. Yet all eight datasets measure the same underlying tissue. Therefore, the human brain represents an ideal tissue for development and benchmarking of methods for scRNA-seq integration.
We will therefore aim to combine all of these datasets and construct a coherent atlas of cellular phenotypes in the human brain, based on our best-practice methods and metrics. While ambitious, our preliminary data indicate that this goal is fully achievable, and we welcome the inclusion of additional neuronal datasets that become available during the project period. We will perform a single graph-based clustering analysis on the fully integrated dataset, and we strongly expect the increased cell numbers to dramatically boost our ability to detect rare and subtle neuronal cell states.
We will aim to deliver: 1) A catalogue of cell states and genetic markers that is robust to differences in technology, alongside rare cell types that are found only in a subset of datasets, 2) Systematic comparison of transcriptomic profiles for the same cell state across technologies, with a particular focus on identifying gene sets that are enriched in either scRNA-seq or scNuc-seq experiments, 3) Jupyter notebooks with reproducible workflows, laying out a clear roadmap for similar analyses in diverse tissues. We will assess the performance of our methods using the benchmarks developed above, and will leverage longstanding collaborations with neuroscientists (Tom Maniatis; Gord Fishell; Steve McCarroll; Evan Macosko) to assist in interpretation and exploration of our findings.
Aim 3 (Rahul) : Integrate neuronal datasets across species, mapping human cell types to their mouse counterparts
HCA aims to identify hundreds to thousands of cell types in the human brain based on molecular and spatial characterization, but functional characterization or perturbation of these cells cannot be performed in humans. Understanding the human genome faces similar challenges, and comparative genomics represents an invaluable tool to identify conserved signals, highlight important differences, and map human sequences onto tractable model systems. We propose that cross-species analysis will play a similarly essential role for HCA, and will be a crucial step in connecting a catalog of cell types to deeper biological understanding.
We recently published the first integrated analysis of human and mouse pancreatic atlases, identifying ten shared cell types despite significant evolutionary divergence. We will extend this analysis here to create an integrated atlas of the mammalian nervous system, leveraging landmark neuronal datasets in the adult mouse (Zeisel, Allen, Macosko). Our successful alignment of pancreatic islet cells demonstrates that this goal is feasible, even when only a subset of transcriptomic markers is shared across species. We will also rigorously test strategies for mapping gene orthologies between species to see how different strategies affect performance. Our deliverables will be as stated in Aim 2, but here, we will also focus on reporting the best transcriptomic markers that are shared between species, potentially enabling the construction of murine Cre driver lines for functional characterization. In addition, we expect that our initial characterization of cell states that are shared across species, or unique to either, will be of significant value to the neuroscience community, and will begin to establish the power of applying lessons from comparative genomics to HCA.
Aim 3 (John):
The approaches outlined in Aim 1 focus on improving the performance of data integration methods when combining multiple datasets generated from the same underlying population of cells (e.g., a specific tissue) but using either different technologies or cells collected from different individuals. In the context of developmental biology and when comparing tissues across species, the assumption that we are considering the same underlying population of cells does not hold. For example, in the context of tissue differentiation, cells collected at different stages of development will consist of a mix of common cell types (e.g., precursor populations present at different time points), transitional populations present only at specific time points, and, ultimately, terminally differentiated cells.
At present, technological limitations mean that cells are sampled sequentially, so that constructing pseudotemporal differentiation trajectories for many biological processes requires combining information across batches. This applies both in the context of developmental biology (e.g., sampling different stages of early mouse development) and when modelling the development of human cell types in vitro using organoid-based systems. To integrate data collected from sequential stages of development, we propose to jointly learn the biological manifold while correcting for batch effects.
To this end, we propose to extend the Mutual Nearest Neighbor approach by employing a more formal factor analysis framework. Specifically, we will assume that variability in the expression profiles within the combined dataset (i.e., considering cells from all timepoints) can be explained by a series of hidden factors that we want to infer. Each factor will be “active” for a given set of genes that co-vary consistently across the entire dataset.
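As a minimal sketch of the kind of model described above, and not a final specification, the expression of gene g in cell c could be written as a sum over K hidden factors with gene-level activity indicators. All symbols here are notation we introduce for illustration only.

```latex
y_{g,c} = \sum_{k=1}^{K} s_{g,k}\, w_{g,k}\, z_{k,c} + \varepsilon_{g,c},
\qquad \varepsilon_{g,c} \sim \mathcal{N}(0, \sigma_g^2),
\qquad s_{g,k} \in \{0, 1\}
```

Here z_{k,c} is the state of factor k in cell c, w_{g,k} is the loading of gene g on factor k, and s_{g,k} = 1 indicates that factor k is "active" for gene g. The Gaussian noise term is an assumption made for simplicity; a count-based likelihood may be more appropriate for scRNA-seq data.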
To disentangle batch effects from biological signal, we will assume that batch effects apply to large numbers of genes and will thus be captured by dense factors with large numbers of active genes (including housekeeping genes). Importantly, we will also assume that technical effects are orthogonal between pairs of batches and, crucially, that they are always orthogonal to the biological signal of interest. To identify this biologically meaningful signal, we will identify informative factors, which will generally have a smaller number of active genes. Additionally, prior information, corresponding to the stages when samples are collected, can help guide the choice of factors.
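The dense-versus-sparse distinction above can be illustrated with a small sketch that takes an already inferred loading matrix, labels each factor by the fraction of genes it is active for, and subtracts the dense factors from the data. The helper names, the 0.5 density threshold, and the hard thresholding itself are hypothetical; in the proposed framework this separation would arise from the model's priors rather than from a post hoc cutoff.

```python
import numpy as np


def factor_density(W, tol=1e-8):
    """Fraction of genes with a non-negligible loading on each factor.

    W is a genes x factors loading matrix.
    """
    return np.mean(np.abs(W) > tol, axis=0)


def split_factors(W, dense_threshold=0.5, tol=1e-8):
    """Return (dense, sparse) factor indices.

    dense_threshold is an illustrative cutoff, not a recommended value.
    """
    density = factor_density(W, tol)
    dense = np.flatnonzero(density >= dense_threshold)
    sparse = np.flatnonzero(density < dense_threshold)
    return dense, sparse


def remove_dense_factors(Y, W, Z, dense_idx):
    """Subtract the contribution of dense (putative batch) factors.

    Y is genes x cells, W is genes x factors, Z is factors x cells.
    """
    return Y - W[:, dense_idx] @ Z[dense_idx, :]
```

Applied to data generated exactly as `W @ Z`, removing the dense factors leaves only the contribution of the sparse, putatively biological factors.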
Aim 3 (Oli):
The methods developed in Aim 2 allow for integrating different scRNA-seq datasets by modelling shared sources of gene expression covariation. Here, we seek to extend these methods to additional single-cell technologies, most notably expression assays that deliver spatially resolved expression levels, a critical component of the HCA.
An important limitation of conventional scRNA-seq of dissociated populations of cells is that the natural tissue context of the cells is lost. Complementary data from spatial profiling methods are indispensable for filling in these gaps, allowing single-cell RNA-seq profiles to be placed into the context of tissue coordinates. While the generation of these datasets is already underway in different contexts and is a major component of several HCA projects, there is a lack of computational strategies for integrating spatial expression data and scRNA-seq datasets. To address this, we will extend the methods derived in Aim 2 to account for spatial information of the cells. These approaches will yield new insights into spatial components of gene expression variation, including predictions of the spatial coordination of expression between cells from scRNA-seq. At the core of our approach will be the development of new factor models that use spatial Gaussian process priors on the inferred factors. We have recently proposed one of the first methods for modelling spatially resolved expression datasets using this class of model (Svensson et al., 2017). By connecting these different models it will be possible to infer factors that explain co-expression clusters with and without a spatial underpinning, which can be integrated with scRNA-seq data from dissociated cells.
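To make the idea of a spatial Gaussian process prior on a factor concrete, the sketch below builds a covariance matrix over tissue coordinates and draws one spatially smooth factor from it. The squared-exponential kernel, the parameter names, and the jitter value are assumptions chosen for illustration; the text above does not commit to a particular kernel, and this is not the implementation of any published method.

```python
import numpy as np


def squared_exponential_kernel(coords, lengthscale=1.0, variance=1.0):
    """Covariance between cells as a function of spatial distance.

    coords is a cells x dimensions array of tissue coordinates.
    Nearby cells get covariance close to `variance`, distant cells
    get covariance close to zero.
    """
    coords = np.asarray(coords, dtype=float)
    sq = np.sum(coords ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * coords @ coords.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)


def sample_spatial_factor(coords, lengthscale=1.0, variance=1.0,
                          jitter=1e-6, seed=0):
    """Draw one factor (one value per cell) from a zero-mean GP prior."""
    K = squared_exponential_kernel(coords, lengthscale, variance)
    K = K + jitter * np.eye(K.shape[0])  # numerical stabilization
    L = np.linalg.cholesky(K)
    rng = np.random.default_rng(seed)
    return L @ rng.standard_normal(K.shape[0])
```

A factor without a spatial underpinning would correspond to replacing the kernel with a diagonal covariance, so that factor values are independent across cells regardless of their position.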
-- more here --
5. Dissemination of Methods, Collaboration with CZI, Commitment to Sharing
Our laboratories have been at the forefront of methods development for single-cell data analysis and integration. In 2015, the Marioni and Satija groups independently published the first analytical methods to integrate scRNA-seq datasets with in-situ hybridization databases, enabling the inference of a cell's spatial localization based on its gene expression. All groups have also created and maintained powerful, widely used, and fully open-source scRNA-seq analytical toolkits, scran (Marioni), scater (Stegle) and Seurat (Satija), demonstrating our deep commitment to fully sharing analytical methods with the community.
BRIEF PROJECT SUMMARY (250 words; currently exactly 250)