Milestone 20160704
Tue July 5 10:46:37 CDT 2016 3662453b2f7e372633c5fe9ccd7e13d15611631c
Road so far and ahead: Summaries of what we have learned and what we want to do next.
Objectives
We want to discover (from existing methods) or develop (a new method) a procedure for reconstructing two-dimensional tree-like structure from data, based on observed Gaussian / counts / binary “features”. Unlike phylogenetic problems, where data are observed only at the tips, here we additionally observe internal nodes and data points in between nodes. Although the procedure is generic, we have the following particular biological context in mind for immediate application of the methods:
- Given a set of single cell samples we want to construct the cell lineage/differentiation tree
- from gene expression data, tumor mutation data, etc.
- Determine the pseudotime course across all the single cell samples
- Determine clusters of genes behaving differently (expression, mutation) across branches of the tree
- thus identifying genes important to a given branch of development (e.g. brain vs non-brain development)
- Use this tree as a structure to construct covariance prior for OmicsBMA
Competing methods
Monocle: fits a version of the minimum spanning tree (MST) over the single cells and finds the distance along the differentiation tree between any two cells. They primarily focus on pseudotime and extract genes differentially expressed across pseudotime by fitting a generalized additive model (GAM) to the expression data.
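As a concrete (and heavily simplified) picture of that idea, here is a minimal MST-plus-tree-distance sketch; the toy data, the scipy calls, and the root choice are our own illustration, not Monocle's implementation.

```python
# Sketch of the MST + tree-distance idea behind Monocle-style pseudotime.
# Toy data and names are ours; this is not Monocle's implementation.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, dijkstra
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 20))          # 50 cells x 20 genes (toy data)

dist = squareform(pdist(D))            # pairwise Euclidean distances
mst = minimum_spanning_tree(dist)      # sparse MST over the cells

# Distance along the tree between any two cells = shortest path in the MST.
tree_dist = dijkstra(mst, directed=False)

# Pseudotime: pick a root cell and order the rest by tree distance from it.
root = 0
pseudotime = tree_dist[root]
order = np.argsort(pseudotime)
```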
Slingshot: a method very much in development (Sandrine Dudoit's lab). They first cluster the cells and assign cluster labels, then run PCA, connect the PC values of the distinct clusters by an MST, and finally approximate the MST by a principal curve.
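A rough sketch of that cluster-then-connect pipeline, with sklearn's KMeans and PCA as stand-ins for whatever Slingshot actually uses; the principal-curve smoothing step is omitted here.

```python
# Rough sketch of a Slingshot-style pipeline: cluster -> PCA -> MST over
# cluster centers.  The principal-curve step is omitted.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(1)
D = rng.normal(size=(200, 50))                 # toy cells x genes

pcs = PCA(n_components=2).fit_transform(D)     # low-dimensional embedding
labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(pcs)

# Connect the cluster centers by an MST; its paths are candidate lineages.
centers = np.vstack([pcs[labels == k].mean(axis=0) for k in range(5)])
lineage_tree = minimum_spanning_tree(squareform(pdist(centers)))
```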
SCITE: targeted towards phylogenetic reconstruction of the mutation tree, but it has a nice MCMC approach that, with some modifications, could be followed to get at a model-based tree structure.
Manifold learning, MST and factor analysis: see this benchmark.
Lessons from exploratory analysis
Here are some key observations from our benchmark output:
PCA: the first two PCs find a linear pattern across the cells (a time order) that can probably be used for pseudotime construction, but branches towards the leaves are less distinct. We may need other PCs to recover more local patterns; this idea has been adopted in Slingshot. Interpretability of the various PCs remains an issue in the context of pseudotime reconstruction.
Unlike PCA, t-SNE does a better job at capturing local behavior, as expected, but shows breaks in its plot, highlighting a lack of power in recovering global patterns.
Most of these methods seemed pretty robust to noisy features (including CountCluster, and possibly FLASH?), with a few exceptions. Noisy samples do blur the picture, though not completely.
One key observation: with noisy features or noisy samples, denoising with SFA or FLASH to reduce dimension and then applying MST to the denoised samples gave a much tighter-looking and cleaner MST graph (check out `layout_as_tree` and `layout_as_kk`) than using MST alone. To our knowledge this has not been applied by other methods for inferring cell differentiation patterns. Given the many noisy genes in the data, this finding is worth pursuing; a minimal version of the pipeline is sketched below.
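In this sketch, sklearn's FactorAnalysis stands in for SFA/FLASH (our substitution, chosen for availability; the benchmark used SFA/FLASH themselves).

```python
# Minimal version of the denoise-then-MST pipeline described above.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(2)
signal = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 300))
D = signal + rng.normal(scale=2.0, size=signal.shape)   # noisy cells x genes

fa = FactorAnalysis(n_components=5)
scores = fa.fit_transform(D)                # denoised low-dimensional scores

# MST on denoised scores vs. MST on the raw data.
mst_denoised = minimum_spanning_tree(squareform(pdist(scores)))
mst_raw = minimum_spanning_tree(squareform(pdist(D)))
```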
Road ahead
We want a method that:
- Makes simple model assumptions (not specific to any particular diffusion process)
- Has a simple / intuitive interpretation
- Scales well with both \(N\) and \(P\) of the data
- Is robust to noisy samples / features
Additionally (ideally):
- Produces a good visualization
- Provides a principled procedure to identify “key” genes (e.g. differentially expressed).
Possible model
Matthew proposes an outline of a possible model, which we transcribe here:
notation | meaning | comments |
---|---|---|
\(D_{ij}\) | data on individual \(i\) at locus \(j\) | |
\(F_{kj}\) | factor \(k\) at locus \(j\) | gives expectation for data at that locus |
\(Z_i = (Z_{i1},Z_{i2})\) | pair of factors giving rise to individual \(i\) | |
\(\pi_{k1,k2}\) | frequency with which factor pair (\(k1\),\(k2\)) occurs in data | \(\pi_{k1,k2} = Pr(Z_{i1}=k1, Z_{i2}=k2)\) |
\((\lambda_i,1-\lambda_i)\) | memberships of individual \(i\) in \((Z_{i1},Z_{i2})\), with \(\lambda_i \in [0, 1]\) | assume for simplicity that \(\lambda_i\) lies in a discrete set, say \(\{0,0.01,\dots,0.99,1.00\}\); assume a uniform prior, so \(Pr(\lambda_i = q) = 1/101\) for this set. |
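To make the notation concrete, here is a toy simulation from this model under a normal likelihood; the values of \(K\), \(N\), \(P\), and \(\sigma_j\) are illustrative choices on our part, not part of the proposal.

```python
# Toy simulation from the model in the table above (normal likelihood).
import numpy as np

rng = np.random.default_rng(3)
K, N, P = 4, 100, 50
lam_grid = np.linspace(0, 1, 101)           # {0, 0.01, ..., 1.00}

F = rng.normal(size=(K, P))                 # factor k at locus j
pi = rng.dirichlet(np.ones(K * K)).reshape(K, K)   # Pr(Z_i = (k1, k2))

Z = np.array([np.unravel_index(rng.choice(K * K, p=pi.ravel()), (K, K))
              for _ in range(N)])           # factor pair per individual
lam = rng.choice(lam_grid, size=N)          # uniform prior over the grid

# E(D_ij | Z_i, lambda_i, F) = lambda_i F_{k1,j} + (1 - lambda_i) F_{k2,j}
mean = lam[:, None] * F[Z[:, 0]] + (1 - lam[:, None]) * F[Z[:, 1]]
D = mean + rng.normal(scale=0.5, size=(N, P))   # sigma_j = 0.5 for all j
```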
Likelihood
Assuming independent individuals, define the likelihood for \((\pi, F)\) as
\begin{equation} L(\pi, F) = Pr(D | \pi, F) = \prod_i Pr(D_i | \pi, F) \label{eq:lik} \end{equation}
\begin{equation} Pr(D_i | \pi, F) = \sum_{k1, k2} \pi_{k1,k2} Pr(D_i | Z_i = (k1,k2), F) \label{eq:liki} \end{equation}
\begin{equation} Pr(D_i | Z_i, F) = \sum_q Pr(\lambda_i = q) \prod_j Pr(D_{ij} | Z_i, \lambda_i = q, F) \label{eq:likzi} \end{equation}
\begin{equation} E(D_{ij} | Z_i = (k1,k2), \lambda_i = q, F) = q F_{k1,j} + (1-q) F_{k2,j} \label{eq:Eij} \end{equation}
Use a Poisson, multinomial, or normal likelihood on top of \ref{eq:Eij}. In the normal case we can assume known (?) \(\sigma_{ij} = \sigma_j\).
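A direct (brute-force) numerical transcription of \ref{eq:lik}–\ref{eq:Eij} for the normal case, reusing `F`, `pi`, `lam_grid`, and `D` from the simulation above; this is for checking the math, not an efficient implementation.

```python
# Log-likelihood of (pi, F) under the normal model with known sigma_j.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def loglik(pi, F, D, lam_grid, sigma=0.5):
    total = 0.0
    for Di in D:                                 # independent individuals
        # mean[q, k1, k2, j] = q F_{k1,j} + (1-q) F_{k2,j}   (eq:Eij)
        mean = (lam_grid[:, None, None, None] * F[:, None, :] +
                (1 - lam_grid[:, None, None, None]) * F[None, :, :])
        # log Pr(D_i | Z_i, lambda_i=q, F), summed over loci j
        lp = norm.logpdf(Di, loc=mean, scale=sigma).sum(axis=-1)
        # integrate lambda_i over its uniform grid            (eq:likzi)
        log_pz = logsumexp(lp, axis=0) - np.log(len(lam_grid))
        # mix over factor pairs (k1, k2)                      (eq:liki)
        total += logsumexp(log_pz + np.log(pi))
    return total                                 #                (eq:lik)
```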
Estimation
Estimate \((\pi, F)\) by maximum likelihood:
- Initialize \(F\).
  - Maybe lots of factors, e.g. one factor per individual, \(F_{kj} = D_{kj}\),
  - or do local smoothing:
    - identify the neighbors of each individual \(i\), then average these to give a factor \(i\)
    - maybe t-SNE or other dimension reduction is useful here?
  - The hope is that even with lots of factors, many entries of \(\pi\) may be 0.
- Optimize over \(\pi\) given \(F\).
  - There should be an EM algorithm for this (a sketch of one such update follows below).
  - The first step is to compute \ref{eq:likzi} for each \(i, k1, k2\).
- Optimize over \(F\) given \(\pi\).
  - There may be an EM step for \(F\) given \(\pi\).

If we are lucky \(\pi\) will be sparse; otherwise we may need to force sparsity and re-estimate the factors.
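Here is our guess at what one EM update for \(\pi\) given \(F\) could look like; the source only notes that an EM algorithm should exist, so treat this as a sketch, not a worked-out derivation. The E step computes responsibilities from \ref{eq:likzi}, the M step averages them.

```python
# One EM-style update for pi given F (our sketch, normal likelihood).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def update_pi(pi, F, D, lam_grid, sigma=0.5):
    R = []
    for Di in D:
        # E step: r_i(k1, k2) = Pr(Z_i = (k1, k2) | D_i, pi, F)
        mean = (lam_grid[:, None, None, None] * F[:, None, :] +
                (1 - lam_grid[:, None, None, None]) * F[None, :, :])
        lp = norm.logpdf(Di, loc=mean, scale=sigma).sum(axis=-1)
        log_pz = logsumexp(lp, axis=0) - np.log(len(lam_grid))  # (K, K)
        post = log_pz + np.log(pi)
        R.append(np.exp(post - logsumexp(post)))
    # M step: pi_{k1,k2} = average responsibility across individuals
    return np.mean(R, axis=0)
```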
Inferring \(Z\)
\ref{eq:likzi} is computationally challenging because of the many potential choices of factors. Some ideas from the discussion between Kushal and Gao: assume mixture priors for \(Z\) over all the edges to get mixture proportions, and, to solve for \(Z\), integrate out \(\lambda\) to get a marginal likelihood on the edges.
There are two issues:
- It seems a complicated problem that does not scale with \(N\).
- It is unsupervised, so the factors we end up with may not be the nodes but rather combinations of many nodes.
Some thoughts we have:
Use lessons from the exploratory benchmark: the MST on denoised data did capture the branches well, but sometimes created spurious mini-branches. We can use the branches or the leaves we get from the MST to initialize the factors.
A possible naive approach: take all the leaves and all the nodes in the tree with MST degree greater than or equal to 3 as potential factors (will there still be many of them?); see the sketch below. We may want to merge similar-looking factors, or modify some of them, but this is better than starting from a completely unsupervised state. It may also keep us from hitting a bad local optimum by starting from a bad point.
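A sketch of this initialization rule, assuming we already have denoised scores (any toy stand-in works):

```python
# Naive factor initialization: take the MST's leaves (degree 1) and
# branch points (degree >= 3) as candidate factors.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(4)
scores = rng.normal(size=(100, 5))          # denoised cells (toy stand-in)

mst = minimum_spanning_tree(squareform(pdist(scores)))
adj = ((mst + mst.T) > 0).toarray()         # symmetrize, then binarize
degree = adj.sum(axis=1)

candidates = np.where((degree == 1) | (degree >= 3))[0]
F_init = scores[candidates]                 # initial factors F
```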
Correlation structure, a property worth investigating: one property we can rely on, even without assuming a Brownian-motion-like model, is the Brownian structure of the covariance matrix. Even if the process is not Brownian, its correlation structure might be close to that of a Brownian process on the tree. That is exactly the information MST uses to build the tree under correlation distance, and it does a pretty good job of it. We may want to exploit this property more in the model-based framework we propose.
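For concreteness, a toy numerical check of the covariance property we have in mind: for Brownian motion run along a rooted tree, the covariance between two tips equals the shared branch length from the root to their most recent common ancestor. The tree and branch lengths below are made up.

```python
# Brownian-on-tree covariance: Cov(x_i, x_j) = shared branch length from
# the root to the most recent common ancestor of tips i and j.
import numpy as np

# Small rooted tree: root -> a (length 1), a -> i (1), a -> j (1),
# root -> k (2).  Tips ordered (i, j, k).
cov = np.array([
    [2.0, 1.0, 0.0],        # Var(i) = depth(i) = 2, Cov(i, j) = depth(a) = 1
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],        # k shares no branch with i or j
])
# Samples from N(0, cov) behave like tip values of the tree process.
x = np.random.default_rng(5).multivariate_normal(np.zeros(3), cov, size=1000)
print(np.cov(x, rowvar=False).round(1))     # recovers cov approximately
```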
Scaling with data: the method will most likely scale well with \(N\) if we can start from “good points” to reduce the search space.
Feature selection: even though “white noise” on features (as in the current benchmark) does not seem to matter to manifold learning or factor analysis, it is still possible that different subsets of genes follow different tree structures. This is not simulated in the current benchmark. We should consider how to select the right set of genes to work with; but probably the difference is local.
Final comments
Denoising with SFA/FLASH followed by MST seems a good way to get rid of white noise in genes. Our method might also serve as a pre-processing step for MST. There is currently not much literature on single-cell tree construction from gene expression data, so applying methods from the current benchmark, particularly factor analysis + MST, might be an interesting application by itself. Of course, we expect a new model along the lines of our proposal to do a better job.