Meetings
Project meeting 20160921
Tentative schedule
- The problem with not using residual standard error has been fixed; see updated results. On the PF model, when there is a clean signal the algorithm is robust to initialization points.
- Discussions
- Next steps: visualization; factor analysis DSC benchmark??
- Why does visualization not work in PC space?
- Should we do some BIC on K?
- Can we drop sparse features (factor values becoming close to zero) as we iterate? How about also merging nearby factors, thus reducing K?
Minutes
Summary: it is reassuring that the PFA model is computationally working. The next step is to explore more data, real and simulated, to figure out the properties of the model in practice and to inspire / formulate additional model assumptions that might be useful.
We should propose additional model assumptions and test them empirically on data, or look into more data to generate these additional assumptions. I'll open an issue to keep track of additional model assumptions.
Simulations
- Come up with simulations involving more paired factors and see how PFA behaves for larger K. We want the edges to be sparse. (How should we choose the type of sparse graph?)
- One possible configuration would be placing factors on a circle and trying to create a linear approximation to the circle; see the sketch after this list.
- Brownian motion simulation: not working "perfectly", but we should not be too worried about it; the real-world situation may well be different and more complicated.
- Look into the manifold learning literature: our problem is like one-dimensional manifold learning
- e.g., how to simulate a one-dimensional Swiss roll and place factors on it
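A minimal sketch of the circle configuration mentioned above, assuming Gaussian noise and factors placed on a random 2-D plane in gene space (all names and settings here are illustrative, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(1)

K, G, N = 8, 200, 500        # factors, genes (features), samples
sigma = 0.5                  # noise level

# Place K factors evenly on a circle embedded in G-dimensional space.
angles = 2 * np.pi * np.arange(K) / K
basis = rng.normal(size=(2, G))                               # random 2-D plane
F = np.column_stack([np.cos(angles), np.sin(angles)]) @ basis  # K x G factors

# Each sample mixes two *adjacent* factors, so the factor graph is a
# sparse cycle and samples fall on chords approximating the circle.
L = np.zeros((N, K))
for i in range(N):
    j = rng.integers(K)              # first factor
    w = rng.uniform()                # mixture proportion in (0, 1)
    L[i, j] = w
    L[i, (j + 1) % K] = 1 - w        # its neighbor on the circle

Y = L @ F + sigma * rng.normal(size=(N, G))   # observed data
```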
Real data
- Look at more real data. Single-cell data (Deng data) and multi-tissue data (GTEx) should be two good data types to study (because they are potentially of a different nature)
- Can PFA find fine / new structure that the topic model does not?
- Look into the data to identify which changes in genes drive the clustering of samples, and whether the genes involved are many or few.
- Write down potential new model ideas after observing the model's behavior in data analysis
- Look at manifold learning papers and see what types of data they use, and analyze some to compare. Ideally those are "clean" data sets, because the fact that they are published means they work well with manifold learning.
A discussion on PCA
We have observed that visualization of PFA in PC space does not look very good, yet using top PCs in place of the original features seems to do a decent job. One would be interested in finding out why that is the case.
In data analysis where there are very many features, e.g. 20,000 genes, it is perhaps a good idea to use the top expressed genes rather than the top PCs, because interpretation should go back to the original data space, not the PC space: we want to be able to say which gene(s) contribute to the clustering and by how much, and going back from PC space to the original space might be hard. However, if we can show the model is invariant to the choice of coordinates, we should find it easy to use PCA and project results back to the original coordinates. This might currently be the case, because no assumption in the model refers to the coordinate system. We can try to verify this by rotating the data and seeing if we get the same outcome; if true, we can then prove it mathematically (talk to David). Another thought would be selecting features via sparse PCA, but again we would have to reason about why that selection retains the most informative features for factor analysis.
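As a quick empirical check of coordinate invariance, one can rotate the data by a random orthogonal matrix and compare fits. The sketch below uses PCA as a stand-in, since its rank-k reconstruction is rotation-equivariant; the same harness would apply to a PFA fit by swapping in its reconstruction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 30))                   # placeholder data matrix
Q, _ = np.linalg.qr(rng.normal(size=(30, 30)))   # random orthogonal rotation

def reconstruct(X, k=5):
    """Rank-k reconstruction X ~ LF; PCA here, a PFA fit could be swapped in."""
    pca = PCA(n_components=k).fit(X)
    return pca.inverse_transform(pca.transform(X))

# Coordinate invariance: fitting rotated data should equal rotating the fit.
print(np.allclose(reconstruct(Y @ Q), reconstruct(Y) @ Q, atol=1e-6))
```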
Questions and answers
Q: Should we try to determine K?
A: No. Rather, we report results for a number of K's. Different K's give rise to different interpretations of the data. A large K loses hierarchical structure, as seen in GTEx data with the topic model (losing the brain vs. non-brain split), and thus loses the big picture. However, perhaps the constraint of 2 factors might help.
Q: How can we compare with other factor analysis methods?
A: We do not. We are just providing a new tool to test a specific hypothesis. We can show that other factor analysis methods do not give the same answer as ours, because of the additional constraints we have.
Project meeting 20160916
Tentative schedule
- Status
- The paired factor analysis model (PFA) has been implemented.
- A version of PCA with local adjustment has been implemented. The local adjustment is done by kNN (k = 3 for now).
- Problems
- The implementation uses a conventional EM algorithm (in C++). It does not scale well – an analysis with 15 factors and 300 features takes over 12 hours.
- Performance depends on choosing the right `K` and particularly the right initializing factors, as demonstrated in this animation. Performance is best with the help of block models to initialize factors, but choosing the right `K` remains critical.
- For the non-model-based method: can better local methods be used to perform the local smoothing, besides the kNN method we are using now? (For pop-gen data, spatial information might be good to use too.)
- Future plans
- Improve PFA model
- Look at other real data? We have downloaded a few data sets but have not yet looked at them
- Do we restrict ourselves to trees? PFA creates a graph ('web'); the kNN-PCA type of work balances local / global information. Where should we steer the project, particularly PFA? What applications do we want to consider, assuming we've fixed the model?
Meeting summary and action items
The meeting was rescheduled to Sept 20. Here is a summary:
- Matthew pointed out a modeling error on gene variance – we need to use residual variance, not sample variance. Kushal has updated the model after the meeting.
- We need to find a better visualization of PFA results, potentially via:
  1. Compute the adjacency matrix of factors.
  2. Build a planar graph of these factors with minimal overlap: Crossing Minimization within Graph Embeddings has no software but has summarized / compared other methods as of 2012.
  3. Put data points onto the graph according to their loadings on factors.
- For the non-model-based approach, we need to figure out an intellectually attractive justification of why it works: why smoothing of the variance matrix provides better eigenvectors. Seemingly improved results on noisy simulated data are not enough.
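A rough sketch of steps 1–3, assuming a loadings matrix `L` (samples × factors) is available; it uses a spring layout in place of proper crossing minimization, for which the minutes note no software exists:

```python
import numpy as np
import networkx as nx

def pfa_layout(L):
    """Sketch of the proposed visualization: factors become graph nodes,
    connected when samples co-load on them; samples are then placed
    between their factors according to their (nonnegative) loadings."""
    N, K = L.shape
    A = np.zeros((K, K))                 # step 1: factor adjacency matrix
    for row in L:
        j, k = np.argsort(row)[-2:]      # the two dominant factors
        A[j, k] += 1
        A[k, j] += 1

    # Step 2: lay the factor graph out in 2-D. A spring layout stands in
    # for true crossing minimization.
    G = nx.from_numpy_array(A)
    pos = nx.spring_layout(G, weight="weight", seed=0)
    coords = np.array([pos[k] for k in range(K)])

    # Step 3: each sample is a loading-weighted average of its factors.
    W = L / L.sum(axis=1, keepdims=True)
    return coords, W @ coords
```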
Here is a todo list:
- Fix the `pfar` implementation of the residual variance problem.
- Re-evaluate these scenarios, particularly under the PF model; think about the expected behavior of the BM model.
- Work on visualization.
- Make a factor analysis benchmark out of the tree benchmark to compare PFA with SFA, CountCluster and FLASH.
- Look at various data sets and build them into benchmarks. We should come up with a factor analysis benchmark.
- Think about justifications of the non-model-based tree reconstruction. In the meantime, tidy up the tree benchmark with other established tree methods.
- Optimize: the PF algorithm can be optimized by dropping sparse features at each iteration; also, factors close enough to each other can be merged. The question is how to define sparse features and how to decide the threshold for merging factors.
- BIC for the choice of K, or some other criterion (a generic form is sketched below). Need to talk to Matthew.
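For the BIC item, a generic form as a starting point; what exactly counts as a parameter in PFA is open and is part of the discussion with Matthew:

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """Generic BIC = -2 * log-likelihood + (number of parameters) * log(n).
    Compute across candidate K's and prefer the smallest value."""
    return -2.0 * loglik + n_params * np.log(n_obs)
```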
Project meeting 20160726
A status review with @kkdey on the project:
Recap
So far we have completed the factor model (calling it PFA for now; see the `pfar` repo). Next steps:
- Currently we have `SFA`, `FLASH`, `CountCluster` and `PFA` as model-based approaches to the tree problem. We need to evaluate them more formally on simulated / real data with different `K` settings; specifically we are interested in how `PFA + MST` compares with other combo methods.
- We want to develop the use of `PFA` in two contexts for now:
  - To figure out, from the factor analysis, the pseudo-time course for single-cell data – projection / visualization.
  - To use this factor analysis in the OmicsBMA model.
- After playing with it on the benchmark we need to get more data to run it on.
- We are still interested in pursuing the `G + MST + L` procedure, where we first do "global pruning", followed by MST to get a tree, then "local pruning" to merge branches (see the MST sketch after this list). A related, generic question is how to leverage global dimension-reduction techniques (e.g. PCA) together with local ones (e.g. t-SNE). Possibly relevant work in the field for ancestry analysis includes:
  - Anad's method that relates genotype distance to spatial distance
  - `trace` / `laser` from UMich (http://csg.sph.umich.edu/chaolong/LASER)
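A sketch of the MST step of `G + MST + L` over fitted factors, with the global / local pruning steps left out (the factor matrix `F` and the Euclidean distance choice are assumptions here):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def factor_mst(F):
    """Build the MST over factors (rows of F) from pairwise Euclidean
    distances; the pruning steps before / after this are not shown."""
    D = squareform(pdist(F))          # K x K distance matrix
    T = minimum_spanning_tree(D)      # sparse upper-triangular tree
    return T.toarray()                # edge weight > 0 means a tree edge
```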
Plan for the next week:
Improve DSC2 – resolve its scale issues and have a first DAG-free implementation that works on the cluster; then build PFA into the existing benchmark and expand it with more settings and data sets.
Project meeting 20160630
An update on DSC2 and tree data (group meeting, 12pm room 427; followed by a private meeting at 2pm)
Tree data exploration
- Motivation (and biological motivation??)
- Exploratory analysis setup, methods evaluated and DSC implementation if time permits
- Show the benchmark and make highlights
- Next steps
Project meeting 20160620
A brief meeting with Matthew.
Project status
- Updates on MST, SFA and CountCluster with Kushal Dey
Criteria for a good model
Think, in terms of factor analysis, what criteria are we optimizing?
For a factor analysis Y = LF + E, we can evaluate ||Y - LF||^2 and compute it for both the true and the fitted LF. The true LF can be derived with the tree structure in mind: every sample has two factors, with loadings reflecting the mixture proportions – this is the best we can possibly do if we add the "at most 2 factors" and "loadings in (0, 1)" constraints. We can use this criterion on factor analysis based methods including PCA. We can also initialize the SFA algorithm with the true L and F. That way, we can decide whether we should blame the search strategy or the (lack of) constraints.
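A small sketch of the criterion; the oracle comparison is hedged as comments, and the variable names are illustrative:

```python
import numpy as np

def fit_error(Y, L, F):
    """The criterion ||Y - LF||^2 (squared Frobenius norm)."""
    R = Y - L @ F
    return float(np.sum(R * R))

# Hypothetical usage: compare the fitted factorization against the oracle
# one built from the tree (two factors per sample, loadings = mixture
# proportions). L_true, F_true, L_hat, F_hat are placeholders.
# print(fit_error(Y, L_true, F_true), fit_error(Y, L_hat, F_hat))
```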
Next steps
Use FLASH (Do not chat with Wei when implementing it!).
Project meeting 20160608
Meeting with Matthew.
Tentative schedule
- Results of manifold learning methods on the Brownian motion simulation involving intermediate states (nodes). PCA looks promising. t-SNE is interesting, too.
- A reflection on the possible biological problem (thus if the project is worth digging)
- RNA / DNA sequencing in single cells, at different snapshots during differentiation / cancer progression, involving multiple cell stages. (Joint modeling of RNA / DNA was suggested by Joyce in light of recent lab technology developments.)
- Would it become an admixture problem if bulk sequencing data were used instead?
- Next steps
- Add noise to the simulation – but I no longer want to work with the Brownian model
- Use a more principled method (for example, Bravo 2009) for tree reconstruction
- Create more realistic simulation assuming availability of both DNA and RNA data
About simulation settings
We want to play more with more established, generic methods (PCA) as opposed to more obscure methods (that not many people have heard of). The PCA result is satisfactory (to Matthew), but this is signal-rich data; we need to add some noise to it.
Add noise by non-uniform sampling
Have different numbers of points on different branches. This is designed to distort PCA.
Add noise to sample data
- Add random N(0, sigma^2) noise to all samples, e.g. noisy gene expression measures
- Add many other iid sites, e.g. only some genes are informative for reconstructing the tree structure (a sketch of both noise modes follows)
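A minimal sketch of both noise modes, assuming a noiseless sample matrix `Y` from the simulation (sigma and the number of null sites are arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(Y, sigma=1.0, n_null=1000):
    """Both noise modes: N(0, sigma^2) measurement noise on every entry,
    plus n_null iid sites that carry no tree signal, appended as columns."""
    noisy = Y + rng.normal(scale=sigma, size=Y.shape)
    null_sites = rng.normal(size=(Y.shape[0], n_null))
    return np.hstack([noisy, null_sites])
```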
About other methods
Factor analysis
For example, SFA and FLASH. Factors correspond to nodes on the tree. We can fit these models. To assess performance, we can run SFA or FLASH and examine whether it finds the 15 factors, by comparing how similar the fitted factors are to the 15 nodes. Or evaluate, for each sample, how many factors it is loaded on: ideally an individual has at most two factors (in between a pair of nodes), so samples should be loaded on either one or two factors (see the assessment sketch below).
Concerns (Matthew): factor analysis may not work well because the factor space is big. For the current example, where there are 15 nodes each with 100 genes, there are 15 × 100 parameters to estimate, which is a lot.
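A hedged sketch of the assessment described above: match fitted factors to the 15 true nodes and count active factors per sample. The correlation-based matching and the loading threshold are my assumptions, not a fixed protocol:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assess_factors(F_hat, F_true, L_hat, thresh=0.05):
    """Match fitted factors to true node factors by absolute correlation,
    then count how many factors each sample is loaded on; under the tree
    model each sample should use at most two."""
    K = F_hat.shape[0]
    C = np.corrcoef(F_hat, F_true)[:K, K:]          # fitted x true correlations
    rows, cols = linear_sum_assignment(-np.abs(C))  # best one-to-one matching
    match_quality = np.abs(C[rows, cols])           # near 1 = node recovered
    n_active = (L_hat > thresh * np.abs(L_hat).max()).sum(axis=1)
    return match_quality, n_active
```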
We should also fit the `CountCluster` package, which constrains loadings to be positive (count data are positive). Ask Kushal if there is a way to restrict K to only 2. We would need to transform BM data to counts though, via `exp` (if we think of BM data as log counts); a sketch follows.
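One hedged way to do the transform, treating BM values as log-means and drawing Poisson counts around them (the Poisson step is an assumption; the notes only say `exp`):

```python
import numpy as np

rng = np.random.default_rng(0)

def bm_to_counts(Y):
    """Exponentiate BM values (treated as log counts) and draw Poisson
    counts around them, giving count data for a CountCluster-style model."""
    return rng.poisson(np.exp(Y))
```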
Phylogenetics methods
We do have more data (internal nodes and in-between nodes), but it is less informative – too much data mixed up – compared to phylogenetics, where only the tips are observed and one tries to reconstruct from there.
What can we learn from phylogenetics methods?
- Parsimony: will it help if we incorporate some parsimony criterion? Matthew thinks there is no simple parsimony algorithm for our problem, and that perhaps a parsimony criterion would be equivalent to maximum spanning tree.
- Maximum likelihood criteria.
- Algorithms (search algorithms) to fit the models.
Other methods
We should also try other published tree construction methods from single-cell papers. Do a probit transformation to convert BM data to binary if necessary, or use a logistic link. Only work with methods that have implementations and are readily usable.
I’ll make a separate issue to list methods to benchmark.
Add constraints to factor analysis
Adding some constraints will help with noisy data. Perhaps we can first add the 2-factor constraint and the positive-factor constraint (meaning individuals should lie in between factors) – like STRUCTURE with a 2-factor constraint. Maybe this is good enough to recover a tree structure even though we do not add any tree constraint yet. We can see if it works and then decide whether to design a tree constraint – hopefully not, because a method thus developed would be less generic.
Maybe we need a positivity constraint on both factors and loadings (expression is always positive).
Possible new development: X = LF; iteratively estimate L and F, where we modify the L updates. If we add a 2-factor constraint, we perform a choose(K, 2) search; but the space is discrete and the optimum is hard to evaluate (a brute-force sketch follows).
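A brute-force sketch of the constrained L-update: an exhaustive choose(K, 2) search per sample, with a closed-form loading for each pair. Everything here is illustrative, not the `pfar` implementation:

```python
import numpy as np
from itertools import combinations

def update_L_two_factor(X, F):
    """One L-update under the 2-factor constraint: for each sample, search
    all choose(K, 2) factor pairs and keep the pair, plus a loading
    w in [0, 1], that best reconstructs the row."""
    N, K = X.shape[0], F.shape[0]
    L = np.zeros((N, K))
    for i in range(N):
        best_err, best_pair, best_w = np.inf, None, 0.0
        for j, k in combinations(range(K), 2):
            d = F[j] - F[k]
            # Least squares for x ~ w*F[j] + (1-w)*F[k], clipped to [0, 1].
            w = float(np.clip((X[i] - F[k]) @ d / (d @ d), 0.0, 1.0))
            err = float(np.sum((X[i] - (w * F[j] + (1 - w) * F[k])) ** 2))
            if err < best_err:
                best_err, best_pair, best_w = err, (j, k), w
        j, k = best_pair
        L[i, j], L[i, k] = best_w, 1.0 - best_w
    return L
```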
Another idea is to use some "knob" analogous to the regularization parameter of the LASSO: as iterations go on, we gradually turn the knob, i.e. increasingly require the number of factors per sample to be 2.
Summary
At this point we should be nosy and curious and ask questions. We are still exploring and learning about the problem.
Project meeting 20160409
About Data
Types of trees
- Temporal: evolution, cell development
- Bifurcating trees
- Spatial: population admixture
- Minimum spanning tree
Clearly there is an interest in low-rank approximation for both types of trees, but can we use a unified method?
About Methods
Off-the-shelf manifold learning methods
- Methods use distances to learn about “shapes”
- Geodesic distances: global & local
- Covariances
Tree construction methods
- `Celltree`: uses LDA distance + ad hoc tree construction
- Bravo (2009): covariances + mixed integer programming
Previous applications of spectral embedding in genetics data
- Marina Meila (2000)
Can we use LDA distance + MIP?
Comments on current results
(PCA is not a model, although lpachter says otherwise)