Monday, May 02, 2016

Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing - implementation -

Rajat just sent me the following:

  Dear Igor,

Thank you for your excellent blog Nuit Blanche!  I have learned a lot from your postings over the years.

Would you be interested in passing along to your readers my recent paper with Graham Heimberg, Hana El-Samad, and Matt Thomson?  It is on one of the few topics you have covered in your blog outside of compressive sensing, namely the growing availability and decreasing cost of RNA sequencing data.  

In our paper, we show that the high read depths conventionally used in RNA sequencing are not needed in cases where the primary results rely on clustering or classification (or other low-dimensional representations) of RNA-seq data.  This is of particular importance for single-cell RNA-seq, where read depth is inherently low due to fundamental limits in the chemistry of capturing RNA.

Our title and abstract are included below.  If this is too far outside the scope of your blog, we completely understand.

Best,
Rajat
I don't think talking about how biology is low dimensional, and how certain sampling bounds therefore apply, is outside the scope of Nuit Blanche :-) Thanks, Rajat! Here is how the paper starts:

The modern engineering discipline of signal processing has demonstrated that structural properties of natural signals can often be exploited to enable new classes of low cost measurements. The central insight is that many natural signals are effectively “low dimensional.” Geometrically, this means that these signals lie on a noisy, low-dimensional manifold embedded in the observed, high-dimensional measurement space. Equivalently, this property indicates that there is a basis representation in which these signals can be accurately captured by a small number of basis vectors relative to the original measurement dimension (Donoho, 2006; Candès et al., 2006; Hinton and Salakhutdinov, 2006). Modern algorithms exploit the fact that the number of measurements required to reconstruct a low-dimensional signal can be far fewer than the apparent number of degrees of freedom. For example, in images of natural scenes, correlations between neighboring pixels induce an effective low dimensionality that allows high-accuracy image reconstruction even in the presence of considerable measurement noise such as point defects in many camera pixels (Duarte et al., 2008). Like natural images, it has long been appreciated that biological systems contain structural features that can lead to an effective low dimensionality in data. Most notably, genes are commonly co-regulated within transcriptional modules; this produces covariation in the expression of many genes (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003). The widespread presence of such modules indicates that the natural dimensionality of gene expression is determined not by the number of genes in the genome but by the number of regulatory modules.

Strangely enough, this figure looks like a sharp phase transition of sorts:


 
Here is the (open) paper: Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing

Summary: A tradeoff between precision and throughput constrains all biological measurements, including sequencing-based technologies. Here, we develop a mathematical framework that defines this tradeoff between mRNA-sequencing depth and error in the extraction of biological information. We find that transcriptional programs can be reproducibly identified at 1% of conventional read depths. We demonstrate that this resilience to noise of “shallow” sequencing derives from a natural property, low dimensionality, which is a fundamental feature of gene expression data. Accordingly, our conclusions hold for ∼350 single-cell and bulk gene expression datasets across yeast, mouse, and human. In total, our approach provides quantitative guidelines for the choice of sequencing depth necessary to achieve a desired level of analytical resolution. We codify these guidelines in an open-source read depth calculator. This work demonstrates that the structure inherent in biological networks can be productively exploited to increase measurement throughput, an idea that is now common in many branches of science, such as image processing.

 The read depth calculator is here: https://thomsonlab.github.io/html/formula.html
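
To get a feel for the claim in the summary, here is a small toy simulation. It is only a sketch under my own assumptions, not the authors' framework and not their read depth calculator: genes are assigned to a handful of hypothetical modules, reads are drawn by multinomial sampling, one million reads per cell stands in for a "conventional" depth, and the top singular vectors of the depth-normalized (uncentered) count matrix stand in for the "transcriptional programs". All names and parameter values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells, n_modules = 2000, 200, 3  # illustrative sizes (assumption)

# Assumption: each gene belongs to exactly one module, and a cell's
# expression profile is a non-negative mix of module activities.
membership = rng.integers(0, n_modules, size=n_genes)
weights = np.zeros((n_genes, n_modules))
weights[np.arange(n_genes), membership] = rng.gamma(2.0, size=n_genes)
activity = rng.gamma(2.0, size=(n_modules, n_cells))
expression = weights @ activity                      # genes x cells, rank <= 3
freq = expression / expression.sum(axis=0, keepdims=True)


def sequence(depth):
    """Draw `depth` reads per cell from that cell's transcript fractions."""
    cols = [rng.multinomial(depth, freq[:, j]) for j in range(n_cells)]
    return np.stack(cols, axis=1).astype(float)


def top_subspace(counts, k):
    """Top-k left singular vectors of the depth-normalized count matrix."""
    x = counts / counts.sum(axis=0, keepdims=True)
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k]


def subspace_overlap(a, b):
    """Mean squared cosine of the principal angles (1.0 = same subspace)."""
    s = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.mean(s ** 2))


# Reference subspace from the stand-in "conventional" depth.
reference = top_subspace(sequence(1_000_000), n_modules)

# Compare shallower experiments against the reference; 10,000 reads is 1%.
overlaps = {}
for depth in (100, 1_000, 10_000, 100_000):
    shallow = top_subspace(sequence(depth), n_modules)
    overlaps[depth] = subspace_overlap(reference, shallow)
    print(f"{depth:>7} reads/cell: overlap with deep subspace = {overlaps[depth]:.3f}")
```

An overlap close to 1 means the shallow experiment recovers essentially the same low-dimensional subspace as the deep one. In this toy setting the subspace should already be well recovered at 1% of the reference depth and degrade only at much lower depths, but the exact numbers depend entirely on the made-up parameters above.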

