Great Lakes Bioinformatics Conference (GLBIO)

May 20, 2019

Tutorial on Dimensionality Reduction Methods for Biomedical Data

Many real-world datasets are high dimensional in their raw form but have low-dimensional structure, groupings, or representations. Dimensionality reduction methods have been applied to various biomedical datasets with the aim of cancer subtype extraction from mutational signatures, genotype-to-phenotype mapping, gene regulatory program identification, unsupervised multi-omics data integration, and cell differentiation trajectory visualization. This workshop will provide an opportunity to explore a handful of powerful dimensionality reduction methods: matrix factorization, PCA/LDA/GDA, t-SNE and UMAP, diffusion map, and autoencoders. All demos and exercises will use real biomedical datasets from single-cell RNA-seq, Hi-C, and de-identified medical records.

Topics

Non-negative Matrix Factorization (NMF)
Principal Component Analysis (PCA) & family
t-SNE and UMAP
Diffusion map
Autoencoders

Datasets

Single-cell RNA-seq datasets
Hi-C datasets
De-identified medical records

Presenters

Brittany Baur

Postdoctoral fellow in Genomic Sciences Training Program (GSTP) | Wisconsin Institute for Discovery | University of Wisconsin, Madison

Da-Inn Erika Lee

PhD student in Biomedical Data Science | Department of Biostatistics & Medical Informatics | University of Wisconsin, Madison

Xiaotong Liu

PhD student in Bioinformatics and Computational Biology | University of Minnesota, Twin Cities

Henry Ward

PhD student in Bioinformatics and Computational Biology | University of Minnesota, Twin Cities

Agenda

10:00

Introduction

Everyone

We’ll introduce ourselves, briefly talk about the motivation behind the tutorial, and go over the plan for the workshop.

15:00

Non-negative Matrix Factorization (NMF)

Da-Inn Erika Lee

Datasets can often be represented as a matrix. Think of the Netflix dataset, where the rows represent users, the columns represent movies, and each element of the matrix is a star rating. Matrix factorization yields lower-dimensional factors that allow co-clustering, e.g. groups of users with similar taste and the groups of movies they tend to like. We’ll go over the intuition behind matrix factorization, its popular variants, and a demo with a single-cell RNA-seq dataset.
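To make the co-clustering idea concrete, here is a minimal sketch of non-negative matrix factorization using the Lee-Seung multiplicative updates in NumPy. It is not the tutorial's demo code: the function name `nmf`, the rank `k`, the iteration count, and the tiny "users x movies" matrix are all illustrative assumptions.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (n x m) into W (n x k) and H (k x m)
    with Lee-Seung multiplicative updates for the squared Frobenius loss."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not increase the loss.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy ratings: users 0-1 like movies 0-2, users 2-3 like movies 3-5.
X = np.array([
    [5, 4, 5, 0, 0, 0],
    [4, 5, 4, 0, 0, 0],
    [0, 0, 0, 5, 4, 5],
    [0, 0, 0, 4, 5, 4],
], dtype=float)

W, H = nmf(X, k=2)

# Co-clustering: assign each row and each column to its dominant factor.
row_clusters = W.argmax(axis=1)
col_clusters = H.argmax(axis=0)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Rows of `W` and columns of `H` share the same `k` factors, so taking the dominant factor on each side groups users and movies at once. For a single-cell expression matrix the same reading would apply with cells and genes in place of users and movies.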

15:00

Principal Component Analysis (PCA) and its variants

Xiaotong Liu

Principal component analysis (PCA) is widely used as a computational technique for dimensionality reduction and data visualization. It reduces the complexity of high-dimensional biological data while capturing the major directions of variance within the data. In this tutorial, we will present the mathematical basis of PCA and its applications in biological studies. Variants of PCA, including linear discriminant analysis (LDA) and generalized discriminant analysis (GDA), will also be discussed with regard to their use in dimensionality reduction.
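As a rough illustration of the mathematical basis, PCA can be sketched as a singular value decomposition of the mean-centered data matrix. The function name `pca` and the synthetic data (100 samples, 50 features, generated from 2 latent factors plus noise) are assumptions made for this example only. They are not taken from the tutorial materials.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    # Fraction of the total variance captured by each retained component.
    explained_ratio = S[:n_components] ** 2 / np.sum(S ** 2)
    return scores, components, explained_ratio

# Synthetic stand-in for an expression matrix: samples x features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
loadings = rng.normal(size=(2, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 50))

scores, components, explained_ratio = pca(X, n_components=2)
```

Because the synthetic data are driven by two latent factors, the first two components should account for nearly all of the variance. On real biological data the explained-variance ratios usually decay more gradually.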

15:00

t-SNE and UMAP

Da-Inn Erika Lee and Henry Ward

t-SNE and UMAP are approaches whose main aim is visualization: projecting high-dimensional datasets into 2- or 3-dimensional space to reveal easily discernible patterns or clusters. Intuitively, these methods aim to preserve the similarity or distance between data points in the original high-dimensional space when the data points are projected to 2D or 3D space. We will cover their differences and their application to single-cell RNA-seq data.
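The intuition of "preserving similarity" can be sketched with the kind of objective t-SNE minimizes: the Kullback-Leibler divergence between pairwise affinities in the original space and in the embedding. The code below is a simplified illustration, not a t-SNE implementation. It uses one fixed Gaussian bandwidth `sigma` instead of the per-point bandwidths that t-SNE calibrates from the perplexity, and it only scores two candidate embeddings rather than optimizing one. All function names and the synthetic two-cluster data are assumptions.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.maximum(D, 0.0)

def gaussian_affinities(X, sigma):
    """High-dimensional similarities (simplified: one global bandwidth)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def student_t_affinities(Y):
    """Low-dimensional similarities with the heavy-tailed Student-t kernel."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over all off-diagonal pairs."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

# Two well-separated synthetic clusters in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
X[20:, 0] += 10.0

P = gaussian_affinities(X, sigma=3.0)

# Candidate 1 keeps the cluster structure; candidate 2 scrambles the points.
Y_good = X[:, :2]
Y_shuffled = Y_good[rng.permutation(40)]

kl_good = kl_divergence(P, student_t_affinities(Y_good))
kl_shuffled = kl_divergence(P, student_t_affinities(Y_shuffled))
```

An embedding that keeps neighbors together gets the lower divergence, and that is the quantity gradient descent drives down in t-SNE. UMAP follows a related recipe with a different affinity construction and a cross-entropy-style loss, which this sketch does not cover.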

15:00

Short break

15:00

Diffusion map and spectral clustering

Brittany Baur

Diffusion maps are a non-linear dimensionality reduction technique used in many areas, such as defining differentiation trajectories in single-cell analysis. Diffusion maps aim to identify the underlying lower-dimensional structure (manifold) that the data has been sampled from. Unlike PCA, diffusion maps create a lower-dimensional representation even when the underlying manifold is non-linear. Diffusion maps provide a global description of the dataset by characterizing the relationships between samples using heat diffusion and a random-walk Markov chain, and then creating a low-dimensional embedding. Spectral clustering is an application of diffusion maps that is used extensively to identify communities in graphs.
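The steps described above (kernel, random-walk Markov chain, spectral embedding) can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: a Gaussian kernel with a single hand-picked bandwidth `sigma`, no density normalization, and a synthetic "trajectory" of 60 points along a noisy half-circle in 5 dimensions. The function name `diffusion_map` and all parameter values are illustrative.

```python
import numpy as np

def diffusion_map(X, n_components=2, sigma=1.0, t=1):
    """Embed the rows of X using eigenvectors of a random-walk Markov matrix."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-D2 / (2.0 * sigma ** 2))      # Gaussian (heat) kernel
    d = K.sum(axis=1)                         # node degrees
    # P = D^-1 K is the Markov matrix; S = D^-1/2 K D^-1/2 is symmetric
    # and shares its eigenvalues, so the decomposition is numerically stable.
    S = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    psi = eigvecs / np.sqrt(d)[:, None]       # right eigenvectors of P
    # Skip the trivial constant eigenvector (eigenvalue 1).
    idx = slice(1, n_components + 1)
    embedding = psi[:, idx] * eigvals[idx] ** t
    return embedding, eigvals

# Synthetic trajectory: points ordered along a half-circle, plus small noise.
rng = np.random.default_rng(0)
theta = np.linspace(0.0, np.pi, 60)
X = np.zeros((60, 5))
X[:, 0] = np.cos(theta)
X[:, 1] = np.sin(theta)
X += 0.02 * rng.normal(size=X.shape)

embedding, eigvals = diffusion_map(X, n_components=2, sigma=0.2)
```

On this toy curve the first diffusion coordinate, `embedding[:, 0]`, varies monotonically with position along the arc. This is the property that lets diffusion components serve as a pseudo-ordering along differentiation trajectories.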

15:00

Autoencoders and dimensionality reduction with neural nets

Henry Ward

05:00

Wrap-up

Everyone

We’ll point you to the tutorial website, code repository, and recommended reading list.

Materials

Slide decks and demo code can be found here:
https://github.com/dimension-reduction/slides-and-code

Recommended Readings

Gene prioritization using Bayesian matrix factorization with genomic and phenotypic side information

Zakeri et al. Bioinformatics. 2018

Genes mirror geography within Europe

Novembre et al. Nature. 2008

How to Use t-SNE Effectively

Wattenberg et al. Distill. 2016

Metagenes and molecular pattern discovery using matrix factorization

Brunet et al. Proc Natl Acad Sci USA. 2004

Reducing the dimensionality of data with neural networks

Hinton & Salakhutdinov. Science. 2006

Spectral clustering using Nyström approximation for the accurate identification of cancer molecular subtypes

Shi & Xu. Sci Rep. 2017

Diffusion maps for high-dimensional single-cell analysis of differentiation data

Haghverdi et al. Bioinformatics. 2015

Dimensionality reduction for visualizing single-cell data using UMAP

Becht et al. Nat Biotechnol. 2018

Enter the Matrix: Factorization Uncovers Knowledge from Omics

Stein-O’Brien et al. Trends Genet. 2018

Multi-Omics Factor Analysis - a framework for unsupervised integration of multi-omics data sets

Argelaguet et al. Mol Syst Biol. 2018