interactive scatterplot for visually exploring pairwise correlation
An implementation of the s-CorrPlot in R using C++ and OpenGL.
The s-CorrPlot is an interactive scatterplot for visually exploring pairwise correlation coefficients between variables in large datasets. Variables are projected as points on a scatterplot with respect to some user-selected variables of interest, driven by a geometric interpretation of correlation. The correlation of all other variables to the selected one is indicated by vertical gridlines in the plot. By selecting new variables of interest, a user can create simple tours of the correlation space through animations between different projections of the data.
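The geometric interpretation can be sketched in base R without the package. This is a minimal illustration, not the tool's implementation: it assumes Pearson correlation and a secondary axis that is orthogonalized against the primary one. The names unit_rows, X, p and s_orth are hypothetical and do not come from scorr.

```r
set.seed(1)
# 50 variables (rows), each with 8 observations (columns)
X <- matrix(rnorm(50 * 8), nrow = 50)

# Center each row and scale it to unit length, so that the dot product
# of two rows equals their Pearson correlation.
unit_rows <- function(m) {
  m <- m - rowMeans(m)
  m / sqrt(rowSums(m^2))
}
Z <- unit_rows(X)

p <- Z[1, ]                              # primary variable of interest
s <- Z[2, ]                              # secondary variable of interest
s_orth <- s - sum(s * p) * p             # remove the component along p
s_orth <- s_orth / sqrt(sum(s_orth^2))   # renormalize to unit length

x <- as.vector(Z %*% p)        # horizontal position = correlation to primary
y <- as.vector(Z %*% s_orth)   # vertical position

# plot(x, y, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1)
```

In this sketch the primary variable lands at (1, 0), every point falls inside the unit circle, and the horizontal coordinate of each point can be read as its correlation to the primary variable.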
For further details about the s-CorrPlot, please refer to our companion paper.
The call scorr(data) will initialize the s-CorrPlot tool with your input dataset. In this view, the default projection is based on the first two principal component vectors of the dataset, but you can select any other variables in the plot to animate to a new projection of the data. These principal components are illustrated as interactive bar-charts on the left of the screen. The dataset must provide a factor label for every variable, and each unique label is given its own color as shown in the boxes along the bottom. Lastly, on the right are a parallel coordinate plot showing a profile of the observations for each variable and a text display for all variables near the cursor in the display.
The s-CorrPlot expects a data frame in which each row is a variable and each column is an observation, with rownames and colnames naming the variables and observations respectively. Additionally, a final column of nominal labels is expected (the factor data type), so that each variable gets an associated color in the plot. Each row variable will become a point in the final s-CorrPlot display, and the observations will be displayed for these points using the parallel coordinates view. For further details, please see the package's provided data data(NAME) as well as the data frame constructed in the example.
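The layout described above can be checked before starting the tool. The sketch below uses base R only; check_scorr_input and the toy dataset are hypothetical illustrations and are not provided by the package.

```r
# Hypothetical helper (not part of scorr): verify the expected layout.
check_scorr_input <- function(df) {
  stopifnot(is.data.frame(df))
  d <- ncol(df)
  stopifnot(
    d >= 2,                       # at least one observation column plus labels
    is.factor(df[[d]]),           # final column holds the nominal labels
    all(vapply(df[seq_len(d - 1)], is.numeric, logical(1)))  # numeric observations
  )
  invisible(TRUE)
}

# Toy dataset: 4 variables (rows) measured over 3 observations (columns).
toy <- data.frame(
  obs1 = c(0.1, 0.9, 0.4, 0.7),
  obs2 = c(0.3, 0.8, 0.5, 0.2),
  obs3 = c(0.6, 0.7, 0.1, 0.3),
  row.names = c("varA", "varB", "varC", "varD")
)
toy$labels <- factor(c("up", "up", "down", "down"))  # labels go last, as a factor

check_scorr_input(toy)
```

Running a check like this first gives a clear R error for a malformed data frame, for example one whose labels are stored as character strings rather than as a factor.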
Hovering the mouse over the data points will update the parallel coordinate plot and text display views to show the variables near the cursor. Additionally, you can left-click variables or the bar-charts to select them as the new primary variable of interest. The plot then animates so that this point moves to the right-most part of the circular plot. All correlation coefficients to this primary variable can be read directly from the vertical gridlines, where anti-correlated variables are furthest away and zero correlation is near the middle.
Additionally, you can right-click the bars or points in order to select them as the new secondary variable of interest, and then the s-CorrPlot will animate the projection by moving all variables vertically to their new locations. It is also possible to shift-click on variables in order to affix them in the parallel coordinates plot for quick comparison against other variables.
To analyze correlation with Spearman’s rank correlation, press ’s’. You can return to the default of Pearson’s correlation by pressing ’p’.
To remove all highlighting and selected variables in the display, press ’c’.
To adjust transparency of data points, press the left and right arrow keys.
To adjust the size of data points, press the up and down arrow keys.
To adjust the kernel bandwidth for density estimation, press ’<’ or ’>’.
To change the background color of the plot, press ’b’.
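The difference between the two correlation modes toggled with ’s’ and ’p’ can be illustrated in base R. This sketch only uses stats::cor and made-up data; it does not show how the tool computes its projections.

```r
set.seed(42)
x <- rexp(30)
y <- x^3 + rnorm(30, sd = 0.01)   # monotone but nonlinear relationship

pearson  <- cor(x, y)                       # method = "pearson" is the default
spearman <- cor(x, y, method = "spearman")

# Spearman's coefficient is Pearson's coefficient computed on the ranks.
spearman_by_rank <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, by_rank = spearman_by_rank))
```

Because Spearman's coefficient depends only on ranks, it is less sensitive to outliers and to nonlinear but monotone relationships than Pearson's.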
Once you have initialized the s-CorrPlot tool, there are several commands to interact with the display from the R console.
The primary or secondary projection can be set directly by passing in a vector with the observations stored in the columns, i.e. with length equal to the number of observations in the input data. These can be input using scorr.set.primary() and scorr.set.secondary(). Additionally, these same vectors can be output to R using the associated get functions.
The values used for density estimation in the current view of the tool can be output back into R using the call scorr.get.density().
To retrieve the n most correlated or anti-correlated points in the current projection (i.e. the right-most and left-most points, respectively), call scorr.get.cor(n) or scorr.get.acor(n). Each returns a data frame with n rows whose columns give each variable's index in the input data, its correlation to the primary variable, and its rowname. Similarly, you can call scorr.get.corr(r) or scorr.get.acorr(r) to output all variables above or below the given correlation coefficient, respectively.
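For readers without a running display, the kind of query described here can be approximated in base R. The sketch below is an assumption-laden illustration, not the package's code: top_correlated is a hypothetical helper, and the column names index, correlation and name are chosen for this example only.

```r
# Hypothetical helper (not part of scorr): the n variables (rows of m)
# most correlated, or most anti-correlated, with a primary vector.
top_correlated <- function(m, primary, n, decreasing = TRUE) {
  r <- as.vector(cor(t(m), primary))   # correlation of every row with primary
  idx <- order(r, decreasing = decreasing)[seq_len(min(n, length(r)))]
  data.frame(index = idx, correlation = r[idx], name = rownames(m)[idx])
}

set.seed(7)
m <- matrix(runif(200 * 10), nrow = 200,
            dimnames = list(paste("name", 1:200, sep = "-"), NULL))
primary <- m[3, ]

top  <- top_correlated(m, primary, 5)                      # most correlated
anti <- top_correlated(m, primary, 5, decreasing = FALSE)  # most anti-correlated

# Threshold-style query: indices of all variables with correlation above 0.5
above <- which(as.vector(cor(t(m), primary)) > 0.5)
```

The primary variable itself always appears first in top, since its correlation with itself is 1.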
To highlight a set of variables, pass a list of indices to highlight into the command scorr.highlight.index() or a list of names to scorr.highlight.name().
The call scorr.close() will close the interactive tool safely.
The function scorr.plot() takes data and two variables as input in order to produce a static version of the s-CorrPlot, in case the interactive tool is unable to build or run on a machine. This is useful on Windows computers, where the interactive tool cannot be compiled.
scorr(data, alpha = 0.1, useDensity = T, showProfile = T, showPatch = F, threshold = 0, coloring = 1 : nrow(data), perms = 0)
data: A data frame with d columns and n rows. Each row is a variable, the first 1:(d-1) columns are observations, and the dth column is a label assignment in the form of a factor. Names for the variables should be supplied in rownames(data), and names for each column can be supplied in colnames(data).
alpha: Set the initial transparency of points.
useDensity: Set transparency of the plotted variables based on a density estimate of the distribution along the correlation sphere.
showProfile: Show the parallel coordinates plot.
showPatch: Show each variable as a greyscale patch image.
threshold: Variables with variance below the threshold are excluded.
coloring: Alternate continuous values for each variable for coloring the data instead of using labels (switch by pressing c).
perms: Number of permutations for the permutation test of the projection. The permutation test computes the probability that there is a projection with more points with correlation r > a for a >= 0 and r < a for a >= 0 than the currently shown projection.
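The effect of the threshold parameter can be pictured with a small base R sketch. This is an illustration under assumptions, not the package's code: the variable names and the cutoff value are made up for the example.

```r
set.seed(3)
m <- matrix(rnorm(6 * 20), nrow = 6)   # 6 variables, 20 observations each
m[2, ] <- 5                            # a constant row has zero variance

row_var <- apply(m, 1, var)            # variance of each variable (row)
threshold <- 1e-8                      # illustrative cutoff for this sketch

# Keep only the variables whose variance is not below the threshold.
kept <- m[row_var >= threshold, , drop = FALSE]
nrow(kept)
```

Filtering of this kind matters because the correlation of a constant variable with anything else is undefined.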
scorr.set.primary(v)
v: Vector with d-1 columns in order to set a new projection.
scorr.set.secondary(v)
v: Vector with d-1 columns in order to set a new projection.
scorr.get.primary()
output the primary variable vector of interest for the current projection
scorr.get.secondary()
output the secondary variable vector of interest for the current projection
scorr.get.size()
get the number of points in the plot
scorr.get.selected()
output the list of selected points
scorr.get.density()
get current density values of all points in the scatterplot
scorr.get.corr(r)
r: Correlation coefficient to threshold which variables to output.
scorr.get.acorr(r)
r: Correlation coefficient to threshold which variables to output.
scorr.get.cor(n)
n: Number of top anti-/correlations to retrieve.
scorr.get.acor(n)
n: Number of top anti-/correlations to retrieve.
scorr.get.name(indices)
indices: Variable indices to retrieve the names of, likely in order to highlight them in the s-CorrPlot display.
scorr.highlight.index(indices)
indices: Variable indices to highlight in the s-CorrPlot display.
scorr.highlight.name(names)
names: Data point names to highlight in the s-CorrPlot display
scorr.close()
close scorr
scorr.plot(data2, v2, v2, alpha = 0.1)
data2: A data frame where rows correspond to variables. May include the final column as factor labels for automatic coloring but not required.
v2: Both a primary and secondary vector for defining a static s-CorrPlot view. Can include labels from the rows or pipe in from the interactive display.
alpha: Set the initial transparency of points.
This is a code example to create a random dataset and view it in scorr.
# load library
library(scorr)
# create a uniform random dataset
# 10,000 variables with 10 observations each
m = matrix(runif(10 * 10000), ncol = 10, nrow = 10000);
data = data.frame(m)
# add random labels (as factors!)
data$labels = as.factor(ceiling((0.1 + runif(10000)) * 4))
levels(data$labels) = c("A", "B", "C", "D", "E")
# name each variable "name-1", "name-2", ... (paste is vectorized)
rownames(data) = paste("name", 1 : 10000, sep = "-")
# start up the s-CorrPlot interactive display
# density estimation is enabled by default (useDensity = T)
scorr(data)
# get the number of variables or points on the screen
scorr.get.size()
# select a new variable of interest
# parentheses are needed: 1 : ncol(data) - 1 parses as (1 : ncol(data)) - 1
scorr.set.primary(data[3, 1 : (ncol(data) - 1)])
# query the currently selected primary variable of interest (as index)
scorr.get.selected()
# query all points for their density estimation values
# MUST have enabled density estimation earlier to do so!
density = scorr.get.density()
# query the positive correlations between 0.95 and 1 in our current projection
corr = scorr.get.corr(0.95)
# query the negative correlations between -0.95 and -1 in our current projection
acorr = scorr.get.acorr(0.95)
# query the top 100 correlations in our current projection
cor = scorr.get.cor(100)
# query the top 100 anti-correlations in our current projection
acor = scorr.get.acor(100)
# highlight our first ten variables in the plot
scorr.highlight.index(1 : 10);
# highlight some other variables by their names
scorr.highlight.name(c("name-45", "name-101"))
# create a static plot of our current view
# must have called scorr() first in order to use the get functions!
scorr.plot(data, scorr.get.primary(), scorr.get.secondary(), 0.2)
# before closing R, be sure to close the s-CorrPlot!
# you can call the function below or just press 'q' in the tool
scorr.close()
If you have any difficulties or questions, please contact sean@cs.utah.edu.