Compare Groups + Vizzz

prise6, 02/10/2018

Motivations

I found a really great package recently called compareGroups that build bivariate tables. I think it's a great work and i want to go further in the rendering.

According to my short researchs, this package have been existing since 2014, but is recently out on github and the CRAN.

V4 is currently only available on github:

library(devtools)
devtools::install_github(repo = "isubirana/compareGroups")

(this package has many dependences)

This tutorial is not a copy of the wonderfull vignette of the package because it's really well explained. Again, great job, really useful !

Goal

Let's say you have a clustering to perform, and you'd like to explain your clusters with figures and graphs. This will be a dump example with the ... iris dataset. This post does not focus on clustering, it's only a pretext to have this final rendering:

All Cluster 1 Cluster 2 Cluster 3 p-value
N=150 N=53 N=47 N=50
Sepal.Length 5.84 (0.83) 5.80 (0.41) 6.78 (0.49) 5.01 (0.35) <0.001
Sepal.Width 3.06 (0.44) 2.67 (0.25) 3.10 (0.26) 3.43 (0.38) <0.001
Petal.Length 3.76 (1.77) 4.37 (0.56) 5.51 (0.64) 1.46 (0.17) <0.001
Petal.Width 1.20 (0.76) 1.41 (0.31) 1.97 (0.33) 0.25 (0.11) <0.001

Let's see the step we're going through

  1. Packages
  2. Data and global variables
  3. Perform quick clustering with PCA and K-means
  4. Compare clusters
  5. Compare clusters with graphs and figures

1. Packages

We need to load and attach few packages.

library(compareGroups)
library(xml2)
library(htmltools)
library(knitr)
library(kableExtra)
library(ggplot2)
library(data.table)

2. Data and global variables

I set differents variables that we need along the script. Obvisously i set le NB_CLUSTERS to 3 because it's not the point of this post to find the best number of clusters.

# global variables
VARIABLES             = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
VARIABLES_AND_CLUSTER = c("cluster", VARIABLES)
NB_CLUSTERS           = 3
NSTART                = 100
SEED_KM               = 234

We fake that iris dataset doesn't contain any group and we want to cluster observations. We create iris.x datatable on which we'll perform the analysis.

# load dataset
data(iris)
# make it datatable
setDT(iris)
# create iris.x
iris.x = iris[, ..VARIABLES]
str(iris.x)
## Classes 'data.table' and 'data.frame':   150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

3. Perform quick clustering with PCA and K-means

Before computing K-means i'd like to project iris.x to one or two dimension space thanks to PCA.

# PCA scaled
PCA = prcomp(iris.x, scale. = T, center = T)
# add PCA components to iris.x datatable
iris.x = cbind(iris.x, as.data.table(PCA$x))
# see that the two first component explains majority of variability
cumsum(PCA$sdev)/sum(PCA$sdev)
## [1] 0.5352972 0.8348653 0.9549021 1.0000000

We chose two dimensions to compute K-means with 3 distinct centers and test 100 intials configurations.

# reproducibility
set.seed(SEED_KM)
# K-means
model.km = kmeans(iris.x[, list(PC1, PC2)], centers = NB_CLUSTERS, nstart = NSTART)
# add cluster columng to iris.x datatable
iris.x$cluster = model.km$cluster
# transform into factor with nice labels
iris.x[, cluster := factor(
  x      = cluster,
  levels = seq_len(NB_CLUSTERS),
  labels = paste("Cluster", seq_len(NB_CLUSTERS))
)]

4. Compare clusters

Let's use compareGroups to compute mean of the observations foreach variables and clusters.

comparegroups.main = compareGroups(
  formula          = cluster ~ .,
  data             = iris.x[, ..VARIABLES_AND_CLUSTER]
)
comparegroups.main
## 
## 
## -------- Summary of results by groups of 'cluster'---------
## 
## 
##   var          N   p.value  method            selection
## 1 Sepal.Length 150 <0.001** continuous normal ALL      
## 2 Sepal.Width  150 <0.001** continuous normal ALL      
## 3 Petal.Length 150 <0.001** continuous normal ALL      
## 4 Petal.Width  150 <0.001** continuous normal ALL      
## -----
## Signif. codes:  0 '**' 0.05 '*' 0.1 ' ' 1

Above, we print the summary of the future table we're creating. Notice that p.value are computed, if distribution is normal, anova is performed. Otherwise you'll find great explanations on this vignette.

Next, we create our table, with a ALL column, meaning all our observations.

comparegroups.main.table = createTable(
  x        = comparegroups.main,
  show.all = T
)
comparegroups.main.table
## 
## --------Summary descriptives table by 'cluster'---------
## 
## ______________________________________________________________________ 
##                 [ALL]     Cluster 1   Cluster 2   Cluster 3  p.overall 
##                 N=150       N=53        N=47        N=50               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Sepal.Length 5.84 (0.83) 5.80 (0.41) 6.78 (0.49) 5.01 (0.35)  <0.001   
## Sepal.Width  3.06 (0.44) 2.67 (0.25) 3.10 (0.26) 3.43 (0.38)  <0.001   
## Petal.Length 3.76 (1.77) 4.37 (0.56) 5.51 (0.64) 1.46 (0.17)  <0.001   
## Petal.Width  1.20 (0.76) 1.41 (0.31) 1.97 (0.33) 0.25 (0.11)  <0.001   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Then, we can render this table into markdown format thanks to export2md() function using knitr. We store the result in comparegroups.html variable to modify content later.

comparegroups.html = suppressWarnings(
  export2md(
    x             = comparegroups.main.table,
    caption       = "",
    header.labels = c(
      "all"       = "All",
      "p.overall" = "p-value"
    )
  )
)
comparegroups.html
All Cluster 1 Cluster 2 Cluster 3 p-value
N=150 N=53 N=47 N=50
Sepal.Length 5.84 (0.83) 5.80 (0.41) 6.78 (0.49) 5.01 (0.35) <0.001
Sepal.Width 3.06 (0.44) 2.67 (0.25) 3.10 (0.26) 3.43 (0.38) <0.001
Petal.Length 3.76 (1.77) 4.37 (0.56) 5.51 (0.64) 1.46 (0.17) <0.001
Petal.Width 1.20 (0.76) 1.41 (0.31) 1.97 (0.33) 0.25 (0.11) <0.001

5. Compare clusters with graphs and figures

First, we need to make small plots. Plots are images that we need to encode into base64 string. The idea is to store the image inside the HTML thanks to data URI scheme

For example, this HTML code is an html <img> tag sourcing (src attribut) a base64 image thanks to URI scheme:

<img src="
ANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4
//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU
5ErkJggg==" alt="Red dot" />

In your browser this HTML is a red dot:

Red dot

Function to generate plot

generatePlot() asks for the selected cluster and the whole dataset containing PC1 and PC2. It returns a ggplot2 object with two geom_point layer. In black, the non selcted clusters. In red, the selected cluster. PC1 as x-axis and PC2 as y-axis.

generatePlot = function(cluster_s, data) {
  general_graph_data = data[!cluster %in% cluster_s, list(PC1, PC2)]
  cluster_graph_data = data[cluster %in% cluster_s, list(PC1, PC2)]
  
  g = ggplot() +
    geom_point(data = general_graph_data, aes(x = PC1, y = PC2), alpha = .3) +
    geom_point(data = cluster_graph_data, aes(x = PC1, y = PC2), color = "red") +
    theme_void()
  
  return(g)
}

Example:

(p = generatePlot("Cluster 1", iris.x))

Function to generate data URI scheme

We need to save our plot as an image, read it thanks to knitr::image_uri() function that create data URI scheme.

plotToURI() asks for the ggplot object to transform and additionnal arguments for ggsave() function.

plotToURI = function(plot, file = NULL, ...) {
  
  if(is.null(file)) file = paste(tempfile(), "png", sep = ".")
  
  ggsave(filename = file, plot = plot, ...)
  uri = knitr::image_uri(file)
  
  return(uri)
}
pURI = plotToURI(p, width = 3, height = 3, units = "cm", dpi = 100)

See what is like.

str(pURI, nchar.max = 80)
##  chr ""| __truncated__

If we use img tag with the htmltools package around that string we observe this image:

tags$img(src = pURI)

HTML Tables

Our table is made up of HTML tags. Rows are coded with <tr> tag and we want to add a new row where i wrote the comment <!-- plot row -->

<table>
<caption></caption>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> All </th>
   <th style="text-align:center;"> Cluster 1 </th>
   <th style="text-align:center;"> Cluster 2 </th>
   <th style="text-align:center;"> Cluster 3 </th>
   <th style="text-align:center;"> p-value </th>
  </tr>
 </thead>
<tbody>
  <!-- plot row -->
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> N=150 </td>
   <td style="text-align:center;"> N=53 </td>
   <td style="text-align:center;"> N=47 </td>
   <td style="text-align:center;"> N=50 </td>
   <td style="text-align:center;">  </td>
  </tr>
  ...
</tbody>
</table>

So we need to add to the HTML table a block like :

<tr>
  <td style="text-align:left;"></td>
  <td style="text-align:left;"></td>
  <td style="text-align:center;">
    <img src="data:image/png;base64,..."/>
  </td>
  <td style="text-align:center;">
    <img src="data:image/png;base64,..."/>
  </td>
  <td style="text-align:center;">
    <img src="data:image/png;base64,..."/>
  </td>
  <td style="text-align:left;"></td>
</tr>

Let's create this html block by looping on clusters.

# create img tags and td tags around
html_block = lapply(paste("Cluster", 1:3), function(cl) {
  p = generatePlot(cl, iris.x)
  uri = plotToURI(p, width = 3, height = 3, units = "cm", dpi = 100)
  
  tags$td(style = "text-align:center;", tags$img(src = uri))
})

# fill the block with td tags in empty cells
html_block = tagList(
  tags$td(style = "text-align:left;"),
  tags$td(style = "text-align:left;"),
  html_block,
  tags$td(style = "text-align:left;")
)

# we wrap the block in a tr tag
html_block = tags$tr(html_block)

Parse HTML

Ok, we did the hardest part. Now, we want to combine the table and the html block. This is where we need xml2 package to parse XML/HTML elements.

We have to create a XML document instance thanks to read_html()/read_xml() function in order to manipulate HTML elements.

# xml instance of comparegroups.html
comparegroups.html.parse = read_html(as.character(comparegroups.html))
# xml instance of html_block
plot_row = read_xml(doRenderTags(html_block, indent = F))

Then, we look for the first column after <tbody> tag because we want to insert our plot_row before this first column.

# find reference to the first column after tbody
first_tbody_tr = xml_find_first(comparegroups.html.parse, "//tbody//tr")
# add sibling before first_tbody_tr
xml_add_sibling(first_tbody_tr, .value = plot_row, .where = "before")

We update our inital table object by replacing the old HTML table with the newest.

comparegroups.html[1] = as.character(comparegroups.html.parse)

Render HTML

To finish, we use kableExtra package to generate fashion tables.

comparegroups.html %>%
  kable_styling(font_size = 12, bootstrap_options = "responsive") %>%
  column_spec(column = 2, bold = T)
All Cluster 1 Cluster 2 Cluster 3 p-value
N=150 N=53 N=47 N=50
Sepal.Length 5.84 (0.83) 5.80 (0.41) 6.78 (0.49) 5.01 (0.35) <0.001
Sepal.Width 3.06 (0.44) 2.67 (0.25) 3.10 (0.26) 3.43 (0.38) <0.001
Petal.Length 3.76 (1.77) 4.37 (0.56) 5.51 (0.64) 1.46 (0.17) <0.001
Petal.Width 1.20 (0.76) 1.41 (0.31) 1.97 (0.33) 0.25 (0.11) <0.001

Go further