prise6, 02/10/2018
I found a really great package recently called compareGroups that build bivariate tables. I think it's a great work and i want to go further in the rendering.
According to my short researchs, this package have been existing since 2014, but is recently out on github and the CRAN.
V4 is currently only available on github:
library(devtools)
devtools::install_github(repo = "isubirana/compareGroups")
(this package has many dependences)
This tutorial is not a copy of the wonderfull vignette of the package because it's really well explained. Again, great job, really useful !
Let's say you have a clustering to perform, and you'd like to explain your clusters with figures and graphs. This will be a dump example with the ... iris
dataset. This post does not focus on clustering, it's only a pretext to have this final rendering:
All | Cluster 1 | Cluster 2 | Cluster 3 | p-value | |
---|---|---|---|---|---|
|
|
|
|||
N=150 | N=53 | N=47 | N=50 | ||
Sepal.Length | 5.84 (0.83) | 5.80 (0.41) | 6.78 (0.49) | 5.01 (0.35) | <0.001 |
Sepal.Width | 3.06 (0.44) | 2.67 (0.25) | 3.10 (0.26) | 3.43 (0.38) | <0.001 |
Petal.Length | 3.76 (1.77) | 4.37 (0.56) | 5.51 (0.64) | 1.46 (0.17) | <0.001 |
Petal.Width | 1.20 (0.76) | 1.41 (0.31) | 1.97 (0.33) | 0.25 (0.11) | <0.001 |
Let's see the step we're going through
We need to load and attach few packages.
compareGroups
: THE package we want to hackxml2
: DOM manipulationhtmltools
: html tags manipulationknitr
: dynamic report generationggplot2
: draw some graphslibrary(compareGroups)
library(xml2)
library(htmltools)
library(knitr)
library(kableExtra)
library(ggplot2)
library(data.table)
I set differents variables that we need along the script. Obvisously i set le NB_CLUSTERS
to 3 because it's not the point of this post to find the best number of clusters.
# global variables
VARIABLES = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
VARIABLES_AND_CLUSTER = c("cluster", VARIABLES)
NB_CLUSTERS = 3
NSTART = 100
SEED_KM = 234
We fake that iris
dataset doesn't contain any group and we want to cluster observations. We create iris.x
datatable on which we'll perform the analysis.
# load dataset
data(iris)
# make it datatable
setDT(iris)
# create iris.x
iris.x = iris[, ..VARIABLES]
str(iris.x)
## Classes 'data.table' and 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Before computing K-means i'd like to project iris.x
to one or two dimension space thanks to PCA.
# PCA scaled
PCA = prcomp(iris.x, scale. = T, center = T)
# add PCA components to iris.x datatable
iris.x = cbind(iris.x, as.data.table(PCA$x))
# see that the two first component explains majority of variability
cumsum(PCA$sdev)/sum(PCA$sdev)
## [1] 0.5352972 0.8348653 0.9549021 1.0000000
We chose two dimensions to compute K-means with 3 distinct centers and test 100 intials configurations.
# reproducibility
set.seed(SEED_KM)
# K-means
model.km = kmeans(iris.x[, list(PC1, PC2)], centers = NB_CLUSTERS, nstart = NSTART)
# add cluster columng to iris.x datatable
iris.x$cluster = model.km$cluster
# transform into factor with nice labels
iris.x[, cluster := factor(
x = cluster,
levels = seq_len(NB_CLUSTERS),
labels = paste("Cluster", seq_len(NB_CLUSTERS))
)]
Let's use compareGroups
to compute mean of the observations foreach variables and clusters.
comparegroups.main = compareGroups(
formula = cluster ~ .,
data = iris.x[, ..VARIABLES_AND_CLUSTER]
)
comparegroups.main
##
##
## -------- Summary of results by groups of 'cluster'---------
##
##
## var N p.value method selection
## 1 Sepal.Length 150 <0.001** continuous normal ALL
## 2 Sepal.Width 150 <0.001** continuous normal ALL
## 3 Petal.Length 150 <0.001** continuous normal ALL
## 4 Petal.Width 150 <0.001** continuous normal ALL
## -----
## Signif. codes: 0 '**' 0.05 '*' 0.1 ' ' 1
Above, we print the summary of the future table we're creating. Notice that p.value
are computed, if distribution is normal, anova is performed. Otherwise you'll find great explanations on this vignette.
Next, we create our table, with a ALL column, meaning all our observations.
comparegroups.main.table = createTable(
x = comparegroups.main,
show.all = T
)
comparegroups.main.table
##
## --------Summary descriptives table by 'cluster'---------
##
## ______________________________________________________________________
## [ALL] Cluster 1 Cluster 2 Cluster 3 p.overall
## N=150 N=53 N=47 N=50
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Sepal.Length 5.84 (0.83) 5.80 (0.41) 6.78 (0.49) 5.01 (0.35) <0.001
## Sepal.Width 3.06 (0.44) 2.67 (0.25) 3.10 (0.26) 3.43 (0.38) <0.001
## Petal.Length 3.76 (1.77) 4.37 (0.56) 5.51 (0.64) 1.46 (0.17) <0.001
## Petal.Width 1.20 (0.76) 1.41 (0.31) 1.97 (0.33) 0.25 (0.11) <0.001
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Then, we can render this table into markdown format thanks to export2md()
function using knitr. We store the result in comparegroups.html
variable to modify content later.
comparegroups.html = suppressWarnings(
export2md(
x = comparegroups.main.table,
caption = "",
header.labels = c(
"all" = "All",
"p.overall" = "p-value"
)
)
)
comparegroups.html
All | Cluster 1 | Cluster 2 | Cluster 3 | p-value | |
---|---|---|---|---|---|
N=150 | N=53 | N=47 | N=50 | ||
Sepal.Length | 5.84 (0.83) | 5.80 (0.41) | 6.78 (0.49) | 5.01 (0.35) | <0.001 |
Sepal.Width | 3.06 (0.44) | 2.67 (0.25) | 3.10 (0.26) | 3.43 (0.38) | <0.001 |
Petal.Length | 3.76 (1.77) | 4.37 (0.56) | 5.51 (0.64) | 1.46 (0.17) | <0.001 |
Petal.Width | 1.20 (0.76) | 1.41 (0.31) | 1.97 (0.33) | 0.25 (0.11) | <0.001 |
First, we need to make small plots. Plots are images that we need to encode into base64 string. The idea is to store the image inside the HTML thanks to data URI scheme
For example, this HTML code is an html <img>
tag sourcing (src
attribut) a base64 image thanks to URI scheme:
<img src="data:image/png;base64,iVBORw0KGgoAAA
ANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4
//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU
5ErkJggg==" alt="Red dot" />
In your browser this HTML is a red dot:
generatePlot()
asks for the selected cluster and the whole dataset containing PC1
and PC2
. It returns a ggplot2
object with two geom_point
layer. In black, the non selcted clusters. In red, the selected cluster. PC1
as x-axis and PC2
as y-axis.
generatePlot = function(cluster_s, data) {
general_graph_data = data[!cluster %in% cluster_s, list(PC1, PC2)]
cluster_graph_data = data[cluster %in% cluster_s, list(PC1, PC2)]
g = ggplot() +
geom_point(data = general_graph_data, aes(x = PC1, y = PC2), alpha = .3) +
geom_point(data = cluster_graph_data, aes(x = PC1, y = PC2), color = "red") +
theme_void()
return(g)
}
Example:
(p = generatePlot("Cluster 1", iris.x))
We need to save our plot as an image, read it thanks to knitr::image_uri()
function that create data URI scheme.
plotToURI()
asks for the ggplot object to transform and additionnal arguments for ggsave()
function.
plotToURI = function(plot, file = NULL, ...) {
if(is.null(file)) file = paste(tempfile(), "png", sep = ".")
ggsave(filename = file, plot = plot, ...)
uri = knitr::image_uri(file)
return(uri)
}
pURI = plotToURI(p, width = 3, height = 3, units = "cm", dpi = 100)
See what is like.
str(pURI, nchar.max = 80)
## chr "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAHYAAAB2CAIAAACSx"| __truncated__
If we use img
tag with the htmltools
package around that string we observe this image:
tags$img(src = pURI)
Our table is made up of HTML tags. Rows are coded with <tr>
tag and we want to add a new row where i wrote the comment <!-- plot row -->
<table>
<caption></caption>
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:center;"> All </th>
<th style="text-align:center;"> Cluster 1 </th>
<th style="text-align:center;"> Cluster 2 </th>
<th style="text-align:center;"> Cluster 3 </th>
<th style="text-align:center;"> p-value </th>
</tr>
</thead>
<tbody>
<!-- plot row -->
<tr>
<td style="text-align:left;"> </td>
<td style="text-align:center;"> N=150 </td>
<td style="text-align:center;"> N=53 </td>
<td style="text-align:center;"> N=47 </td>
<td style="text-align:center;"> N=50 </td>
<td style="text-align:center;"> </td>
</tr>
...
</tbody>
</table>
So we need to add to the HTML table a block like :
<tr>
<td style="text-align:left;"></td>
<td style="text-align:left;"></td>
<td style="text-align:center;">
<img src="data:image/png;base64,..."/>
</td>
<td style="text-align:center;">
<img src="data:image/png;base64,..."/>
</td>
<td style="text-align:center;">
<img src="data:image/png;base64,..."/>
</td>
<td style="text-align:left;"></td>
</tr>
Let's create this html block by looping on clusters.
# create img tags and td tags around
html_block = lapply(paste("Cluster", 1:3), function(cl) {
p = generatePlot(cl, iris.x)
uri = plotToURI(p, width = 3, height = 3, units = "cm", dpi = 100)
tags$td(style = "text-align:center;", tags$img(src = uri))
})
# fill the block with td tags in empty cells
html_block = tagList(
tags$td(style = "text-align:left;"),
tags$td(style = "text-align:left;"),
html_block,
tags$td(style = "text-align:left;")
)
# we wrap the block in a tr tag
html_block = tags$tr(html_block)
Ok, we did the hardest part. Now, we want to combine the table and the html block. This is where we need xml2
package to parse XML/HTML elements.
We have to create a XML document instance thanks to read_html()
/read_xml()
function in order to manipulate HTML elements.
# xml instance of comparegroups.html
comparegroups.html.parse = read_html(as.character(comparegroups.html))
# xml instance of html_block
plot_row = read_xml(doRenderTags(html_block, indent = F))
Then, we look for the first column after <tbody>
tag because we want to insert our plot_row
before this first column.
# find reference to the first column after tbody
first_tbody_tr = xml_find_first(comparegroups.html.parse, "//tbody//tr")
# add sibling before first_tbody_tr
xml_add_sibling(first_tbody_tr, .value = plot_row, .where = "before")
We update our inital table object by replacing the old HTML table with the newest.
comparegroups.html[1] = as.character(comparegroups.html.parse)
To finish, we use kableExtra
package to generate fashion tables.
comparegroups.html %>%
kable_styling(font_size = 12, bootstrap_options = "responsive") %>%
column_spec(column = 2, bold = T)
All | Cluster 1 | Cluster 2 | Cluster 3 | p-value | |
---|---|---|---|---|---|
|
|
|
|||
N=150 | N=53 | N=47 | N=50 | ||
Sepal.Length | 5.84 (0.83) | 5.80 (0.41) | 6.78 (0.49) | 5.01 (0.35) | <0.001 |
Sepal.Width | 3.06 (0.44) | 2.67 (0.25) | 3.10 (0.26) | 3.43 (0.38) | <0.001 |
Petal.Length | 3.76 (1.77) | 4.37 (0.56) | 5.51 (0.64) | 1.46 (0.17) | <0.001 |
Petal.Width | 1.20 (0.76) | 1.41 (0.31) | 1.97 (0.33) | 0.25 (0.11) | <0.001 |