ЯтомизоnoR

R, Statistics

Draw a Double Box Plot Chart (2-Axes Box Plot; Box Plot Correlation Diagram) in R

Box plot chart

Fig. 1. Simple box plot

  • A visual diagram showing the distribution of numerical data.
  • The information is better organized than in a simple scatter plot.
  • It uses quantiles rather than the mean and variance. A box plot therefore represents the data well even when the distribution is unknown (not normal), the sample size is too small, or the outliers cannot be removed.
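
To illustrate why quantiles cope better with outliers, here is a small sketch with made-up numbers (the vectors `clean` and `dirty` are hypothetical, not data from this post). One extreme value moves the mean and variance a lot, while the median and the quartiles stay where they were.

```r
# Hypothetical sample, and the same sample with one extreme value added.
clean <- c(9, 10, 10, 11, 11, 12, 12, 13)
dirty <- c(clean, 100)

# Mean and variance react strongly to the single extreme value.
mean(clean); mean(dirty)
var(clean);  var(dirty)

# Median and quartiles are unchanged for this sample.
median(clean); median(dirty)
quantile(clean, c(0.25, 0.75)); quantile(dirty, c(0.25, 0.75))
```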

# Fig. 1
> boxplot(x)
  1. Median
  2. 1st Quartile
  3. 3rd Quartile
  4. Interquartile Range: IQR
  5. Minimum and Maximum excluding outliers: Whisker
  6. Outlier
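
The numbers behind these six elements can be inspected with `boxplot.stats()` from base R, which is what `boxplot()` uses for its calculation. The sample vector below is hypothetical. Note that R draws the box at the hinges, which are close to, but not always identical with, the 1st and 3rd quartiles.

```r
# Hypothetical sample: nine ordinary values plus one extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# boxplot.stats() returns the numbers that boxplot() draws.
s <- boxplot.stats(x)

lower.whisker <- s$stats[1]  # minimum excluding outliers
lower.hinge   <- s$stats[2]  # approximately the 1st quartile
med           <- s$stats[3]  # median
upper.hinge   <- s$stats[4]  # approximately the 3rd quartile
upper.whisker <- s$stats[5]  # maximum excluding outliers
box.length    <- upper.hinge - lower.hinge  # range covered by the box
outliers      <- s$out       # points farther than 1.5 box lengths from the box

print(s$stats)
print(outliers)
```
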
Fig. 2. Box plot with category (factor)

A box plot is an excellent chart for judging whether the categories differ significantly. It can work as well as an analysis of variance (ANOVA) table.

# Fig. 2
> boxplot(formula = y ~ x)
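
As a rough illustration of that comparison, the sketch below uses the built-in `iris` data set (my choice of example data, not the data behind Fig. 2). It computes the per-category box plot numbers without drawing and then fits a one-way ANOVA on the same formula so the two can be read side by side.

```r
# Per-category box plot statistics, without drawing the chart.
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
print(b$names)  # one column of b$stats per category
print(b$stats)  # rows: whisker, hinge, median, hinge, whisker

# One-way ANOVA on the same formula, for comparison.
fit <- aov(Sepal.Length ~ Species, data = iris)
print(summary(fit))
```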

Double box plot

Fig. 3. Simple double box plot without category

  • A double box plot chart (a 2-axes box plot; a box plot correlation diagram) is formed by superposing horizontal and vertical box plot diagrams.
  • That means the chart handles paired data (x, y) in a box plot.

However, there is no need to bother with a double box plot for data such as the following.

x y
111.1 37.5
107.2 32.3
100.5 24.0
103.9 33.0
...

There are more suitable analysis methods for data observed as pairs in this way.
In contrast, the double box plot comes into play for data such as the following.

x = (111.2 97.2 100.5 94 107.7 ...)
y = (37.6 32.3 24.1 33.1 35.2 ...)

Here you want to analyze the correlation between two sets of data that have been observed separately. For example, you have two sequences, heights and weights of teens, that are not pairs of height and weight but separate series of heights only and weights only.

You might wonder whether it makes sense to analyze such data. In fact, a double box plot without categories (as in Fig. 3) is useless.
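
For reference, a chart in the spirit of Fig. 3 can be approximated with base graphics by superposing the box plot numbers of x and y. This is only a minimal sketch under my own assumptions about the layout, not the drawing code of boxplotdou. It uses just the five values of each series that are shown above and leaves the elided ones out.

```r
# The values shown in the text; the elided values ("...") are left out.
x <- c(111.2, 97.2, 100.5, 94, 107.7)
y <- c(37.6, 32.3, 24.1, 33.1, 35.2)

# Five numbers per axis: whisker, hinge, median, hinge, whisker.
sx <- boxplot.stats(x)$stats
sy <- boxplot.stats(y)$stats

# Empty plot region, then one box with whiskers in both directions.
plot(NA, xlim = range(x), ylim = range(y), xlab = "x", ylab = "y")
rect(sx[2], sy[2], sx[4], sy[4])      # box spans the hinges on both axes
segments(sx[1], sy[3], sx[5], sy[3])  # horizontal whisker at the median of y
segments(sx[3], sy[1], sx[3], sy[5])  # vertical whisker at the median of x
points(sx[3], sy[3], pch = 3)         # cross at the two medians
```

With a single box there is nothing to compare it against, which is why the uncategorized chart tells you so little.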

Fig. 4. Double box plot with category

A good example is the following:

Category: The country of the kids: Japan, the USA, Russia, and so on.
X series: Teens' heights measured in each country.
Y series: Teens' weights measured in each country.

Both X and Y are expected to differ significantly between the categories, because kids' body shapes differ by country. However, the averages of each category may not be reliable when the sample sizes are small or the samples are selected with a bias. Moreover, “teens” is too broad a category for body size, so there should be non-negligible noise from individual differences by age.

In any case, such a situation may be a failure of experimental design when the data was gathered in a laboratory or in a planned survey. The kids' heights and weights can be re-measured as a paired sequence.

However, re-measuring is not easy for data from wildlife observation or from meteorological and astronomical phenomena. Analysis that combines data from different studies or different observation times is required, because some phenomena are quite rare and some data is very costly to collect.

In addition, observations of uncontrolled wildlife can contain extreme outliers. Because this is not a laboratory experiment, these outliers should not be excluded from the analysis. It is difficult to analyze such data with a simple ANOVA.

The double box plot chart opens a way to analyze data that comes from different origins, has too small a sample size, and contains extreme outliers. Roughly speaking, it is a statistical tool for rediscovering poor-quality data.

To draw double box plot charts, you need the source of the boxplotdou function.

How to use boxplotdou()

Download

http://code.google.com/p/cowares-excel-hello/wiki/boxplotdou_r

Sources

> source("boxplotdou.R")
> source("has.overlap.R")

Test

> source("test.boxplotdou.R")
> test1()
...

Quick Start

> str(x)
'data.frame': 100 obs. of 2 variables:
 $ factor.x: chr "jp" "jp" "us" "us" "ru" "ru" ...
 $ obs.x   : num 12.81 9.09 9.03 10.89 9.67 ...
> str(y)
'data.frame': 50 obs. of 2 variables:
 $ factor.y: chr "us" "us" "us" "ru" "ru" "de" ...
 $ obs.y   : num 12.69 9.13 9.11 11.03 9.64 ...

Prepare data frames like the ones above in the variables x and y. Each holds pairs of a category (factor) and an observed value.
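
If you have no data at hand, data frames of this shape can be generated as follows. This is just a sketch: the category codes, the random values, and the seed are made up, and only the column layout (category first, observation second) follows the str() output above.

```r
set.seed(1)  # arbitrary seed, only to make the example repeatable
codes <- c("jp", "us", "ru", "de")

# x: 100 observations, category in the first column, value in the second.
x <- data.frame(
  factor.x = sample(codes, 100, replace = TRUE),
  obs.x    = rnorm(100, mean = 10, sd = 1.5),
  stringsAsFactors = FALSE
)

# y: 50 observations gathered separately, with the same category codes.
y <- data.frame(
  factor.y = sample(codes, 50, replace = TRUE),
  obs.y    = rnorm(50, mean = 10, sd = 1.5),
  stringsAsFactors = FALSE
)

str(x)
str(y)
```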

> boxplotdou(x, y)

This draws a double box plot chart.

  • Box-and-whiskers are distinguished by color for each category.
  • Short abbreviations work well for the factor values, because they are drawn on the chart.
  • The chart shows the relationship between obs.x and obs.y, which were observed separately and are linked through factor.x and factor.y.
  • The calculation relies on the basic boxplot() function and is equivalent to it.
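
The last two points can be sketched with plain boxplot() calls. The code below is my own illustration of the idea, not an excerpt from boxplotdou.R: it computes the box plot numbers for x and y separately and then lines them up by category name. The data and the names `box.x`, `box.y`, `stats.x`, and `stats.y` are hypothetical.

```r
set.seed(2)  # arbitrary seed for a repeatable example
x <- data.frame(factor.x = rep(c("jp", "us", "ru"), each = 20),
                obs.x = rnorm(60, mean = 10), stringsAsFactors = FALSE)
y <- data.frame(factor.y = rep(c("jp", "us", "ru"), each = 10),
                obs.y = rnorm(30, mean = 10), stringsAsFactors = FALSE)

# Box plot numbers per category, computed separately for each series.
box.x <- boxplot(obs.x ~ factor.x, data = x, plot = FALSE)
box.y <- boxplot(obs.y ~ factor.y, data = y, plot = FALSE)

# The category name is the only link between the two series.
common  <- intersect(box.x$names, box.y$names)
stats.x <- box.x$stats[, match(common, box.x$names), drop = FALSE]
stats.y <- box.y$stats[, match(common, box.y$names), drop = FALSE]
colnames(stats.x) <- colnames(stats.y) <- common

print(stats.x)  # one column per category: horizontal box-and-whisker
print(stats.y)  # one column per category: vertical box-and-whisker
```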

Drawing options

boxed.whiskers = FALSE

Set to TRUE if you need a rectangular frame rather than whiskers to show the minimum and maximum range. It helps to see the overlap of the outer ranges.

outliers.has.whiskers = FALSE

Set to TRUE if you need whiskers at the outlier points. It helps to emphasize outliers visually.

name.on.axis = TRUE

Set to FALSE if you do not need category names on the right and top axes.

plot = TRUE

Set to FALSE if you need the result table rather than the chart.

verbose = FALSE

Set to TRUE if you need detailed output of the calculation process. Primarily for debugging purposes.
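
A call that combines some of these options might look like the sketch below. It uses the built-in `iris` data set as example input (category in the first column, observation in the second). The calls are guarded with exists() so that the snippet does nothing harmful when boxplotdou.R has not been sourced. The structure of the value returned with plot = FALSE is not described in this post, so the sketch only prints it with str().

```r
# Example input: Species as the category, two measurements as x and y.
x <- iris[c(5, 1)]  # Species, Sepal.Length
y <- iris[c(5, 2)]  # Species, Sepal.Width

# Only run when boxplotdou() is available (after source("boxplotdou.R")).
if (exists("boxplotdou")) {
  # Rectangular frames for the outer range, whiskers on the outliers.
  boxplotdou(x, y, boxed.whiskers = TRUE, outliers.has.whiskers = TRUE)

  # Result table instead of a chart.
  result <- boxplotdou(x, y, plot = FALSE)
  str(result)
}
```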

Condensing categories

condense = FALSE
condense.severity = "iqr"
condense.once = FALSE

Set condense = TRUE if you need to unify near categories into one box-and-whisker.
Near categories do not need to be distinguished, and unifying them enlarges the sample size of the category. It also makes the chart easier to read.

The condense.severity argument determines the criterion for “near”.

condense.severity = "iqr"
Categories are considered “near” when the rectangles bounded by their first and third quartiles overlap.

condense.severity = "whisker"
Categories are considered “near” when the rectangles bounded by their whiskers overlap.

condense.severity = "whisker.xory" and condense.severity = "iqr.xory"
Adding ".xory" relaxes the criterion: the rectangles are considered to overlap when they overlap on either the x-axis or the y-axis.
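
The difference between the strict and the ".xory" criteria can be sketched as a rectangle overlap test. The helper functions `ranges.overlap` and `boxes.near` below are hypothetical names of my own and are not taken from has.overlap.R. Each box is described by its range on x and its range on y, which would be the quartiles for "iqr" or the whisker ends for "whisker".

```r
# TRUE when the closed intervals a = c(low, high) and b = c(low, high)
# share at least one point.
ranges.overlap <- function(a, b) a[1] <= b[2] && b[1] <= a[2]

# p and q are lists with elements x = c(low, high) and y = c(low, high).
# xory = FALSE: the rectangles must overlap on both axes.
# xory = TRUE:  overlap on either axis is enough.
boxes.near <- function(p, q, xory = FALSE) {
  on.x <- ranges.overlap(p$x, q$x)
  on.y <- ranges.overlap(p$y, q$y)
  if (xory) on.x || on.y else on.x && on.y
}

a <- list(x = c(1, 3), y = c(1, 3))
b <- list(x = c(2, 4), y = c(5, 7))  # overlaps a on the x-axis only

boxes.near(a, b)               # FALSE: no overlap on the y-axis
boxes.near(a, b, xory = TRUE)  # TRUE: overlap on the x-axis is enough
```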

Set condense.once = TRUE if you need to perform the condensing only once.
The default value (FALSE) repeats the condensing until it converges.
This option is primarily for debugging purposes.
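
Why repetition matters can be seen in a simplified, one-axis sketch. This is an assumption-laden illustration, not the algorithm in boxplotdou.R: the functions `condense.pass` and `condense.sketch` are hypothetical, they look at one series only, and one pass merges just the first overlapping pair. After a merge the hinges of the merged group change, so a group that did not overlap before may overlap afterwards.

```r
# Hinge range (approximately 1st and 3rd quartile) of one group.
hinge.range <- function(v) boxplot.stats(v)$stats[c(2, 4)]
overlap <- function(a, b) a[1] <= b[2] && b[1] <= a[2]

# One pass: merge the first pair of groups whose hinge ranges overlap.
condense.pass <- function(groups) {
  n <- length(groups)
  if (n < 2) return(groups)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (overlap(hinge.range(groups[[i]]), hinge.range(groups[[j]]))) {
        merged <- list(c(groups[[i]], groups[[j]]))
        names(merged) <- paste(names(groups)[c(i, j)], collapse = "+")
        return(c(groups[-c(i, j)], merged))
      }
    }
  }
  groups  # nothing left to merge
}

# Repeat passes until the number of groups stops shrinking,
# or stop after the first pass when once = TRUE.
condense.sketch <- function(groups, once = FALSE) {
  repeat {
    condensed <- condense.pass(groups)
    if (once || length(condensed) == length(groups)) return(condensed)
    groups <- condensed
  }
}

# Hypothetical groups: a and b overlap, d overlaps only after a and b merge.
groups <- list(a = 1:5, b = 3:7, d = 4:8, c = 20:24)
names(condense.sketch(groups, once = TRUE))
names(condense.sketch(groups))
```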

Looking over the double box plot using the test functions

When there is no correlation between x and y

When there is no correlation, the box-and-whiskers overlap each other and are arranged randomly.

Fig. 5. The results of test2

Fig. 6. The results of test2 with condense = T

When there is a strong correlation between x and y, and no difference between the categories

The box-and-whiskers overlap each other, so it is difficult to judge.
However, when the correlation is strong, their arrangement may show a trend.

Fig. 7. The results of test3

Fig. 8. The results of test3 with condense = T

When there is a strong correlation between x and y, and a difference between several categories

When the category, x, and y are all correlated, the positions of the box-and-whiskers can be clearly separated along both the x-axis and the y-axis.

Fig. 9. The results of test4

Fig. 10. The results of test4 with condense = T

Fig. 11. The results of test4 with condense = T, condense.severity = "iqr.xory"

When there is no correlation between x and y, and a difference between several categories

When there is a correlation between the category and x only, the box-and-whiskers separate only along the x-axis.

Fig. 12. The results of test5

Fig. 13. The results of test5 with condense = T

Fig. 14. The results of test5 with condense = T, condense.severity = "iqr.xory"
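
The test functions themselves are not reproduced in this post. To experiment with similar situations, data can be simulated along the following lines. Everything here is a hypothetical sketch: the generator `make.series`, the category names, and the centres are my own choices and are not what test2 to test5 actually use.

```r
set.seed(42)  # arbitrary seed for a repeatable example

# Hypothetical generator: every category gets its own centre, and the
# observations scatter around it. `name` is "x" or "y".
make.series <- function(centres, n.per.category, name) {
  f   <- rep(names(centres), each = n.per.category)
  obs <- rnorm(length(f), mean = centres[f], sd = 1)
  out <- data.frame(f, obs, stringsAsFactors = FALSE)
  names(out) <- paste(c("factor", "obs"), name, sep = ".")
  out
}

centre.x <- c(a = 0, b = 5, c = 10, d = 15)

# Category shifts both x and y: boxes separate along both axes.
x <- make.series(centre.x, 30, "x")
y <- make.series(2 * centre.x, 20, "y")

# Category shifts x only: boxes separate along the x-axis only.
y.flat <- make.series(c(a = 0, b = 0, c = 0, d = 0), 20, "y")

# Per-category medians, matched by category name.
med.x <- sapply(split(x$obs.x, x$factor.x), median)
med.y <- sapply(split(y$obs.y, y$factor.y), median)
print(cor(med.x, med.y))

# Draw the charts when boxplotdou() has been sourced.
if (exists("boxplotdou")) {
  boxplotdou(x, y)
  boxplotdou(x, y.flat)
}
```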

8 comments on “Draw a Double Box Plot Chart (2-Axes Box Plot; Box Plot Correlation Diagram) in R”

  2. Lorenzo Monasta
    October 23, 2013

    How can I delete the labels within the boxes?

    • tomizono
      October 24, 2013

      boxplotdou(iris[c(5,1)], iris[c(5,2)], factor.labels=F)

      The factor.labels argument has been supported since 19 Oct. Please download the latest version and try the above.

    • lorenzo
      October 25, 2013

      Done, thanks!

Information

This entry was posted on March 15, 2013.
