- A visual diagram showing the distribution of numerical data.
- It organizes the information more than a simple scatter plot does.
- It uses quantiles rather than the mean and variance. A box plot represents well the characteristics of a distribution that is unknown (not normal), comes from too small a sample, or contains outliers that cannot be removed.
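The point about quantiles versus the mean and variance can be seen with a tiny made-up sample. This is only an illustration with invented numbers, using base R functions:

```r
# Nine ordinary values plus one extreme outlier (made-up numbers).
x <- c(9, 10, 10, 11, 11, 12, 12, 13, 14, 100)

mean(x)     # dragged upward by the single outlier
sd(x)       # inflated for the same reason
median(x)   # stays in the middle of the ordinary values
IQR(x)      # the spread of the central half barely notices the outlier
```

The mean lands far above every ordinary value, while the median and the interquartile range still describe the bulk of the data.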

# Fig. 1 > boxplot(x)

- Median
- 1st Quartile
- 3rd Quartile
- Interquartile Range: IQR
- Minimum and Maximum excluding outliers: Whisker
- Outlier
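The elements listed above can be read off numerically with `boxplot.stats()`, which computes the statistics that the basic `boxplot()` draws by default. A small sketch with made-up numbers:

```r
# Made-up sample with one clear outlier.
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

s <- boxplot.stats(x)

# $stats holds, in order: lower whisker, lower hinge (about the 1st quartile),
# median, upper hinge (about the 3rd quartile), upper whisker.
s$stats

# Interquartile range as the distance between the hinges.
iqr <- s$stats[4] - s$stats[2]

# Points farther than 1.5 * IQR from the box are reported as outliers.
s$out
```

Note that the hinges are close to, but not always identical to, the quartiles returned by `quantile()`.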

A box plot is an excellent chart for judging whether categories differ significantly. It works as well as an analysis of variance (ANOVA) table.
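To put the two views side by side, the following sketch draws a grouped box plot and prints an ANOVA table for the same grouping. It uses the `iris` data set that ships with R as a stand-in; the choice of data is an assumption made for illustration only.

```r
# Species is the category, Sepal.Length the observed value.
b <- boxplot(Sepal.Length ~ Species, data = iris)
b$names   # category names
b$stats   # one column of five box plot statistics per category

# ANOVA table for the same grouping.
fit <- aov(Sepal.Length ~ Species, data = iris)
summary(fit)
```

Boxes that barely overlap in the chart correspond to a small p-value in the table.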

# Fig. 2 > boxplot(formula = y ~ x)

- A double box plot chart (a 2-axes box plot; a box plot correlation diagram) is formed by superimposing horizontal and vertical box plots.
- In other words, the chart handles paired data (x, y) in a single box plot.

However, there is no need to bother with a double box plot for data such as the following.

| x | y |
|---|---|
| 111.1 | 37.5 |
| 107.2 | 32.3 |
| 100.5 | 24.0 |
| 103.9 | 33.0 |
| ... | ... |

There are more suitable analysis methods for data observed as pairs in this way.

In contrast, the box plot diagram comes into play for data such as the following.

x = (111.2 97.2 100.5 94 107.7 ...)

y = (37.6 32.3 24.1 33.1 35.2 ...)

You want to analyze the correlation between two sets of data that were observed separately. For example, you have two series from teens, one of heights only and one of weights only; they are not pairs of height and weight.

You might wonder whether it makes sense to analyze such data. In fact, a double box plot without categories (as in fig. 3) is useless.

A good example is the following:

- Category: the country of the kids (Japan, the USA, Russia, and so on).
- X series: teens' heights measured in each country.
- Y series: teens' weights measured in each country.

Both X and Y are expected to differ significantly between categories, because children's body shape differs by country. However, the averages of each category may not be reliable when the sample sizes are small or the samples are selected with a bias. Moreover, “teens” is too broad a category for body size, so there should be non-negligible noise from individual differences by age.

In any case, when the data are gathered in a laboratory or in a planned survey, such a situation may be a failure of the experimental design. The kids' heights and weights can be re-measured as a paired series.

However, this is not easy for data from wildlife observation or from meteorological and astronomical phenomena. Some phenomena are quite rare, and some data are very costly to collect, so the analysis has to combine data from different studies or different observation times.

In addition, observations of uncontrolled wildlife can contain extreme outliers. Because this is not a laboratory experiment, these outliers should not be excluded from the analysis. It is difficult to analyze such data with a simple ANOVA.

The double box plot chart opens a way to analyze data that come from different origins, have too small a sample size, and contain extreme outliers. Roughly speaking, it is a statistical tool for rediscovering poor-quality data.

To draw double box plot charts, you need to get the source of the boxplotdou function.

http://code.google.com/p/cowares-excel-hello/wiki/boxplotdou_r

> source("boxplotdou.R")

> source("has.overlap.R")

> source("test.boxplotdou.R")

> test1()

...

> str(x)
'data.frame': 100 obs. of 2 variables:
$ factor.x: chr jp jp us us ru ru ...
$ obs.x: num 12.81 9.09 9.03 10.89 9.67 ...

> str(y)
'data.frame': 50 obs. of 2 variables:
$ factor.y: chr us us us ru ru de ...
$ obs.y: num 12.69 9.13 9.11 11.03 9.64 ...

Prepare data frames like the above in the variables x and y. Each holds pairs of a category (factor) and an observed value.
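If you have no data at hand, frames of the same shape can be simulated. The sketch below only mirrors the `str()` output shown above; the numbers are random, and the column order (category first, observed value second) is an assumption read off that output.

```r
set.seed(1)  # only to make the made-up numbers reproducible

# x series: 100 observations across three categories.
x <- data.frame(
  factor.x = sample(c("jp", "us", "ru"), 100, replace = TRUE),
  obs.x    = rnorm(100, mean = 10, sd = 1.5),
  stringsAsFactors = FALSE
)

# y series: 50 observations from a separate survey with its own categories.
y <- data.frame(
  factor.y = sample(c("us", "ru", "de"), 50, replace = TRUE),
  obs.y    = rnorm(50, mean = 10, sd = 1.5),
  stringsAsFactors = FALSE
)

str(x)
str(y)
```

The two frames deliberately differ in length, because the series are observed separately.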

> boxplotdou(x, y)

This draws a double box plot chart.

- Box-and-whiskers are distinguished by color, one per category.
- Short abbreviations are a good choice for the factor values, because they are drawn on the chart.
- The chart shows the relationship between obs.x and obs.y, which were observed separately and are tied together by factor.x and factor.y.
- The calculation relies on the basic boxplot() function and is equivalent to it.
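To make the idea of superposing two box plots concrete, here is a rough hand-drawn approximation in base graphics. It is not the boxplotdou implementation: the data are made up, and the layout choices (whiskers drawn through the medians, one color per category) are assumptions for this sketch.

```r
# Made-up data: two series observed separately, with different sample sizes.
set.seed(2)
x <- data.frame(f = rep(c("a", "b", "c"), each = 20),
                v = rnorm(60, mean = rep(c(10, 13, 16), each = 20)))
y <- data.frame(f = rep(c("a", "b", "c"), each = 10),
                v = rnorm(30, mean = rep(c(30, 34, 38), each = 10)))

# Five box plot statistics per category, computed independently for each axis.
sx <- sapply(split(x$v, x$f), function(v) boxplot.stats(v)$stats)
sy <- sapply(split(y$v, y$f), function(v) boxplot.stats(v)$stats)

# Only categories present in both series can be placed on the plane.
cats <- intersect(colnames(sx), colnames(sy))
cols <- seq_along(cats) + 1

plot(NA, xlim = range(x$v), ylim = range(y$v), xlab = "x", ylab = "y")
for (i in seq_along(cats)) {
  a <- sx[, cats[i]]
  b <- sy[, cats[i]]
  rect(a[2], b[2], a[4], b[4], border = cols[i])    # box: hinges on both axes
  segments(a[1], b[3], a[5], b[3], col = cols[i])   # x whiskers at the y median
  segments(a[3], b[1], a[3], b[5], col = cols[i])   # y whiskers at the x median
  text(a[3], b[3], cats[i], col = cols[i], pos = 3) # category label
}
```

Each category contributes one rectangle whose sides come from two independent one-dimensional box plots, which is why the x and y series do not need to be paired.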

Set TRUE if you need a rectangular frame rather than whiskers to show the minimum and maximum range. It will help to see the overlap of outer ranges.

Set TRUE if you need whiskers at the point of outliers. It will help to enhance outliers visually.

Set FALSE if you do not need category names on the right and top axes.

Set FALSE if you need the result table rather than the chart.

Set TRUE if you need verbose output of the calculation process. Primarily for debugging purposes.

condense = FALSE

condense.severity = "iqr"

condense.once = FALSE

Set **condense** = TRUE if you need to unify near categories into one box-and-whisker.

Near categories do not need to be distinguished, and unifying them enlarges the sample size of the category. It also makes the chart easier to read.

The **condense.severity** argument determines the criterion for “near”.

condense.severity = "iqr"

Categories are considered “near” when the rectangles bounded by their first and third quartiles overlap.

condense.severity = "whisker"

Categories are considered “near” when the rectangles bounded by their whiskers overlap.

condense.severity = "whisker.xory" and condense.severity = "iqr.xory"

Adding “.xory” relaxes the criterion; the rectangles are considered to overlap when either the x-axis or the y-axis ranges overlap.
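The difference between the strict and the “.xory” criteria can be sketched as an interval test on each axis. The helper names below are hypothetical and are not taken from has.overlap.R; treating touching intervals as overlapping is also an assumption.

```r
# Hypothetical helpers for illustration only.

# Two closed intervals c(lo, hi) overlap when each starts before the other ends.
intervals_overlap <- function(a, b) a[1] <= b[2] && b[1] <= a[2]

# A rectangle is list(x = c(lo, hi), y = c(lo, hi)).
# xory = FALSE: overlap is required on both axes (the rectangles intersect).
# xory = TRUE : overlap on either axis is enough (the relaxed ".xory" rule).
rects_near <- function(r1, r2, xory = FALSE) {
  ox <- intervals_overlap(r1$x, r2$x)
  oy <- intervals_overlap(r1$y, r2$y)
  if (xory) ox || oy else ox && oy
}

# Made-up rectangles: they share x values but are separated in y.
r1 <- list(x = c(9, 11),  y = c(30, 33))
r2 <- list(x = c(10, 12), y = c(35, 38))

rects_near(r1, r2)               # FALSE: no overlap in y
rects_near(r1, r2, xory = TRUE)  # TRUE: overlap in x is enough
```

With "iqr" the rectangle sides would be the quartiles, and with "whisker" they would be the whisker ends; the test itself stays the same.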

Set **condense.once** = TRUE if you need to perform the condensing only once.

The default value (FALSE) repeats the condensing until it converges.

Primarily for debugging purposes.
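Why repetition can matter: after near categories are pooled, the pooled box has new quartiles and may now overlap a category it did not overlap before. The following is a minimal sketch of such a loop under the "iqr" criterion. All function names are hypothetical, and the actual boxplotdou code may differ in how it merges and when it stops.

```r
# Hypothetical sketch; none of these functions belong to boxplotdou.

# Interval between the lower and upper hinges (positions 2 and 4 of $stats).
iqr_range <- function(v) boxplot.stats(v)$stats[c(2, 4)]
overlap   <- function(a, b) a[1] <= b[2] && b[1] <= a[2]
is_near   <- function(g1, g2) {
  overlap(iqr_range(g1$x), iqr_range(g2$x)) &&
    overlap(iqr_range(g1$y), iqr_range(g2$y))
}

# One pass: pool every set of categories connected by "near" relations.
# groups is a named list; each element is list(x = numeric, y = numeric).
one_pass <- function(groups) {
  n  <- length(groups)
  id <- seq_len(n)                       # cluster label of each category
  for (i in seq_len(n)) for (j in seq_len(n)) {
    if (i < j && is_near(groups[[i]], groups[[j]])) id[id == id[j]] <- id[i]
  }
  out <- lapply(split(seq_len(n), id), function(k) list(
    x = unlist(lapply(groups[k], `[[`, "x"), use.names = FALSE),
    y = unlist(lapply(groups[k], `[[`, "y"), use.names = FALSE)))
  names(out) <- unname(vapply(split(names(groups), id), paste,
                              character(1), collapse = "+"))
  out
}

# Repeat passes until nothing merges any more, or stop after one pass.
condense_sketch <- function(groups, once = FALSE) {
  repeat {
    merged <- one_pass(groups)
    done   <- length(merged) == length(groups)
    groups <- merged
    if (done || once) return(groups)
  }
}

# Made-up categories: "a" and "b" are near, "c" is far away.
groups <- list(a = list(x = 1:9,     y = 1:9),
               b = list(x = 5:13,    y = 5:13),
               c = list(x = 101:109, y = 101:109))
res <- condense_sketch(groups)
names(res)
```

Here the loop stops when a pass leaves the number of categories unchanged, which is one simple way to read "until it converges".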

When there is no correlation, the box-and-whiskers overlap each other and are arranged randomly.

Because the box-and-whiskers overlap each other, the decision is difficult.

However, when the correlation is strong, their arrangement may show a tendency.

When the category correlates with both x and y, the box-and-whiskers can be separated significantly along both the x-axis and the y-axis.

When there is a correlation between the category and x only, the box-and-whiskers separate only along the x-axis.
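These patterns can be reproduced with simulated data. The sketch below is only an illustration with made-up normal samples: in the first case the category shifts both series, and in the second it shifts x only. The per-category medians stand in for the centers of the boxes.

```r
set.seed(3)
cats <- c("a", "b", "c")

# Case 1: the category shifts both x and y.
x1 <- data.frame(f = rep(cats, each = 30),
                 v = rnorm(90, mean = rep(c(10, 14, 18), each = 30)))
y1 <- data.frame(f = rep(cats, each = 15),
                 v = rnorm(45, mean = rep(c(30, 34, 38), each = 15)))

# Case 2: the category shifts x only; y has the same distribution everywhere.
y2 <- data.frame(f = rep(cats, each = 15),
                 v = rnorm(45, mean = 34))

mx  <- tapply(x1$v, x1$f, median)  # well separated along x
my1 <- tapply(y1$v, y1$f, median)  # well separated along y as well
my2 <- tapply(y2$v, y2$f, median)  # nearly identical: no separation along y

mx
my1
my2
```

Passing `x1` with `y1`, or `x1` with `y2`, to a double box plot would be expected to show the two arrangements described above.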

How can I delete the labels within the boxes?

`boxplotdou(iris[c(5,1)], iris[c(5,2)], factor.labels=F)`

The factor.labels argument has been supported since 19 Oct., so please download the latest version and try the above.

Done, thanks!
