ЯтомизоnoR

R, Statistics

Distribution, Density and Histogram

A histogram of samples is expected to converge to the density curve of the population, when the sample number is large.

Fig. 1. Probability density functions and Hisotgrams of random samples

Fig. 1. Probability density functions and Hisotgrams of random samples

Fig. 2. Probability density functions and Hisotgrams of random samples with large sample number

Fig. 2. Probability density functions and Hisotgrams of random samples with large sample number

Above all charts are based on known distributions as shown below formulas.

# (a) flat
r1 <- function(n) runif(n, 1, 9)
d1 <- function(x) dunif(x, 1, 9)

# (b) normal
r2 <- function(n) rnorm(n, 5)
d2 <- function(x) dnorm(x, 5)

# (c) skewed
r3 <- function(n) rnorm(n, 5) ^ 2
d3 <- function(x) dnorm(sqrt(x), 5) * 0.1

# (d) twin
r4 <- function(n) ifelse(runif(n) < 0.5, rnorm(n, 5), rnorm(n, 9, 2))
d4 <- function(x) (dnorm(x, 5) + dnorm(x, 9, 2)) * 0.5

And charts are drawn with this.

plot.rd <- function(R=r1, D=d1, n=100, min=0, max=10, yrat=1.0, text='') {
  data <- R(n)
  h <- hist(data, plot=F)
  xlim <- c(min, max)
  ylim <- c(0, yrat * max(h$density))
  h <- hist(data, freq=F, xlim=xlim, ylim=ylim, xlab='', main='', col='aquamarine')
  par(new=T)
  curve(D, min, max, xlim=xlim, ylim=ylim, ylab='', main=paste('n =', n), col='salmon', lwd=3)
  mtext(text, side=3, at=0, col='royalblue')
  invisible(list(data=data, xlim=xlim, ylim=ylim))
}

Fig.1 <- function(n=100) {
  parkeeper <- par(no.readonly=T)
  par(mfrow=c(2,2), mar=c(4,4,1,1))

  plot.rd(R=r1, D=d1, n=n, min=0, max=10, yrat=1.0 ,text='(a) flat')
  plot.rd(R=r2, D=d2, n=n, min=0, max=10, yrat=1.0 ,text='(b) normal')
  plot.rd(R=r3, D=d3, n=n, min=0, max=70, yrat=1.0 ,text='(c) skewed')
  plot.rd(R=r4, D=d4, n=n, min=0, max=16, yrat=1.3 ,text='(d) twin')

  par(parkeeper)
}

Fig.2 <- function() Fig.1(n=10000)

But usually I want to know the curve of unknown distribution. If (and only if) the distribution resembles to the normal distribution, I can use the mean and the standard deviation to calculate the curve as a normal distribution.

Fig. 3. Guessed density curves as a normal distribution

Fig. 3. Guessed density curves as a normal distribution

This method does not fit to flat, skewed, or two peak distribution.

plot.guess.as.normal <- function(bag) {
  m <- mean(bag$data)
  s <- sd(bag$data)
  D <- function(x) dnorm(x, mean=m, sd=s)
  par(new=T)
  curve(D, bag$xlim[1], bag$xlim[2], xlim=bag$xlim, ylim=bag$ylim, xlab='', ylab='', main='', col='steelblue', lwd=3)
  mtext('normal', side=3, padj=1, col='steelblue')
  mtext(sprintf('%s%5.1f', 'mean', m), side=3, padj=2, col='steelblue')
  mtext(sprintf('%s%5.1f', ' sd ', s), side=3, padj=3, col='steelblue')
}

Fig.3 <- function(n=100) {
  parkeeper <- par(no.readonly=T)
  par(mfrow=c(2,2), mar=c(4,4,1,1))

  bag <- plot.rd(R=r1, D=d1, n=n, min=0, max=10, yrat=1.3 ,text='(a) flat')
  plot.guess.as.normal(bag)
  bag <- plot.rd(R=r2, D=d2, n=n, min=0, max=10, yrat=1.3 ,text='(b) normal')
  plot.guess.as.normal(bag)
  bag <- plot.rd(R=r3, D=d3, n=n, min=0, max=70, yrat=1.3 ,text='(c) skewed')
  plot.guess.as.normal(bag)
  bag <- plot.rd(R=r4, D=d4, n=n, min=0, max=16, yrat=1.4 ,text='(d) twin')
  plot.guess.as.normal(bag)

  par(parkeeper)
}

Fortunately the R has a function density() to solve this.

Fig. 4. Guessed density curves by the density() function

Fig. 4. Guessed density curves by the density() function

It works nicely.

plot.guess.by.density <- function(bag, bw='SJ') {
  m <- mean(bag$data)
  s <- sd(bag$data)
  d <- density(bag$data, bw=bw)
  par(new=T)
  plot(d, xlim=bag$xlim, ylim=bag$ylim, xlab='', ylab='', main='', col='steelblue', lwd=3)
  mtext("density()", side=3, padj=2, col='steelblue')
  mtext(sprintf('%s%6.2f', 'bandwidth', d$bw), side=3, padj=3, col='steelblue')
}

Fig.4 <- function(n=100) {
  parkeeper <- par(no.readonly=T)
  par(mfrow=c(2,2), mar=c(4,4,1,1))

  bag <- plot.rd(R=r1, D=d1, n=n, min=0, max=10, yrat=1.3 ,text='(a) flat')
  plot.guess.by.density(bag)
  bag <- plot.rd(R=r2, D=d2, n=n, min=0, max=10, yrat=1.3 ,text='(b) normal')
  plot.guess.by.density(bag)
  bag <- plot.rd(R=r3, D=d3, n=n, min=0, max=70, yrat=1.3 ,text='(c) skewed')
  plot.guess.by.density(bag)
  bag <- plot.rd(R=r4, D=d4, n=n, min=0, max=16, yrat=1.4 ,text='(d) twin')
  plot.guess.by.density(bag)

  par(parkeeper)
}

The bandwidth is the key of the density function. As a default, it uses a result of bw.nrd0() function. I chose bw=’SJ’ to use a result of bw.SJ() function, that gave an optimized value by Sheather & Jones (1991). Results are deeply governed by the choice of bandwidth.

Fig. 5. Various bandwidth

Fig. 5. Various bandwidth

The fitness is improved as the sample number increases.

Fig. 6. Various bandwidth with large sample number

Fig. 6. Various bandwidth with large sample number

plot.guess.by.density.bw <- function(bag, bw, col, offset) {
  m <- mean(bag$data)
  s <- sd(bag$data)
  d <- density(bag$data, bw=bw)
  par(new=T)
  plot(d, xlim=bag$xlim, ylim=bag$ylim, xlab='', ylab='', main='', col=col, lwd=1, lty='solid')
  mtext(sprintf('%6.2f', d$bw), side=3, at=0, padj=offset, col=col)
}

Fig.5 <- function(n=100, R=r4, D=d4) {
  bag <- plot.rd(R=R, D=D, n=n, min=0, max=16, yrat=1.5 ,text='')
  bw <- bw.SJ(bag$data)
  col <- rainbow(7)
  plot.guess.by.density.bw(bag, bw / 4, col[1], 1)
  plot.guess.by.density.bw(bag, bw / 2, col[2], 2)
  plot.guess.by.density.bw(bag, bw,     col[5], 3)
  plot.guess.by.density.bw(bag, bw * 2, col[6], 4)
  plot.guess.by.density.bw(bag, bw * 4, col[7], 5)
}

Fig.6 <- function() Fig.5(1000)
About these ads

One comment on “Distribution, Density and Histogram

  1. Pingback: Distribution, Density and Histogram | R for Jou...

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Information

This entry was posted on May 11, 2013 by and tagged , , , , , , .
The stupidest thing...

Statistics, genetics, programming, academics

ЯтомизоnoR

R, Statistics

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: