Question 1
Run the sample code. Set up a new .Rmd file for this exercise. Copy and paste the code below into different code chunks, and then read the text and run the code chunks one at a time to see what they do. You probably won’t understand everything in the code, but this is a good start for seeing some realisticuses of ggplot. We will cover most of these details in the next few weeks.
## 'data.frame': 1763 obs. of 2 variables:
## $ ID : int 1 2 10 11 12 14 16 18 19 23 ...
## $ myVar: num 0.341 1.011 0.988 0.161 0.987 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000149 0.369842 0.740881 0.864342 1.234615 3.580644
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## mean sd
## 0.86434207 0.62897115
## (0.01497975) (0.01059228)
## List of 5
## $ estimate: Named num [1:2] 0.864 0.629
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ sd : Named num [1:2] 0.015 0.0106
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ vcov : num [1:2, 1:2] 0.000224 0 0 0.000112
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "mean" "sd"
## .. ..$ : chr [1:2] "mean" "sd"
## $ n : int 1763
## $ loglik : num -1684
## - attr(*, "class")= chr "fitdistr"
## mean
## 0.8643421
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).
Question 2
Try it with your own data. Once the code is in and runs, try running this analysis on your own data (or data from your lab, or data from a published paper; if you’re stumped, check out publicly available data sets on Dryad, ESA, or the LTER Network). Find a vector of data (of any size), set it up in a .csv file, and read the data into a data frame with this code chunk:
I choose to use data from my master’s degree.
###Question2##
<- read.table("DMC.master.csv",header=TRUE,sep=",")
DMC str(DMC)
## 'data.frame': 96 obs. of 21 variables:
## $ Block : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Plot : int 1 1 1 1 1 1 2 2 2 2 ...
## $ Locality : int 1 1 2 2 3 3 1 1 2 2 ...
## $ Pool : int 1 2 1 2 1 2 1 2 1 2 ...
## $ PoolID : chr "11-11" "11-12" "11-21" "11-22" ...
## $ LL : num 0.5 0.5 0.1 0.5 0.1 0.1 0.1 0.1 0.5 0.1 ...
## $ PoolType : chr "H" "H" "L" "H" ...
## $ LocalType : chr "H" "H" "M" "M" ...
## $ LarvaePlot : chr "Y" "Y" "Y" "Y" ...
## $ LarvaeLocal : chr "Y" "Y" "Y" "Y" ...
## $ LarvaePool : chr "Y" "Y" "N" "Y" ...
## $ Pool.Local.Comp: chr "HHY" "HHY" "LMY" "HMY" ...
## $ Prop : num 0.586 0.414 0.154 0.846 0.618 ...
## $ Sum : int 518 366 100 550 115 71 105 76 328 87 ...
## $ Day.1 : int 3 7 4 0 5 6 8 3 5 7 ...
## $ Day.2 : int 29 37 7 56 8 1 9 21 47 22 ...
## $ Day.3 : int 70 58 6 62 17 17 25 10 50 4 ...
## $ Day.4 : int 72 45 32 143 40 8 11 12 48 17 ...
## $ Day.5 : int 92 76 30 175 12 16 23 2 63 21 ...
## $ Day.6 : int 116 77 18 61 11 14 19 13 38 6 ...
## $ Day.7 : int 136 66 3 53 22 9 10 15 77 10 ...
summary(DMC)
## Block Plot Locality Pool PoolID
## Min. :1.00 Min. :1.0 Min. :1 Min. :1.0 Length:96
## 1st Qu.:2.75 1st Qu.:1.0 1st Qu.:1 1st Qu.:1.0 Class :character
## Median :4.50 Median :1.5 Median :2 Median :1.5 Mode :character
## Mean :4.50 Mean :1.5 Mean :2 Mean :1.5
## 3rd Qu.:6.25 3rd Qu.:2.0 3rd Qu.:3 3rd Qu.:2.0
## Max. :8.00 Max. :2.0 Max. :3 Max. :2.0
##
## LL PoolType LocalType LarvaePlot
## Min. :0.1 Length:96 Length:96 Length:96
## 1st Qu.:0.1 Class :character Class :character Class :character
## Median :0.3 Mode :character Mode :character Mode :character
## Mean :0.3
## 3rd Qu.:0.5
## Max. :0.5
##
## LarvaeLocal LarvaePool Pool.Local.Comp Prop
## Length:96 Length:96 Length:96 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.3701
## Mode :character Mode :character Mode :character Median :0.5000
## Mean :0.5000
## 3rd Qu.:0.6299
## Max. :1.0000
##
## Sum Day.1 Day.2 Day.3
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 106.5 1st Qu.: 5.00 1st Qu.: 9.00 1st Qu.: 12.50
## Median : 255.5 Median : 23.00 Median : 27.00 Median : 34.00
## Mean : 375.4 Mean : 29.83 Mean : 40.91 Mean : 59.45
## 3rd Qu.: 497.8 3rd Qu.: 48.25 3rd Qu.: 55.25 3rd Qu.: 80.00
## Max. :2008.0 Max. :143.00 Max. :181.00 Max. :347.00
## NA's :1
## Day.4 Day.5 Day.6 Day.7
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 13.00 1st Qu.: 12.00 1st Qu.: 9.00 1st Qu.: 3.75
## Median : 31.00 Median : 36.50 Median : 30.50 Median : 21.00
## Mean : 71.09 Mean : 72.59 Mean : 53.16 Mean : 48.94
## 3rd Qu.: 94.50 3rd Qu.:100.00 3rd Qu.: 78.50 3rd Qu.: 53.25
## Max. :427.00 Max. :419.00 Max. :293.00 Max. :597.00
##
#plot
<- ggplot(data=DMC, aes(x=Sum, y=..density..)) +
p1 geom_histogram(color="grey60",fill="cornsilk",size=0.2)
print(p1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#adds curve to plot
<- p1 + geom_density(linetype="dotted",size=0.75)
p1 print(p1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#fitting a normal distribution
<- fitdistr(DMC$Sum,"normal")
normPars print(normPars)
## mean sd
## 375.35417 381.91450
## ( 38.97899) ( 27.56231)
str(normPars)
## List of 5
## $ estimate: Named num [1:2] 375 382
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ sd : Named num [1:2] 39 27.6
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ vcov : num [1:2, 1:2] 1519 0 0 760
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "mean" "sd"
## .. ..$ : chr [1:2] "mean" "sd"
## $ n : int 96
## $ loglik : num -707
## - attr(*, "class")= chr "fitdistr"
$estimate["mean"] # note structure of getting a named attribute normPars
## mean
## 375.3542
#Plot Normal
<- normPars$estimate["mean"]
meanML <- normPars$estimate["sd"]
sdML
<- seq(0,max(DMC$Sum),len=length(DMC$Sum))
xval
<- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(DMC$Sum), args = list(mean = meanML, sd = sdML))
stat + stat p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Plot Exponential Dist
<- fitdistr(DMC$Sum,"exponential")
expoPars <- expoPars$estimate["rate"]
rateML
<- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(DMC$Sum), args = list(rate=rateML))
stat2 + stat + stat2 p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Plot uniform prob. dist
<- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(DMC$Sum), args = list(min=min(DMC$Sum), max=max(DMC$Sum)))
stat3 + stat + stat2 + stat3 p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
########My data is not fitting a gamma distribution, so I didn't run this########
#Plot Gamma prob dist
#gammaPars <- fitdistr(DMC$sum,"gamma")
#shapeML <- gammaPars$estimate["shape"]
#rateML <- gammaPars$estimate["rate"]
#stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(DMC$Sum), args = list(shape=shapeML, rate=rateML))
#p1 + stat + stat2 + stat3 + stat4
########My data is not fitting a beta distribution, so I didn't run this########
#Plot Beta prob dens
# pSpecial <- ggplot(data=DMC, aes(x=Sum/(max(Sum + 0.1)), y=..density..)) +
# geom_histogram(color="grey60",fill="cornsilk",size=0.2) +
# xlim(c(0,1)) +
# geom_density(size=0.75,linetype="dotted")
#
# betaPars <- fitdistr(x=DMC$Sum/max(DMC$Sum + 0.1),start=list(shape1=1,shape2=2),"beta")
# shape1ML <- betaPars$estimate["shape1"]
# shape2ML <- betaPars$estimate["shape2"]
#
# statSpecial <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(DMC$Sum), args = list(shape1=shape1ML,shape2=shape2ML))
# pSpecial + statSpecial
#
Question 3:
Find best-fitting distribution. Take a look at the second-to-last graph which shows the histogram of your data and 4 probability density curves (normal, uniform, exponential, gamma) that are fit to the data. Find the best-fitting distribution for your data. For most data sets, the gamma will probably fit best, but if you data set is small, it may be very hard to see much of a difference between the curves. The beta distribution in the final graph is somewhat special. It often fits the data pretty well, but that is because we have assumed the largest data point is the true upper bound, and everything is scaled to that. The fit of the uniform distribution also fixes the upper bound. The other curves (normal, exponential, and gamma) are more realistic because they do not have an upper bound.
I found that the best distribution for my data was a exponential distribution
Question 4
Simulate a new data set. Using the best-fitting distribution, go back to the code and get the maximum likelihood parameters. Use those to simulate a new data set, with the same length as your original vector, and plot that in a histogram and add the probability density curve. Right below that, generate a fresh histogram plot of the original data, and also include the probability density curve.
#setup fake data
<- rexp(n = 96, rate = 0.002664151)
dmcDUMMY <- data.frame(1:96, dmcDUMMY)
dmcDUMMY names(dmcDUMMY) <- list("ID","sum")
<- fitdistr(dmcDUMMY$sum,"exponential")
expoPars <- expoPars$estimate["rate"]
rateML #close to set rate rateML
## rate
## 0.002408411
<- ggplot(data=dmcDUMMY, aes(x=sum, y=..density..)) +
dummy geom_histogram(color="grey60",fill="cornsilk",size=0.2)
<- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(dmcDUMMY$sum), args = list(rate=rateML))
dummy_exp_line + dummy_exp_line dummy
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How do the two histogram profiles compare? Do you think the model is doing a good job of simulating realistic data that match your original measurements? Why or why not?
The two histograms of data are fairly similar (see graphs below). The model is doing a good job of simulating the data. This is likely because my data was a strong fit to an exponential distribution.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.