Question 1

Run the sample code. Set up a new .Rmd file for this exercise. Copy and paste the code below into different code chunks, and then read the text and run the code chunks one at a time to see what they do. You probably won’t understand everything in the code, but this is a good start for seeing some realisticuses of ggplot. We will cover most of these details in the next few weeks.

## 'data.frame':    1763 obs. of  2 variables:
##  $ ID   : int  1 2 10 11 12 14 16 18 19 23 ...
##  $ myVar: num  0.341 1.011 0.988 0.161 0.987 ...
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000149 0.369842 0.740881 0.864342 1.234615 3.580644
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##       mean          sd    
##   0.86434207   0.62897115 
##  (0.01497975) (0.01059228)
## List of 5
##  $ estimate: Named num [1:2] 0.864 0.629
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.015 0.0106
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 0.000224 0 0 0.000112
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 1763
##  $ loglik  : num -1684
##  - attr(*, "class")= chr "fitdistr"
##      mean 
## 0.8643421
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).

Question 2

Try it with your own data. Once the code is in and runs, try running this analysis on your own data (or data from your lab, or data from a published paper; if you’re stumped, check out publicly available data sets on Dryad, ESA, or the LTER Network). Find a vector of data (of any size), set it up in a .csv file, and read the data into a data frame with this code chunk:

I choose to use data from my master’s degree.

###Question2##
DMC <- read.table("DMC.master.csv",header=TRUE,sep=",")
str(DMC)
## 'data.frame':    96 obs. of  21 variables:
##  $ Block          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Plot           : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ Locality       : int  1 1 2 2 3 3 1 1 2 2 ...
##  $ Pool           : int  1 2 1 2 1 2 1 2 1 2 ...
##  $ PoolID         : chr  "11-11" "11-12" "11-21" "11-22" ...
##  $ LL             : num  0.5 0.5 0.1 0.5 0.1 0.1 0.1 0.1 0.5 0.1 ...
##  $ PoolType       : chr  "H" "H" "L" "H" ...
##  $ LocalType      : chr  "H" "H" "M" "M" ...
##  $ LarvaePlot     : chr  "Y" "Y" "Y" "Y" ...
##  $ LarvaeLocal    : chr  "Y" "Y" "Y" "Y" ...
##  $ LarvaePool     : chr  "Y" "Y" "N" "Y" ...
##  $ Pool.Local.Comp: chr  "HHY" "HHY" "LMY" "HMY" ...
##  $ Prop           : num  0.586 0.414 0.154 0.846 0.618 ...
##  $ Sum            : int  518 366 100 550 115 71 105 76 328 87 ...
##  $ Day.1          : int  3 7 4 0 5 6 8 3 5 7 ...
##  $ Day.2          : int  29 37 7 56 8 1 9 21 47 22 ...
##  $ Day.3          : int  70 58 6 62 17 17 25 10 50 4 ...
##  $ Day.4          : int  72 45 32 143 40 8 11 12 48 17 ...
##  $ Day.5          : int  92 76 30 175 12 16 23 2 63 21 ...
##  $ Day.6          : int  116 77 18 61 11 14 19 13 38 6 ...
##  $ Day.7          : int  136 66 3 53 22 9 10 15 77 10 ...
summary(DMC)
##      Block           Plot        Locality      Pool        PoolID         
##  Min.   :1.00   Min.   :1.0   Min.   :1   Min.   :1.0   Length:96         
##  1st Qu.:2.75   1st Qu.:1.0   1st Qu.:1   1st Qu.:1.0   Class :character  
##  Median :4.50   Median :1.5   Median :2   Median :1.5   Mode  :character  
##  Mean   :4.50   Mean   :1.5   Mean   :2   Mean   :1.5                     
##  3rd Qu.:6.25   3rd Qu.:2.0   3rd Qu.:3   3rd Qu.:2.0                     
##  Max.   :8.00   Max.   :2.0   Max.   :3   Max.   :2.0                     
##                                                                           
##        LL        PoolType          LocalType          LarvaePlot       
##  Min.   :0.1   Length:96          Length:96          Length:96         
##  1st Qu.:0.1   Class :character   Class :character   Class :character  
##  Median :0.3   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.3                                                           
##  3rd Qu.:0.5                                                           
##  Max.   :0.5                                                           
##                                                                        
##  LarvaeLocal         LarvaePool        Pool.Local.Comp         Prop       
##  Length:96          Length:96          Length:96          Min.   :0.0000  
##  Class :character   Class :character   Class :character   1st Qu.:0.3701  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.5000  
##                                                           Mean   :0.5000  
##                                                           3rd Qu.:0.6299  
##                                                           Max.   :1.0000  
##                                                                           
##       Sum             Day.1            Day.2            Day.3       
##  Min.   :   0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 106.5   1st Qu.:  5.00   1st Qu.:  9.00   1st Qu.: 12.50  
##  Median : 255.5   Median : 23.00   Median : 27.00   Median : 34.00  
##  Mean   : 375.4   Mean   : 29.83   Mean   : 40.91   Mean   : 59.45  
##  3rd Qu.: 497.8   3rd Qu.: 48.25   3rd Qu.: 55.25   3rd Qu.: 80.00  
##  Max.   :2008.0   Max.   :143.00   Max.   :181.00   Max.   :347.00  
##                                                     NA's   :1       
##      Day.4            Day.5            Day.6            Day.7       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 13.00   1st Qu.: 12.00   1st Qu.:  9.00   1st Qu.:  3.75  
##  Median : 31.00   Median : 36.50   Median : 30.50   Median : 21.00  
##  Mean   : 71.09   Mean   : 72.59   Mean   : 53.16   Mean   : 48.94  
##  3rd Qu.: 94.50   3rd Qu.:100.00   3rd Qu.: 78.50   3rd Qu.: 53.25  
##  Max.   :427.00   Max.   :419.00   Max.   :293.00   Max.   :597.00  
## 
#plot
p1 <- ggplot(data=DMC, aes(x=Sum, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) 
print(p1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#adds curve to plot
p1 <-  p1 +  geom_density(linetype="dotted",size=0.75)
print(p1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#fitting a normal distribution
normPars <- fitdistr(DMC$Sum,"normal")
print(normPars)
##      mean         sd    
##   375.35417   381.91450 
##  ( 38.97899) ( 27.56231)
str(normPars)
## List of 5
##  $ estimate: Named num [1:2] 375 382
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 39 27.6
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 1519 0 0 760
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 96
##  $ loglik  : num -707
##  - attr(*, "class")= chr "fitdistr"
normPars$estimate["mean"] # note structure of getting a named attribute
##     mean 
## 375.3542
#Plot Normal 
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(DMC$Sum),len=length(DMC$Sum))

stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(DMC$Sum), args = list(mean = meanML, sd = sdML))
p1 + stat
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Plot Exponential Dist
expoPars <- fitdistr(DMC$Sum,"exponential")
rateML <- expoPars$estimate["rate"]

stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(DMC$Sum), args = list(rate=rateML))
p1 + stat + stat2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Plot uniform prob. dist
stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(DMC$Sum), args = list(min=min(DMC$Sum), max=max(DMC$Sum)))
p1 + stat + stat2 + stat3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

########My data is not fitting a gamma distribution, so I didn't run this########
#Plot Gamma prob dist
#gammaPars <- fitdistr(DMC$sum,"gamma")
#shapeML <- gammaPars$estimate["shape"]
#rateML <- gammaPars$estimate["rate"]


#stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(DMC$Sum), args = list(shape=shapeML, rate=rateML))
#p1 + stat + stat2 + stat3 + stat4

########My data is not fitting a beta distribution, so I didn't run this########
#Plot Beta prob dens
# pSpecial <- ggplot(data=DMC, aes(x=Sum/(max(Sum + 0.1)), y=..density..)) +
#   geom_histogram(color="grey60",fill="cornsilk",size=0.2) + 
#   xlim(c(0,1)) +
#   geom_density(size=0.75,linetype="dotted")
# 
# betaPars <- fitdistr(x=DMC$Sum/max(DMC$Sum + 0.1),start=list(shape1=1,shape2=2),"beta")
# shape1ML <- betaPars$estimate["shape1"]
# shape2ML <- betaPars$estimate["shape2"]
# 
# statSpecial <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(DMC$Sum), args = list(shape1=shape1ML,shape2=shape2ML))
# pSpecial + statSpecial
#

Question 3:

Find best-fitting distribution. Take a look at the second-to-last graph which shows the histogram of your data and 4 probability density curves (normal, uniform, exponential, gamma) that are fit to the data. Find the best-fitting distribution for your data. For most data sets, the gamma will probably fit best, but if you data set is small, it may be very hard to see much of a difference between the curves. The beta distribution in the final graph is somewhat special. It often fits the data pretty well, but that is because we have assumed the largest data point is the true upper bound, and everything is scaled to that. The fit of the uniform distribution also fixes the upper bound. The other curves (normal, exponential, and gamma) are more realistic because they do not have an upper bound.

I found that the best distribution for my data was a exponential distribution

Question 4

Simulate a new data set. Using the best-fitting distribution, go back to the code and get the maximum likelihood parameters. Use those to simulate a new data set, with the same length as your original vector, and plot that in a histogram and add the probability density curve. Right below that, generate a fresh histogram plot of the original data, and also include the probability density curve.

#setup fake data
dmcDUMMY <- rexp(n = 96, rate = 0.002664151)
dmcDUMMY <- data.frame(1:96, dmcDUMMY)
names(dmcDUMMY) <- list("ID","sum")

expoPars <- fitdistr(dmcDUMMY$sum,"exponential")
rateML <- expoPars$estimate["rate"]
rateML #close to set rate
##        rate 
## 0.002408411
dummy <- ggplot(data=dmcDUMMY, aes(x=sum, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) 

dummy_exp_line <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(dmcDUMMY$sum), args = list(rate=rateML))
dummy + dummy_exp_line
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How do the two histogram profiles compare? Do you think the model is doing a good job of simulating realistic data that match your original measurements? Why or why not?

The two histograms of data are fairly similar (see graphs below). The model is doing a good job of simulating the data. This is likely because my data was a strong fit to an exponential distribution.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.