--- title: "Lab-Regression" author: "J. Lucas McKay" date: "2023-03-29" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = T, collapse = T, warning = F, message = F) ``` *** #### 1. Equivalence of F and t tests when there are two groups ```{r} library(tidyverse) mean0 = function(x) round(mean(x,na.rm = T),0) sd0 = function(x) round(sd(x,na.rm = T),0) ss0 = function(x) round(sum((x-mean(x))^2),0) n = 50 set.seed(510) durations = bind_rows( tibble(Gender = "Male", Years = rnorm(n,74,10)), tibble(Gender = "Female", Years = rnorm(n,82,10))) |> mutate(Gender = forcats::fct_rev(factor(Gender))) # make a dot plot durations |> ggplot(aes(Gender,Years)) + geom_jitter(width = 0.1) + expand_limits(y=0) # box plot. note that outlier.shape=NA prevents outliers from being drawn, which is useful. durations |> ggplot(aes(Gender,Years)) + geom_boxplot(outlier.shape=NA) + theme_minimal() ``` ```{r} # Analysis of Variance lm(Years~Gender,data = durations) |> anova() # The default t test is actually a Welch, as you recall. t.test(Years~Gender,data = durations) # In order to exactly approximate the F, we must use the "var.equal" flag. t.test(Years~Gender,data = durations,var.equal=T) ``` *** #### 2. Working with regression diagnostics One other thing to note is that the fitted object contains several useful pieces of functionality. First of all, plotting the object shows several diagnostic plots. ```{r} mod = lm(Years~Gender,data = durations) plot(mod) # The residual plot is probably the most important, as it allows you to assess the assumptions of the linear regression. # Note that observations with a lot of leverage are noted by row number. ``` ```{r} # Let's try something with more of a dense x axis. set = read_csv("https://jlucasmckay.bmi.emory.edu/global/bmi510/Labs-Materials/set-shifting.csv") # This dataset contains various outcomes from a paper I did in 2018. # https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5960619/ m_set = lm(trails_a~moca_score,data = set) plot(m_set) # 1. We can see that there is no major slope or nonlinearity in residuals vs. fitted, which means that the linear relationship is probably ok. If there was a parabola here we might consider a different function for the relationship, like a polynomial. # 2. The Q-Q plot shows that some of the extreme values are laying far from the linear relationship. To ameliorate this you could perhaps do a variance transformation, or just iterate the analysis on the middle of the data and demonstrate that your slope and p value were not particularly sensitive to these cases. # 3. Scale-Location is really designed to show the homogeneity of the residuals, I do not think it adds too much to the first plot. This kind of has a squashed oval shape, which you would expect if fewer x values were at the extrema. # 4. The Cook's distance plot is used to identify data that are exerting undue leverage on the relationship. In this case we cannot actually see the Cook's distance cutoffs. # Let's try another relationship that clearly should not be a linear regression x = set$moca_score y = set$trails_a + 17*set$trails_b^3 plot(lm(y~x)) x = set$moca_score y = set$trails_a + 17*set$trails_b^3+5*set$trails_b>60 plot(lm(y~x)) # We would guess that overall cognitive performance would predict psychomotor speed. set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth() # To get a nice linear fit, we set se=F and set the model to linear. 
***

#### 2. Working with regression diagnostics

One other thing to note is that the fitted model object provides several useful methods. First of all, plotting the object produces a set of diagnostic plots.

```{r}
mod = lm(Years~Gender,data = durations)
plot(mod)
# The residual plot is probably the most important, as it allows you to assess the assumptions of the linear regression.
# Note that observations with a lot of leverage are flagged by row number.
```

```{r}
# Let's try something with a denser x axis.
set = read_csv("https://jlucasmckay.bmi.emory.edu/global/bmi510/Labs-Materials/set-shifting.csv")
# This dataset contains various outcomes from a paper I did in 2018.
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5960619/

m_set = lm(trails_a~moca_score,data = set)
plot(m_set)

# 1. We can see that there is no major slope or nonlinearity in residuals vs. fitted, which means that the linear relationship is probably ok. If there were a parabola here we might consider a different functional form for the relationship, like a polynomial.
# 2. The Q-Q plot shows that some of the extreme values are lying far from the linear relationship. To ameliorate this you could perhaps apply a variance-stabilizing transformation (see section 6), or just repeat the analysis on the middle of the data and demonstrate that your slope and p value were not particularly sensitive to these cases.
# 3. Scale-Location is really designed to show the homogeneity of the residuals; I do not think it adds too much to the first plot. This has a somewhat squashed oval shape, which you would expect if fewer x values were at the extrema.
# 4. The residuals vs. leverage plot (with Cook's distance contours) is used to identify observations exerting undue leverage on the relationship. In this case the Cook's distance cutoffs fall outside the plotting region, so we cannot actually see them.

# Let's try another relationship that clearly should not be a linear regression
x = set$moca_score
y = set$trails_a + 17*set$trails_b^3
plot(lm(y~x))

x = set$moca_score
# note that operator precedence makes this y dichotomous: the whole sum is compared to 60
y = set$trails_a + 17*set$trails_b^3+5*set$trails_b>60
plot(lm(y~x))

# We would guess that overall cognitive performance would predict psychomotor speed.
set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth()

# To get a simple linear fit, we set se=F and set the smoothing method to "lm".
set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth(se = F, method = "lm", fullrange=T)

set |> ggplot(aes(moca_score,trails_a,color = factor(haspd))) + geom_point() + geom_smooth(se = F, method = "lm")

# Note the effect of the fullrange argument: the group-wise lines are extended across the full x range.
set |> ggplot(aes(moca_score,trails_a,color = factor(haspd))) + geom_point() + geom_smooth(se = F, method = "lm", fullrange=T)
```

A note - we will get to interaction terms soon. How would we test whether the presence of PD modifies the relationship between trails_a and MoCA score? (A preview sketch appears at the end of this lab.)

***

#### 3. Extracting elements of models

```{r}
mod = lm(trails_a~moca_score,data = set)

# extract coefficients
betas = coef(mod)

# extract variances of the coefficients - these are the diagonal of the variance-covariance matrix
beta_vars = diag(vcov(mod))
beta_ses = sqrt(beta_vars)

t_stats = betas / beta_ses
# these are distributed as a t with n-2 degrees of freedom
p_vals = 2 * pt(abs(t_stats), df = df.residual(mod), lower.tail = FALSE)

# you can access the residuals in two ways
residuals(mod)
mod$residuals

# what will this be?
mean(mod$residuals)

# either way is sufficient to calculate the residual sum of squares
rss = sum(mod$residuals^2)

# we can extract the degrees of freedom for the residuals as well
df.residual(mod)
# the residual standard error is the square root of the residual mean square
mse = rss/df.residual(mod)
rse = sqrt(mse)

# how could we get the SSMOD?
sst = sum((set$trails_a - mean(set$trails_a))^2)
# this returns NA - there are two missing values in this vector, which lm has already accommodated by dropping those rows
sst = sum((set$trails_a - mean(set$trails_a,na.rm=T))^2,na.rm=T)

# we can get the model sum of squares by subtracting the residual sum of squares from the total sum of squares.
ssmod = sst - rss

# compare to ANOVA
anova(mod)
```

***

#### 4. R^2 and the correlation coefficient

```{r}
# just for nomenclature purposes let's say "sum squares residuals"
ssr = rss
r_squared = ssmod/sst
# R^2 equals the squared Pearson correlation between the two variables
cor(set$trails_a,set$moca_score, use = "complete.obs")
cor(set$trails_a,set$moca_score, use = "complete.obs")^2
```

***

#### 5. Confidence and prediction intervals.

```{r}
mod = lm(trails_a~moca_score,data = set)
newdata = tibble(moca_score = seq(10,30,1))

# confidence interval for the average predicted value at each value of x
predict(mod, newdata, interval="confidence")

# prediction interval for a new individual observation at each value of x
predict(mod, newdata, interval="prediction")

# plot the confidence bands
predict(mod, newdata, interval="confidence") |>
  as_tibble() |>
  bind_cols(newdata) |>
  pivot_longer(-moca_score) |>
  arrange(name,value) |>
  ggplot(aes(moca_score,value,color = name)) +
  geom_line(show.legend = F) +
  theme_minimal() +
  scale_color_viridis_d(end=0.8)

# cute plot - note how much wider the prediction bands are
predict(mod, newdata, interval="prediction") |>
  as_tibble() |>
  bind_cols(newdata) |>
  pivot_longer(-moca_score) |>
  arrange(name,value) |>
  ggplot(aes(moca_score,value,color = name)) +
  geom_line(show.legend = F) +
  theme_minimal() +
  scale_color_viridis_d(end=0.8)
```

***

#### 6. Variance-normalizing transforms.

The Box-Cox transformation depends on the model at hand as well as on the individual y and x data.

```{r}
mod = lm(trails_a~moca_score,data = set)
bc = MASS::boxcox(mod)
# find the value of lambda that maximizes the profile log-likelihood
bc$x[which.max(bc$y)]
# a lambda near zero suggests a natural log transformation, which doesn't change the answer all that much aside from resulting in a smaller p value
broom::tidy(lm(trails_a~moca_score,data = set))
broom::tidy(lm(log(trails_a)~moca_score,data = set))
```
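***

The quantities behind the `plot()` diagnostics in section 2 can also be pulled out numerically. Here is a minimal sketch using the `m_set` model and the base R extractors `rstandard()`, `hatvalues()`, and `cooks.distance()`.

```{r}
# collect the pieces that drive the diagnostic panels for m_set
set_diag = tibble(
  fitted = fitted(m_set),          # x axis of residuals vs. fitted
  resid = residuals(m_set),        # raw residuals
  std_resid = rstandard(m_set),    # standardized residuals (Q-Q and scale-location)
  leverage = hatvalues(m_set),     # hat values (residuals vs. leverage)
  cooks_d = cooks.distance(m_set)  # Cook's distance
)
# the observations with the largest Cook's distance have the most influence on the fit
set_diag |> arrange(desc(cooks_d)) |> head(3)
```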
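***

Returning to the note at the end of section 2, here is a preview sketch of how we could ask whether PD status modifies the relationship between trails_a and MoCA score. This is only a sketch - it treats `haspd` as a grouping factor, and we will cover interaction terms properly later.

```{r}
# does PD status modify the moca_score slope?
m_int = lm(trails_a ~ moca_score * factor(haspd), data = set)
broom::tidy(m_int)
# the interaction row tests whether the slope differs between the PD and non-PD groups;
# equivalently, compare the models with and without the interaction term
anova(lm(trails_a ~ moca_score + factor(haspd), data = set), m_int)
```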
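***

Finally, to connect the `predict()` intervals in section 5 to the underlying formula, here is a sketch that rebuilds the 95% prediction interval at a single MoCA score (25, chosen arbitrarily) by hand. It uses `model.frame()` to recover the complete cases actually used by the model, since trails_a has missing values.

```{r}
# prediction interval at moca_score = 25, computed from the formula
mf = model.frame(mod)                  # rows actually used by lm()
x = mf$moca_score
n = length(x)
mse = sum(residuals(mod)^2)/df.residual(mod)
x0 = 25
yhat0 = unname(coef(mod)[1] + coef(mod)[2]*x0)
se_pred = sqrt(mse*(1 + 1/n + (x0 - mean(x))^2/sum((x - mean(x))^2)))
c(fit = yhat0,
  lwr = yhat0 - qt(0.975, df.residual(mod))*se_pred,
  upr = yhat0 + qt(0.975, df.residual(mod))*se_pred)
# this should match the corresponding row from predict()
predict(mod, tibble(moca_score = 25), interval = "prediction")
```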