credit: Rafael Irizarry
Modeling a response (output) variable as a function of several explanatory (input) variables: multiple regression for a continuous response, logistic regression for a binary response, and log-linear models for count data. Finding low-dimensional structure: principal components analysis. Cluster analysis.
This course emphasizes (i) applications of statistical methods such as multiple regression, binary regression, principal component analysis; (ii) the use of statistical software to do the computations; and (iii) interpretation of statistical analysis and output of statistical software. There is some linear algebra (with matrix representations) to show how multiple regression is computed in software, and there is some probability (mainly expected values, variances and covariances for linear combinations) to show how standard errors are determined for parameter estimates and predictions.
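The three model types named above, plus principal components analysis, all have one-line fits in base R. A minimal sketch using simulated data (none of these variables come from the course datasets):

```r
# Sketch: the main model types covered in this course, fit to toy data.
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)

# Multiple regression: continuous response
y_cont <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(50)
fit_lm <- lm(y_cont ~ x1 + x2)

# Logistic regression: binary response
y_bin <- rbinom(50, 1, plogis(x1))
fit_logit <- glm(y_bin ~ x1, family = binomial)

# Log-linear (Poisson) model: count response
y_count <- rpois(50, exp(0.5 + 0.3 * x1))
fit_pois <- glm(y_count ~ x1, family = poisson)

# Principal components analysis: low-dimensional structure
pc <- prcomp(cbind(x1, x2, y_cont), scale. = TRUE)
summary(fit_lm)$coefficients
```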
Your grade will be calculated according to the following weighting scheme: Labs 9%, Homeworks 21%, Midterm 30%, Final 40%. If the midterm exam is missed, its weight is transferred to the final exam.
We will be using custom course notes by Prof. Harry Joe, available at the UBC bookstore.
Lectures: 9:30am-11:00am Tuesday and Thursday, Earth Sciences Building 1012
Labs: Tuesdays and Fridays
Last day to withdraw without a W standing: January 17, 2018
Last day to withdraw with a W standing (course cannot be dropped after this date): February 09, 2018
Midterm: March 01, 2018
Final: April 23, 2018
Section 2.1 Least Squares for one predictor
Residuals + Section 2.2 Statistical linear regression model
Section 2.2 (continued) + Section 2.5 Intervals for simple linear regression
Section 2.6 Explanation of Student t quantiles used in the interval estimates
Section 3.1 Least squares with two or more explanatory variables
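Since the course notes use matrix representations to show how multiple regression is computed, here is a sketch (simulated data) comparing the closed-form least squares solution with lm():

```r
# Sketch: least squares with two explanatory variables, done two ways.
# The closed form beta_hat = (X'X)^{-1} X'y should match lm()'s estimates.
set.seed(306)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                   # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x1 + x2)
cbind(matrix_form = beta_hat, lm_form = coef(fit))
```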
R code for Section 2.8, Question 2.1
R code for a 3D plot with fitted regression surface
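A similar 3D surface can be drawn in base R with persp(); this is a sketch with simulated data, not a reproduction of the posted file:

```r
# Sketch: plot a fitted regression surface over a grid with persp().
set.seed(2)
x1 <- runif(60); x2 <- runif(60)
y  <- 1 + x1 + 2 * x2 + rnorm(60, sd = 0.2)
fit <- lm(y ~ x1 + x2)

g1 <- seq(0, 1, length.out = 25)
g2 <- seq(0, 1, length.out = 25)
zhat <- outer(g1, g2, function(a, b)
  predict(fit, newdata = data.frame(x1 = a, x2 = b)))

persp(g1, g2, zhat, theta = 30, phi = 20,
      xlab = "x1", ylab = "x2", zlab = "fitted y")
```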
Section 3.6 + Section 3.7 + Section 3.8 + Section 3.9
Section 3.9 - Categorical explanatory variables
Short YouTube video on categorical variables in R
Section 3.9 (recap) + quadratic terms + Section 3.10 - Partial correlation
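A categorical explanatory variable (as a factor) and a quadratic term can go in the same lm() call; a sketch with simulated data:

```r
# Sketch: factor predictor plus a quadratic term in one model.
set.seed(3)
n   <- 90
grp <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
x   <- runif(n, -2, 2)
y   <- 1 + 0.5 * x + 0.8 * x^2 + (grp == "B") * 2 + rnorm(n)

fit <- lm(y ~ grp + x + I(x^2))   # I() protects the quadratic term
coef(fit)  # grpB, grpC are shifts relative to the baseline level "A"
```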
Note: there was originally a typo on slides 44-45 that has now been corrected.
Search online for “mtcars linear regression” and you will find some great material, for example:
https://rstudio-pubs-static.s3.amazonaws.com/193417_4e1f9d5b1c6f472885fc5b03df9d4331.html
http://rstudio-pubs-static.s3.amazonaws.com/20516_29b941670a4b42688292b4bb892a660f.html
Excellent website on Simpson’s paradox: http://vudlab.com/simpsons/
Section 3.11 - Multicollinearity (updated)
Note: Lecture 12 slides were originally posted with an error on slide 14. This has now been fixed.
Section 3.12 - Interpretations
Have a listen: podcast about causation
Have a read: Jagsi2012.pdf , Giuli2014.pdf , Haby2011.pdf, Promislow2002.pdf, Gupta2010.pdf, Fallowfield2002.pdf
Section 3.13 - Summary for multiple regression
Here is a list of suggested questions from Chapter 2 and Chapter 3:
Chapter 2: 2.1, 2.5, 2.6, and 2.8 (for scale changes/location shifts: 2.3, 2.4, and 2.10)
Chapter 3: 3.1, 3.8, 3.9, 3.10, and 3.11
NYT article on Funding for gun violence research , StarkShaw2017.pdf , gunresearchdata.csv , gunresearch.R
YouTube Video on Cross-Validation 1, YouTube Video on Cross-Validation 2
4.3 Additional diagnostics (Cook’s distance, influential observations)
Great website about diagnostics
Great tips for model diagnostics in R: https://www.statmethods.net/stats/rdiagnostics.html
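The Section 4.3 diagnostics are all available in base R; a sketch with simulated data (one planted outlier):

```r
# Sketch: residual plots, Cook's distance, and flagging influential points.
set.seed(4)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
y[40] <- y[40] + 8                  # plant one outlier
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
influential <- which(cd > 4 / length(cd))  # a common rule-of-thumb cutoff
influential

par(mfrow = c(2, 2)); plot(fit)     # the four standard lm diagnostic plots
```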
4.4 Transforms and nonlinearity and 4.5 Diagnostics for data collected sequentially in time
Note for the interpretations of log-linear and linear-log models: the original lecture 17 slides were correct. My apologies for thinking that there were mistakes! Clearly, interpretation of these models can be difficult. I have now added some extra explanations and derivations on the slides. Let me know if you have any questions.
Good simple explanations on how to interpret models with log-transformations:
http://kenbenoit.net/assets/courses/ME104/logmodels2.pdf
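To make the log-linear interpretation concrete, here is a sketch with simulated data: in a model log(y) = b0 + b1*x, a one-unit increase in x multiplies y by exp(b1), roughly a 100*b1 percent change when b1 is small:

```r
# Sketch: fitting and interpreting a log-linear model.
set.seed(5)
x <- runif(100, 0, 10)
y <- exp(0.5 + 0.1 * x + rnorm(100, sd = 0.2))
fit <- lm(log(y) ~ x)

b1 <- coef(fit)["x"]
exp(b1)               # multiplicative effect on y per unit increase in x
100 * (exp(b1) - 1)   # approximate percent change
```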
6.2 Logistic regression (Part 1)
Good website that explains basics of logistic regression
6.2 Logistic regression (Part 2)
Good website that explains Likelihood for logistic regression
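A minimal logistic regression sketch in base R (simulated data, not a course dataset), showing the log-odds coefficients, the odds ratio, and fitted probabilities:

```r
# Sketch: logistic regression with glm() and predicted probabilities.
set.seed(6)
x <- rnorm(200)
p <- plogis(-0.5 + 1.2 * x)        # true success probabilities
y <- rbinom(200, 1, p)

fit <- glm(y ~ x, family = binomial)
coef(fit)                          # on the log-odds scale
exp(coef(fit)["x"])                # odds ratio per unit increase in x
phat <- predict(fit, type = "response")  # fitted probabilities in (0, 1)
head(round(phat, 3))
```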
6.3 Logistic regression (Part 3) and Principal Component Regression (PCR)
PCR in the news: https://motherboard.vice.com/en_us/article/mg9vvn/how-our-likes-helped-trump-win
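Principal component regression can be sketched by hand in base R: run PCA on the predictors, then regress the response on the leading component scores. Simulated data with two nearly collinear predictors:

```r
# Sketch: PCR = PCA on the predictors, then lm() on the leading scores.
set.seed(7)
n  <- 120
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)      # x1 and x2 nearly collinear
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

pc     <- prcomp(cbind(x1, x2, x3), scale. = TRUE)
scores <- pc$x[, 1:2]              # keep the first two components
fit    <- lm(y ~ scores)
summary(fit)$r.squared
```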
Data: wine_csv.csv
Data: carsub.csv
Each homework assignment is worth 3% of your final grade. Homework can be found and completed on https://webwork.elearning.ubc.ca/ “STAT306-201_2017W2”.
### Optional Questions Set 1
Consider data listing the prices and attributes of 465 randomly selected BCL products under $100. Using the provided data, BCL_data_available.csv, define a linear regression model (an lm() object) with the goal of making accurate predictions. Your model will be used to predict the prices of 35 BCL products we have set aside in a secret holdout dataset.
The accuracy of the predictions will be judged by root mean squared error (RMSE). The outcome variable for your model should be log(CURRENT_DISPLAY_PRICE), and the predictor variables should be limited to those in the BCL_data_available dataframe and functions of those variables.
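For reference, RMSE on the log-price scale would be computed as below (a sketch; the toy numbers here are illustrative, not contest data):

```r
# Sketch: root mean squared error between held-out values and predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy illustration on a log-price scale
actual    <- log(c(10, 20, 30))
predicted <- log(c(12, 18, 33))
rmse(actual, predicted)
```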
Sample submission: 12345678.R
Please email your submission to stat306bclcontest@gmail.com
Please email any questions about submissions to Bo Chang (bchang@stat.ubc.ca)
Results:
new grading scheme:
RMSE < 0.45: 3.5 pts; 0.45-0.46: 3 pts; 0.46-0.51: 2.5 pts; 0.51-0.53: 2 pts; 0.53-0.55: 1.5 pts; RMSE > 0.55: 0 pts.
Labs will focus on learning R. Each Lab Quiz is worth 1% of your final grade. Lab quizzes can be found and completed on Webwork https://webwork.elearning.ubc.ca/ under "STAT306-201_2017W2".
data: Age_vs_Money_data.csv
Some R Basics; Reading in data; calculating sample statistics; t-test.
Lab Quiz 1 due January 18.
data: Age_vs_Money_data.csv, data: Hubble.txt
Simple plot of the data; Simple Linear Regression with lm(); linear regression with “Hubble” data set.
Lab Quiz 2 due January 25.
Lab Quiz 3 due February 1.
Lab Quiz due February 8.
data: moviegross.txt, Lab Quiz due February 15.
Variance Inflation Factor (VIF); signs of multicollinearity; adjusted R-squared; interaction terms.
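VIFs can be computed from first principles: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors (the car package's vif() gives the same numbers). A sketch with simulated data:

```r
# Sketch: variance inflation factors by hand; x1 and x2 are nearly collinear.
set.seed(8)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)       # strongly correlated with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

vif_one <- function(j, X) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(1:3, vif_one, X = X)
vifs   # large values for x1 and x2 signal multicollinearity
```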
data: burnaby_condos.csv
Lab quiz due March 15.
Leave-one-out Cross validation. K-fold cross validation.
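Both procedures can be written in a few lines of base R; K = n gives leave-one-out. A sketch with simulated data:

```r
# Sketch: K-fold cross-validated mean squared error for a simple lm() fit.
set.seed(9)
n <- 60
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

cv_mse <- function(data, folds) {
  idx <- sample(rep(1:folds, length.out = nrow(data)))  # random fold labels
  errs <- sapply(1:folds, function(k) {
    fit  <- lm(y ~ x, data = data[idx != k, ])          # train on K-1 folds
    pred <- predict(fit, newdata = data[idx == k, ])    # predict held-out fold
    (data$y[idx == k] - pred)^2
  })
  mean(unlist(errs))
}

cv_mse(d, folds = nrow(d))   # leave-one-out (K = n)
cv_mse(d, folds = 5)         # 5-fold
```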
data: bwt.txt,
Lab quiz 8 due on March 22.
Logistic Regression, prediction and cross-validation.
Step by Step logistic regression example with R
data: newbie.txt,
Lab quiz 9 due on March 29.
Here is a copy of last year's final exam. TAs will go through some of the solutions during this lab. The final question is extra challenging and should get you thinking.
You will need to download and install R (free and open-source): http://www.r-project.org/.
I recommend you also download and install RStudio, an excellent interface for programming in R: RStudio
If you are new to R and/or programming, I highly recommend spending a bit of time with this interactive tutorial. Take a look: http://tryr.codeschool.com/
To learn your first steps, I recommend: https://www.sitepoint.com/introduction-r-rstudio/
Swirl: interactive coding lessons in RStudio
If you learn by watching YouTube videos, here is a series of two-minute videos that explain many R functions.
http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/7-t-tests
http://www.stat.columbia.edu/~martin/W2024/R2.pdf
https://www.seas.upenn.edu/~ese302/extra_mtls/Regression_Notes.pdf
UBC STATSPACE has incredible online interactive tutorials to help you understand important fundamentals.
Take a look and be amazed:
https://statspace.elearning.ubc.ca/sim-example.jsp#jump
Excellent explanation of multiple linear regression: http://mezeylab.cb.bscb.cornell.edu/labmembers/documents/supplement%205%20-%20multiple%20regression.pdf
UBC STATSPACE has videos online as well!
Take a look:
https://statspace.elearning.ubc.ca/video-example.jsp
Seeing Theory is a BEAUTIFUL website that visualizes the fundamental concepts of statistics.
Take a look:
http://students.brown.edu/seeing-theory/
“Linear Models with R” by Julian J. Faraway is an EXCELLENT book about linear regression and teaches R code as well. The second edition is available online from the UBC library.