One way we can measure how well a model fits the data is by looking at the r^2 value. r^2 is a number between 0 and 1 that tells you how well the model can predict the data points. A high r^2 value means the points are close to the line. Let’s explore how to find the line of best fit and see the connection between r^2 and residuals. Follow your teacher’s instructions to complete this Desmos activity.
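To make the link between r^2 and residuals concrete, here is a minimal Python sketch. It assumes the usual definition for a least-squares line, r^2 = 1 - (sum of squared residuals) / (total squared distance of the y-values from their mean). The function name and the data points are hypothetical and are not taken from the Desmos activity.

```python
def r_squared(xs, ys, slope, intercept):
    """r^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    # Residual = observed y minus the y predicted by the line
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of the observed y-values around their mean
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data (not from the activity)
xs = [1, 2, 3, 4]

# Every point lies exactly on y = 2x, so every residual is 0
r2_perfect = r_squared(xs, [2, 4, 6, 8], slope=2, intercept=0)

# Points scattered around their least-squares line y = 1.6x + 1
r2_scattered = r_squared(xs, [3, 3, 7, 7], slope=1.6, intercept=1)

print(r2_perfect, round(r2_scattered, 2))
```

In this sketch, smaller residuals push r^2 toward 1 and larger residuals pull it toward 0.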
Math Objectives
Calculate the residual for a point using a linear regression model
Describe how OLS determines the line of best fit
Analyze the connection between residuals and r^2 values
Interpret a regression model in context, including describing trends and identifying predicted and observed values
Common Core Math Standards
Link to all CCSS Math
Personal Finance Objectives
Analyze the relationship between college sticker price and student:faculty ratio.
Use regression models to summarize the relationship between college sticker price and another variable, including graduation rate, time, and ACT scores.
National Standards for Personal Financial Education
Earning Income
3a: Evaluate the costs and benefits of investing in additional education or training
Distribute to students
Three friends - Harper, Luke, and Jaxon - are trying to decide the best spot to meet up. Compare the two options based on how far each friend would need to travel to get there.
Cafe Mezze
Harper lives next door and is 0 miles away
Luke lives 15 miles away
Jaxon lives 3 miles away
Rabat Restaurant
Harper lives 6 miles away
Luke lives 5 miles away
Jaxon lives 7 miles away
Which option do you think they should choose? Why?
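One possible way to compare the two options is to add up the distances, and then to add up the squared distances. The short Python sketch below uses only the distances listed above. The choice of these two summaries is an assumption for illustration; the activity leaves the comparison method open.

```python
# Distances in miles for Harper, Luke, and Jaxon, as listed above
distances = {
    "Cafe Mezze": [0, 15, 3],
    "Rabat Restaurant": [6, 5, 7],
}

# Total miles traveled by the group for each option
totals = {place: sum(miles) for place, miles in distances.items()}

# Sum of squared distances: one very long trip counts much more heavily
squared_totals = {
    place: sum(m ** 2 for m in miles) for place, miles in distances.items()
}

for place in distances:
    print(place, totals[place], squared_totals[place])
```

Both options have the same total distance (18 miles), but the sums of squared distances differ (234 for Cafe Mezze, 110 for Rabat Restaurant) because squaring penalizes Luke's 15-mile trip heavily.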
In the Intro, you compared how far the three friends were from a particular place. You can apply the same concept to a scatterplot with a line of best fit. The vertical distance between a data point and the line is called a residual. Just like the friends chose the restaurant that was best for all of them, we can use residuals to find the line that best fits the data points.
Let’s look at the example scatterplot below, which shows data on 6 colleges that Kelly is interested in. Kelly looked up the acceptance rate and average student grant aid for each college. Use the graph to answer the following questions.
What is the average student grant aid for the school that admits 10% of students? Estimate based on the data point.
What does Kelly’s line of best fit predict the average student grant aid will be for a school that admits 10% of students? Estimate based on the graph or by using the equation for the line of best fit: ŷ = -0.32x + 45.50.
What is the difference between those two values?
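The three questions above walk through a residual calculation: observed value minus predicted value. A minimal Python sketch follows, using Kelly's equation ŷ = -0.32x + 45.50 with x entered as the acceptance rate in percent. The observed value in the sketch is a hypothetical placeholder, because the real value has to be read from the scatterplot.

```python
def predicted_grant_aid(acceptance_rate):
    """Kelly's line of best fit: y-hat = -0.32x + 45.50."""
    return -0.32 * acceptance_rate + 45.50

def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted

predicted = predicted_grant_aid(10)

# Hypothetical observed value; replace it with your estimate from the graph
observed_guess = 50.0

kelly_residual = residual(observed_guess, predicted)
print(round(predicted, 2), round(kelly_residual, 2))
```

With this convention, a point above the line has a positive residual and a point below the line has a negative residual.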
When choosing a college, one factor you might consider is the student:faculty ratio. This ratio tells you how many students there are for each faculty member. A lower student:faculty ratio might mean smaller class sizes, more opportunities to talk to professors, and more individualized support.
The scatterplot below shows data on sticker price and student:faculty ratio for a representative sample of 27 4-year US colleges.
Sketch the line that you think best fits the data below.
Based on your line of best fit, what is the residual at x = 60?
Describe the correlation between the sticker price and student:faculty ratio. Consider: is it strong or weak? Positive or negative?
Compare your line with your classmate(s). How do you think we can figure out which line is the BEST possible fit?
Residuals are a part of finding the line of best fit. That process is called Ordinary Least Squares (OLS) Regression.
OLS regression figures out which line is the BEST fit by finding the line that minimizes the sum of the squared residuals.
Let’s see how OLS regression works for our data on sticker price and student:faculty ratio. The graph on the left shows the residuals for all of the data points. The graph on the right squares each of those residuals. The best fit line makes the total area of those squares as small as possible.
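The idea of minimizing the total area of the squares can be sketched in a few lines of Python. The data below are hypothetical (sticker price in thousands of dollars, student:faculty ratio) and are not the 27 colleges from the graph. The function names are illustrative, and the slope and intercept use the standard closed-form least-squares formulas.

```python
def ols_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Total area of the squares drawn on each residual."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: sticker price ($1000s) and student:faculty ratio
xs = [10, 20, 30, 40, 50]
ys = [18, 16, 15, 11, 10]

slope, intercept = ols_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Nudging the slope or intercept away from the OLS values makes the total larger
nudged = [
    sum_squared_residuals(xs, ys, slope + 0.05, intercept),
    sum_squared_residuals(xs, ys, slope - 0.05, intercept),
    sum_squared_residuals(xs, ys, slope, intercept + 1),
    sum_squared_residuals(xs, ys, slope, intercept - 1),
]
print(round(slope, 2), round(intercept, 2), round(best, 2))
```

Any other line drawn through these points, including a hand-sketched one, gives a sum of squared residuals at least as large as the OLS line's.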
Which data point has the residual with the greatest vertical distance? Approximate the coordinates.
The average annual sticker price of a 4-year college has increased by approximately $10,000 since 2000, adjusted for inflation.[2] Do you think that has caused a corresponding decrease in the average student:faculty ratio over the same time period? Why or why not?
Desmos Link: What did you learn from this activity?
Ethan is researching the relationship between a college’s sticker price and its graduation rate. He finds data for 35 4-year colleges and uses “sticker price” to represent the published annual tuition and fees for an out-of-state student.[1]
Ethan uses linear regression to find the line of best fit. Here are his results:
ŷ = 0.87x + 35.98
r^2 = 0.57
Describe the correlation between sticker price and graduation rate. Consider: is the correlation strong, weak, or nonexistent? Is the correlation positive or negative?
Ethan concludes, “The slope of the line of best fit is 0.87. That means that if a college increases their sticker price by $1000, it will cause their graduation rate to increase by 0.87 percentage points.” Is Ethan correct? Why or why not?
Level 2
Kiana is researching how the costs of college have changed over time. She finds data on the average annual cost of attending a 4-year college since 1970, adjusted for inflation. It includes the cost of tuition, required fees, housing, and food.
Kiana uses linear regression to find the line of best fit. Here are her results:
ŷ = 0.42x + 8.11
r^2 = 0.95
Based on her linear regression model, what is the predicted annual cost of college in 2030 (60 years after 1970)?
Based on this exponential model, what is the predicted cost of college in 2030 (after 60 years)? How does that compare to the cost predicted by the linear model in Question 1?
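As a check on the linear prediction, the sketch below plugs x = 60 into Kiana's equation ŷ = 0.42x + 8.11, where x is the number of years after 1970. The units of ŷ are not stated in the text, so the result should be read in whatever units the graph's vertical axis uses. The exponential model is not written out here, so it is not reproduced.

```python
def linear_cost(years_since_1970):
    """Kiana's linear model: y-hat = 0.42x + 8.11."""
    return 0.42 * years_since_1970 + 8.11

# 2030 is 60 years after 1970
prediction_2030 = linear_cost(2030 - 1970)
print(round(prediction_2030, 2))
```

The same function can be evaluated at other years to see that the linear model adds the same amount, 0.42 units, for every additional year.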
We can learn a lot more about our model from residuals. A residual plot is a graph that shows just the residuals from a regression model. Analyzing it can help you notice patterns and determine whether you’ve chosen the best type of function to fit the data. Watch the video and answer the questions.
For a point above the line of best fit…
The residual will be positive
The residual will be negative
The residual will be equal to zero
The residual will be equal to the predicted y value
The equation for this regression line is ŷ = -0.16x + 18. Which of the following statements about the slope is correct?
The slope is -0.16. This means the predicted student:faculty ratio decreases by 0.16 for every additional $1000 in tuition
The slope is -0.16. This means the predicted student:faculty ratio is the tuition multiplied by 0.16.
The slope is 18. This means the predicted student:faculty ratio increases by 18 for every additional $1000 in tuition.
The slope is 18. This means the average student:faculty ratio is 18.
What would happen to the r^2 value if you removed the outlier at (6, 75)? Circle your answer and explain your reasoning.
The r^2 value would increase if you removed the outlier
The r^2 value would decrease if you removed the outlier
The r^2 value would stay the same if you removed the outlier
Based on the exponential model, how much are college costs increasing each year?
Which model do you think best describes how college costs have changed over time? Explain your reasoning.
If a data point has a residual of -4, that means…
a. The data point is above the line of best fit
b. The data point is below the line of best fit
c. The data point is on the line of best fit
d. The data point is an outlier
Here are the residual plots for four different regression models. Based on the residuals, which model do you think is the best fit? Explain your reasoning.
How does college sticker price relate to students’ ACT scores? In this activity, you will explore real-world data and determine which model fits best. Follow your teacher’s instructions to complete this Desmos activity.