
Statistical Methods in Decision Making - Exploring the Relationship Between Height and Weight: A Comprehensive Analysis

 Questions:

  1. How can we check if X = height and Y=weight are linearly related? 
  2. What is the equation of the best fit line? 
  3. Explain how we can test if the slope and intercept are significant? 
  4. What does the T Test signify in regression?
  5. Plot the residuals and explain your findings.
  6. What are the assumptions of regression?
  7. Predict the value of Y when X = 25 inches

Data Set:

Height (inches)   Weight (lbs)

53 140.5
54 143
54 156
54 144
54 142
55 162.5
56 162
57 166.5
57 143
58 165
59 157.5
59 161.5
60 170
61 173.5
62 161
64 166
65 138
65 174.5
67 195
67 181.5
67 184
67 173.5
68 240
68 176
70 135
71 200.5
73 145
73 196.5
73 162
74 210




Q1: How can we check if X = height and Y = weight are linearly related?

Answer:
Visualizing the data through a scatter plot of weight (Y) against height (X):







Referring to the Excel regression statistics, we observe R² = 0.28, meaning that height explains about 28% of the variation in weight.



Conclusion: The correlation coefficient (r) ranges from -1 to +1. Its sign gives the direction of the linear relationship and its magnitude gives the strength. Here r = 0.5254, which is consistent with R² = 0.28 (0.5254² ≈ 0.276). Since r is positive, the relationship is positive: as height increases, weight also tends to increase. The magnitude is neither close to 0 nor close to 1, so the relationship is neither weak nor strong; it indicates a moderate correlation. In conclusion, height and weight have a moderately positive linear relationship.
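The correlation value can be checked outside Excel. Below is a minimal sketch in plain Python, assuming the 30 data pairs listed above; the function name pearson_r is chosen here for illustration and is not part of the original analysis.

```python
import math

# Height (inches) and weight (lbs) pairs from the data set above.
heights = [53, 54, 54, 54, 54, 55, 56, 57, 57, 58, 59, 59, 60, 61, 62,
           64, 65, 65, 67, 67, 67, 67, 68, 68, 70, 71, 73, 73, 73, 74]
weights = [140.5, 143, 156, 144, 142, 162.5, 162, 166.5, 143, 165,
           157.5, 161.5, 170, 173.5, 161, 166, 138, 174.5, 195, 181.5,
           184, 173.5, 240, 176, 135, 200.5, 145, 196.5, 162, 210]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, weights)
r_squared = r ** 2  # share of the variation in weight explained by height
print(f"r = {r:.4f}, R^2 = {r_squared:.2f}")
```

A value of r near 0.53 with R² near 0.28 matches the Excel output quoted above.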

Q2: What is the equation of the best fit line?

Answer:
To find the equation of the best fit line for the data, we use linear regression. It finds the line that comes closest to all the points by minimizing the sum of the squared vertical distances between the points and the line.
The equation is in the form

 y = mx + b

➔ y is the predicted weight
➔ x is the height
➔ m is the slope (how much weight changes for each one-inch change in height)
➔ b is the y-intercept (the predicted weight when height is zero)
➔ The regression model calculates m and b from the data points to get the best fit line.



The regression gives the following values:
 m (value of slope) = 1.84
 b (y-intercept) = 51.902

y = 1.84x + 51.902
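As a cross-check on the Excel result, the slope and intercept can be computed from the least-squares formulas m = Sxy / Sxx and b = ȳ - m·x̄. This is a sketch in plain Python under the assumption that the data set above is complete; the helper name fit_line is illustrative only.

```python
# Height (inches) and weight (lbs) pairs from the data set above.
heights = [53, 54, 54, 54, 54, 55, 56, 57, 57, 58, 59, 59, 60, 61, 62,
           64, 65, 65, 67, 67, 67, 67, 68, 68, 70, 71, 73, 73, 73, 74]
weights = [140.5, 143, 156, 144, 142, 162.5, 162, 166.5, 143, 165,
           157.5, 161.5, 170, 173.5, 161, 166, 138, 174.5, 195, 181.5,
           184, 173.5, 240, 176, 135, 200.5, 145, 196.5, 162, 210]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


slope, intercept = fit_line(heights, weights)
print(f"y = {slope:.2f}x + {intercept:.3f}")
```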


Q3: Explain how we can test if the slope and intercept are significant?

Answer: We can test the significance of the slope and intercept in a linear regression model using a t-test. The t-test helps determine if the estimated coefficients are significantly different from zero, indicating their statistical significance.
The Excel regression output gives the following estimates for the slope (m) and intercept (b):



m (slope) = 1.84
b (y-intercept) = 51.902
p-value for the slope = 0.0029

For Slope:
Null hypothesis, H₀: The slope of the regression line for Height is zero, i.e. there is no linear relationship between height and weight.

Alternate hypothesis, H₁: The slope of the regression line for Height is not zero, i.e. there is a linear relationship between height and weight.

Conclusion from data:
The p-value of 0.0029 is well below the usual significance level of 0.05, so there is strong evidence to reject the null hypothesis of no linear relationship between Height (inches) and Weight (lbs). We can therefore conclude that the relationship between height and weight is statistically significant.

For Intercept:
Null hypothesis, H₀: The intercept is zero.

Alternate hypothesis, H₁: The intercept is not zero.

Conclusion from data:
The p-value for the intercept is 0.155. Since this is higher than the usual significance level of 0.05, we cannot reject the null hypothesis. In simpler terms, the available data do not provide enough evidence that the intercept differs from zero.
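The two tests can be reproduced by hand. The sketch below computes the standard errors and t statistics for both coefficients in plain Python. Because the standard library has no t distribution, it compares each t statistic with the two-tailed critical value for α = 0.05 and 28 degrees of freedom (about 2.048, taken from a standard t table) instead of computing p-values; the variable names are illustrative.

```python
import math

# Height (inches) and weight (lbs) pairs from the data set above.
heights = [53, 54, 54, 54, 54, 55, 56, 57, 57, 58, 59, 59, 60, 61, 62,
           64, 65, 65, 67, 67, 67, 67, 68, 68, 70, 71, 73, 73, 73, 74]
weights = [140.5, 143, 156, 144, 142, 162.5, 162, 166.5, 143, 165,
           157.5, 161.5, 170, 173.5, 161, 166, 138, 174.5, 195, 181.5,
           184, 173.5, 240, 176, 135, 200.5, 145, 196.5, 162, 210]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n
sxx = sum((x - mean_x) ** 2 for x in heights)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard error with n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(heights, weights))
s = math.sqrt(sse / (n - 2))

# Standard errors of the two coefficients.
se_slope = s / math.sqrt(sxx)
se_intercept = s * math.sqrt(1 / n + mean_x ** 2 / sxx)

t_slope = slope / se_slope
t_intercept = intercept / se_intercept

T_CRIT = 2.048  # two-tailed t critical value, alpha = 0.05, df = 28 (t table)

print(f"slope:     t = {t_slope:.3f}, significant = {abs(t_slope) > T_CRIT}")
print(f"intercept: t = {t_intercept:.3f}, significant = {abs(t_intercept) > T_CRIT}")
```

A coefficient whose |t| exceeds the critical value is significant at the 5% level, which is equivalent to its p-value being below 0.05.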

Q4: What does the T Test signify in regression?

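Answer: The t-test in regression checks whether an estimated coefficient differs significantly from zero. For the slope, the steps are as follows.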


Stating the hypotheses:
The estimated slope says that weight increases by 1.84 lbs on average for each additional inch of height.
Null hypothesis (H₀): The slope for height is equal to zero (m = 0).
Alternative hypothesis (H₁): The slope for height is not equal to zero (m ≠ 0).

Determining the significance level: α = 0.05, corresponding to a 95% confidence level.



Calculating the t-value: divide the coefficient (1.84) by its standard error (0.563).
The calculated t-value is 1.84 / 0.563 = 3.268.

Conclusion based on the p-value:
The corresponding p-value for the calculated t-value (3.268) with degrees of freedom (df) = 28 is 0.00286751.
Since the p-value (0.00286751) is less than the chosen significance level (0.05),
we reject the null hypothesis (H₀: m = 0).

Therefore, we could state as below:

Based on the t-test, we conclude that height is statistically significant in the model: the slope differs from zero, so height contributes meaningfully to predicting weight. Significance is not the same as strength, however. With r = 0.5254, the relationship between height and weight remains moderate.


Q5: Plot the residuals and explain your findings.

Answer:

Reference values:
m (slope) = 1.840027
b (y-intercept) = 51.90161
Predicted value = mx + b = 1.84x + 51.9
Residual = observed value – predicted value (for y)

Based on these, the residuals are tabulated below:
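The residual table can be regenerated with a few lines of Python. This sketch assumes the slope and intercept quoted above and the data set from the start of the post; the column layout is only illustrative.

```python
# Height (inches) and weight (lbs) pairs from the data set above.
heights = [53, 54, 54, 54, 54, 55, 56, 57, 57, 58, 59, 59, 60, 61, 62,
           64, 65, 65, 67, 67, 67, 67, 68, 68, 70, 71, 73, 73, 73, 74]
weights = [140.5, 143, 156, 144, 142, 162.5, 162, 166.5, 143, 165,
           157.5, 161.5, 170, 173.5, 161, 166, 138, 174.5, 195, 181.5,
           184, 173.5, 240, 176, 135, 200.5, 145, 196.5, 162, 210]

SLOPE = 1.840027      # m, from the regression output
INTERCEPT = 51.90161  # b, from the regression output

# Residual = observed weight - predicted weight.
predicted = [INTERCEPT + SLOPE * x for x in heights]
residuals = [y - p for y, p in zip(weights, predicted)]

print(f"{'Height':>6} {'Weight':>7} {'Predicted':>9} {'Residual':>9}")
for x, y, p, e in zip(heights, weights, predicted, residuals):
    print(f"{x:>6} {y:>7.1f} {p:>9.2f} {e:>9.2f}")
```

For a least-squares fit the residuals sum to approximately zero, which is a quick sanity check on the table.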


Below is the scatter plot of the residual values above against height:




The table above displays the residuals, i.e. the differences between the observed values and the values predicted by the regression line. By examining the table, we can make the following observations:

1. For weights between 150 and 170 lbs, most residuals fall within the range of -10 to 10. This suggests that the model is fairly accurate in this range and that the predicted values align closely with the observed values.

2. For weights greater than 170 lbs, the spread of the residuals starts to increase. This indicates that the model becomes less accurate as the values increase: the straight line describes the data less tightly in this higher range, and the variability around the line does not appear to be constant.

3. In the lower ranges, by contrast, the residuals remain relatively stable and small. This suggests that the model performs better and keeps a closer alignment between the predicted and observed values within these lower ranges.

4. In summary, the analysis of the residuals indicates that the model's accuracy varies with the value range. It is more stable in the lower ranges and less accurate as the values increase.

From the scatter plot, the conclusion is the same: as the values increase, the residuals grow in magnitude, so the fitted line describes the data less well at the upper end. The residual plot shows an increasingly scattered pattern rather than a consistent band around zero.
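The widening spread can be put into numbers. The sketch below splits the observations at a predicted weight of 170 lbs and compares the root-mean-square residual on each side. The split on predicted weight is an assumption made here for illustration, and the helper name rms is hypothetical.

```python
import math

# Height (inches) and weight (lbs) pairs from the data set above.
heights = [53, 54, 54, 54, 54, 55, 56, 57, 57, 58, 59, 59, 60, 61, 62,
           64, 65, 65, 67, 67, 67, 67, 68, 68, 70, 71, 73, 73, 73, 74]
weights = [140.5, 143, 156, 144, 142, 162.5, 162, 166.5, 143, 165,
           157.5, 161.5, 170, 173.5, 161, 166, 138, 174.5, 195, 181.5,
           184, 173.5, 240, 176, 135, 200.5, 145, 196.5, 162, 210]

SLOPE = 1.840027      # m, from the regression output
INTERCEPT = 51.90161  # b, from the regression output
THRESHOLD = 170.0     # predicted weight (lbs) used to split the range (assumption)


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


low, high = [], []
for x, y in zip(heights, weights):
    predicted = INTERCEPT + SLOPE * x
    residual = y - predicted
    (low if predicted <= THRESHOLD else high).append(residual)

rms_low, rms_high = rms(low), rms(high)
print(f"predicted <= {THRESHOLD:.0f} lbs: n = {len(low)}, RMS residual = {rms_low:.1f}")
print(f"predicted >  {THRESHOLD:.0f} lbs: n = {len(high)}, RMS residual = {rms_high:.1f}")
```

A clearly larger RMS residual in the upper group supports the observation that the fit loosens as the values increase.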

Q6: What are the assumptions of regression?

Answer:

Simple Linear Relationship: A simple linear relationship implies a direct association between two variables that can be represented by a straight line on a scatter plot. It follows the equation: Y = mX + b, where Y is the dependent variable (e.g., weight), X is the independent variable (e.g., height), m is the slope of the line (indicating the rate of change of Y with respect to X), and b is the y-intercept (the value of Y when X is zero). This assumption suggests that the relationship between the variables can be adequately captured by a linear equation, such as a regression line.

Correlation between Variables: The correlation coefficient (r) measures the strength and direction of the relationship between two variables. In the context of height and weight, the correlation coefficient ranges from -1 to +1. A positive correlation coefficient suggests a positive linear relationship, indicating that taller individuals tend to have higher weights. A negative correlation coefficient would imply the opposite.

The correlation coefficient can be calculated using the formula:

r = Σ((X - X̄)(Y - Ȳ)) / √(Σ(X - X̄)² · Σ(Y - Ȳ)²)

where X and Y represent the data points and X̄ and Ȳ denote their respective means.

Variables are Independent:
The assumption of independence implies that the observations or data points in the dataset are not influenced by each other. In the case of height and weight, it means that the weight of one individual should not be affected by the height or weight of another individual. Each observation should represent a unique individual, and their heights and weights should not be influenced by external factors or the values of other individuals in the dataset.

Normally Distributed:
The normality assumption applies to the residuals of the regression, which are obtained by subtracting the predicted values from the observed values. The residuals should follow a bell-shaped curve, with most of them concentrated around zero and fewer in the tails. This assumption underlies statistical tests such as the t-test and enables accurate estimation and inference from the regression model.

By considering these assumptions, including a simple linear relationship, correlation between variables, independence of observations, and normal distribution, we can establish the foundation for reliable regression analysis. These assumptions, along with the associated mathematical formulas, help ensure the validity and accuracy of the regression model.

Q7: Predict the value of Y when X = 25 inches?

Answer:

m (slope) = 1.84
b (y-intercept) = 51.902
Standard error = 20.58
y = mx + b ± standard error

  • Y = 51.902 + 1.84x ± 20.58
  • Since the value of x = 25
  • Y = 51.902 + 46 ± 20.58
    Y = 97.902 ± 20.58
    i.e. the value of Y should be between 77.322 and 118.482 lbs
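The same arithmetic as a small Python sketch, using the rounded coefficients and the standard error quoted above; the function name predict_weight is illustrative. Note that 25 inches lies well below the observed heights (53 to 74 inches), so the result is an extrapolation, and the ± standard error band is a rough range rather than a formal prediction interval.

```python
SLOPE = 1.84        # m, from the regression output
INTERCEPT = 51.902  # b, from the regression output
STD_ERROR = 20.58   # standard error of the regression


def predict_weight(height_inches):
    """Predicted weight (lbs) from the fitted line y = mx + b."""
    return INTERCEPT + SLOPE * height_inches


y_hat = predict_weight(25)
lower = y_hat - STD_ERROR
upper = y_hat + STD_ERROR
print(f"Predicted Y = {y_hat:.3f} lbs, range {lower:.3f} to {upper:.3f} lbs")
```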


© all rights reserved
made with @Adarsh