AHMADU BELLO UNIVERSITY, ZARIA, NIGERIA
DISTANCE LEARNING CENTRE
BUAD 837 Quantitative Methods for Managers (Module 2)
ASSIGNMENT QUESTION: Discuss the uses and assumptions of linear regression analysis
BY: GROUP B
• Bariyu Muritala IFENAIKE – ABUMBA02015001976 – 08023466667
• OKOSUN EMMANUEL ODIANOSEN – ABUMBA02015002464 – 07035396267
• OKOYE EMEKA PATRICK – ABUMBA02015006157 – 07038924937
• IJOGUN AYOKUNLE – ABUMBA02015000578 – 07030294989
• Jolaoso Gbenga Victor – ABUMBA02015004070 – 08063270440
Linear regression analysis is the most widely used of all statistical techniques: it is the study of linear, additive relationships between variables. Let Y denote the “dependent” variable whose values you wish to predict, and let X1, …, Xk denote the “independent” variables from which you wish to predict it, with the value of variable Xi in period t (or in row t of the data set) denoted by Xit. Then the equation for computing the predicted value of Yt is:

Ŷt = b0 + b1X1t + b2X2t + … + bkXkt
This formula has the property that the prediction for Y is a straight-line function of each of the X variables, holding the others fixed, and the contributions of different X variables to the predictions are additive. The slopes of their individual straight-line relationships with Y are the constants b1, b2, …, bk, the so-called coefficients of the variables. That is, bi is the change in the predicted value of Y per unit of change in Xi, other things being equal. The additional constant b0, the so-called intercept, is the prediction that the model would make if all the X’s were zero (if that is possible). The coefficients and intercept are estimated by least squares, i.e., setting them equal to the unique values that minimize the sum of squared errors within the sample of data to which the model is fitted. And the model’s prediction errors are typically assumed to be independently and identically normally distributed.
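To make this concrete, the least-squares estimation described above can be sketched in a few lines of Python. The data and the “true” coefficients below are invented for illustration; the point is that minimizing the sum of squared errors recovers them approximately:

```python
import numpy as np

# Made-up data: Y depends linearly on two predictors plus noise.
rng = np.random.default_rng(0)
X1 = rng.uniform(0, 10, 50)
X2 = rng.uniform(0, 5, 50)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(0, 0.5, 50)

# Design matrix with a leading column of ones for the intercept b0.
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares: the unique coefficients minimizing the sum of squared errors.
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(b)  # [b0, b1, b2], close to the true values [2.0, 1.5, -0.8]
```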
THE ASSUMPTIONS OF LINEAR REGRESSION ANALYSIS
Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:
1. Linear relationship
Linear relationship is a statistical term used to describe a straight-line relationship between two variables. A linear relationship can be expressed either in graphical format, where the two variables are connected by a straight line, or in mathematical format, where the independent variable is multiplied by a slope coefficient and added to a constant, which together determine the dependent variable.
2. Multivariate normality
The multivariate normal (MV-N) distribution is a multivariate generalization of the one-dimensional normal distribution. In its simplest form, which is called the “standard” MV-N distribution, it describes the joint distribution of a random vector whose entries are mutually independent univariate normal random variables, all having zero mean and unit variance. In its general form, it describes the joint distribution of a random vector that can be represented as a linear transformation of a standard MV-N vector.
It is a common mistake to think that any set of normal random variables, when considered together, forms a multivariate normal distribution. This is not the case: it is possible to construct random vectors that are not MV-N, but whose individual elements have normal distributions. The latter fact is well known in the theory of copulae (a theory which allows one to specify the distribution of a random vector by first specifying the distributions of its components and then linking the univariate distributions through a function called a copula).
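The construction described above, where a general MV-N vector is a linear transformation of a standard one, can be sketched as follows (the mean vector and covariance matrix are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)

# "Standard" MV-N: a vector of mutually independent N(0, 1) entries.
z = rng.standard_normal((100_000, 2))

# General MV-N: a linear transformation mu + z @ L.T of the standard vector,
# where L is the Cholesky factor of the desired covariance matrix Sigma.
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(sigma)
x = mu + z @ L.T

print(x.mean(axis=0))           # close to mu
print(np.cov(x, rowvar=False))  # close to sigma
```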
3. No or little multicollinearity
Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model. Examples of correlated predictor variables (also called multicollinear predictors) are: a person’s height and weight, age and sales price of a car, or years of education and annual income.
An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the model if at all possible.
It’s more common for multicollinearity to rear its ugly head in observational studies; it’s less common with experimental data. When the condition is present, it can result in unstable and unreliable regression estimates. Several other problems can interfere with analysis of results, including:
• The t-statistic will generally be very small and coefficient confidence intervals will be very wide. This means that it is harder to reject the null hypothesis.
• The partial regression coefficient may be an imprecise estimate; standard errors may be very large.
• Partial regression coefficients may have sign and/or magnitude changes as they pass from sample to sample.
• Multicollinearity makes it difficult to gauge the effect of independent variables on dependent variables.
4. No auto-correlation
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It is the same as calculating the correlation between two different time series, except that the same time series is actually used twice: once in its original form and once lagged one or more time periods.
Autocorrelation can also be referred to as lagged correlation or serial correlation, as it measures the relationship between a variable’s current value and its past values. When computing autocorrelation, the resulting output can range from 1.0 to negative 1.0, in line with the traditional correlation statistic. An autocorrelation of +1.0 represents a perfect positive correlation (an increase seen in one time series leads to a proportionate increase in the other time series). An autocorrelation of negative 1.0, on the other hand, represents perfect negative correlation (an increase seen in one time series results in a proportionate decrease in the other time series). Autocorrelation measures linear relationships; even if the autocorrelation is minuscule, there may still be a nonlinear relationship between a time series and a lagged version of itself.
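Lagged correlation can be computed exactly as described: correlate the series with a shifted copy of itself. Below is a sketch using a simulated first-order autoregressive series (the 0.7 coefficient is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# A made-up AR(1) series: each value is 0.7 times the previous value
# plus fresh noise, so consecutive values are serially correlated.
n = 1000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

def autocorr(series, lag):
    """Correlation between the series and a copy lagged by `lag` periods."""
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

print(autocorr(y, 1))  # close to 0.7 for this process
print(autocorr(y, 2))  # close to 0.7 ** 2 = 0.49
```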
5. Homoscedasticity
This assumption means that the variance around the regression line is the same for all values of the predictor variable (X). A plot of the residuals against X can reveal a violation: if the points lie very near the regression line for the lower values on the X-axis but show much more variability around it for the higher values, the assumption does not hold.
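One simple numerical check for constant variance (the data below are invented, with noise deliberately made to grow with X so the assumption is violated):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up heteroscedastic data: the noise standard deviation grows with X.
x = np.linspace(1, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.3 * x)

# Fit a straight line and examine the residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Compare the residual spread at low vs high X; a large gap indicates
# that the constant-variance assumption is violated.
spread_low = resid[x < 5].std()
spread_high = resid[x >= 5].std()
print(spread_low, spread_high)  # the second value is noticeably larger
```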
WHAT IS REGRESSION ANALYSIS?
Regression analysis is a statistical technique used to find the relations between two or more variables. In regression analysis one variable is independent and its impact on the other, dependent, variables is measured. When there is only one dependent and one independent variable we call it simple regression. On the other hand, when there are many independent variables influencing one dependent variable we call it multiple regression.
Let’s understand it with a simple example. Suppose you run a lemonade business. A simple linear regression might find the relationship between revenue and temperature, with revenue as the dependent variable. In the case of multiple regression, you can find the relationship between temperature, pricing and the number of workers on the one hand and revenue on the other. Thus, regression analysis can analyze the impact of varied factors on business sales and profits. Here are some applications of regression analysis in business:
HOW BUSINESSES USE REGRESSION ANALYSIS STATISTICS
Regression analysis is a statistical tool used for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another: for example, the effect of a price increase upon demand, or the effect of changes in the money supply upon the inflation rate.
Regression analysis is used to estimate the strength and the direction of the relationship between two linearly related variables: X and Y. X is the “independent” variable and Y is the “dependent” variable.
The two basic types of regression analysis are:
• Simple regression analysis:
This is used to estimate the relationship between a dependent variable and a single independent variable; for example, the relationship between crop yields and rainfall.
• Multiple regression analysis:
It is used to estimate the relationship between a dependent variable and two or more independent variables; for example, the relationship between the salaries of employees and their experience and education.
Multiple regression analysis introduces several additional complexities but may produce more realistic results than simple regression analysis.
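As an illustration of the simple case, the crop-yield example above can be sketched with invented data (all figures, including the 0.004 tonnes-per-mm slope, are made up for demonstration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data for the crop-yield example: yield rises with rainfall.
rainfall = rng.uniform(200, 800, 60)                          # mm per season
crop_yield = 0.5 + 0.004 * rainfall + rng.normal(0, 0.2, 60)  # tonnes/ha

# Simple regression: one dependent and one independent variable.
slope, intercept = np.polyfit(rainfall, crop_yield, 1)
print(slope, intercept)  # slope near 0.004, intercept near 0.5
```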
Regression analysis is based on several strong assumptions about the variables that are being estimated. Several key tests are used to ensure that the results are valid, including hypothesis tests. These tests are used to ensure that the regression results are not simply due to random chance but indicate an actual relationship between two or more variables.
An estimated regression equation may be used for a wide variety of business applications, such as:
• Measuring the impact of an increase in prices on a corporation’s profits
• Understanding how sensitive a corporation’s sales are to changes in advertising expenditures
• Seeing how a stock price is affected by changes in interest rates
Regression analysis may also be used for forecasting purposes; for example, a regression equation may be used to forecast the future demand for a company’s products.
Because the underlying computations are tedious to perform by hand, regression analysis is usually carried out with specialized calculators, spreadsheet programs or statistical software.
MAJOR ASSUMPTIONS OF LINEAR REGRESSION
There are seven major assumptions of linear regression; these are:
• The relationship between all X’s and Y is linear. Violation of this assumption leads to changes in regression coefficient (B and beta) estimation.
• All necessary independent variables are included in the regression that are specified by existing theory and/or research. This assumption serves the purpose of saving us from making large claims when we simply have failed to account for (a) better predictor(s) which may share variance with other predictors. Obviously, the regression interpretation becomes extremely difficult and time-consuming once too many variables are added to the regression. We must balance this against accurate representation of population-level phenomena in our model, so usually we would like to test more than one predictor if possible.
• The reliability of each of our measures is 1.0. This assumption is almost always violated, particularly in psychology and other fields that generate their own scales, including measures of self-report. Typically, a reasonable cutoff is considered a reliability of .7. Violations will typically bias the regression coefficients in the model downward (make them smaller).
• There is constant variance across the range of residuals for each X (this is sometimes referred to as homoscedasticity, whereas violations are termed heteroscedastic).
• Residuals are independent of one another; they cannot be associated for any subgroup of observations. This assumption is generally met when using random sampling in tandem with cross-sectional designs. It refers to dependencies in the error structure of the model: there can be no data dependencies, such as those introduced by nested designs. Multilevel modeling is a more general form of regression that is able to handle dependencies in model error structures.
• Residuals are normally distributed.
• There is no multicollinearity (a very high correlation) between X predictors in the model. Violation of this assumption reduces the amount of unique variance in X that can explain variance in Y and makes interpretation difficult.
JUSTIFICATION FOR REGRESSION ASSUMPTIONS
Why should we assume that relationships between variables are linear?
1. Because linear relationships are the simplest non-trivial relationships that can be imagined (hence the easiest to work with).
2. Because the “true” relationships between our variables are often at least approximately linear over the range of values that are of interest to us.
3. Even if they’re not, we can often transform the variables in such a way as to linearize the relationships.
This is a strong assumption, and the first step in regression modeling should be to look at scatterplots of the variables (and in the case of time series data, plots of the variables vs. time), to make sure it is reasonable a priori. And after fitting a model, plots of the errors should be studied to see if there are unexplained nonlinear patterns. This is especially important when the goal is to make predictions for scenarios outside the range of the historical data, where departures from perfect linearity are likely to have the biggest effect. If we see evidence of nonlinear relationships, it is possible (though not guaranteed) that transformations of variables will straighten them out in a way that will yield useful inferences and predictions via linear regression.
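The transformation idea in point 3 can be sketched with a standard example: exponential growth is nonlinear in the original units, but taking logarithms linearizes it so ordinary straight-line fitting applies (the growth parameters below are invented):

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up exponential growth: y = a * exp(b * x), nonlinear in x,
# with multiplicative noise.
x = np.linspace(0, 5, 100)
y = 2.0 * np.exp(0.8 * x) * rng.lognormal(0, 0.05, 100)

# Taking logs linearizes it: log(y) = log(a) + b * x is a straight line,
# so ordinary linear regression applies to the transformed data.
b, log_a = np.polyfit(x, np.log(y), 1)
print(b, np.exp(log_a))  # near the true values 0.8 and 2.0
```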
And why should we assume the errors of linear models are independently and identically normally distributed?
1. This assumption is often justified by appeal to the Central Limit Theorem of statistics, which states that the sum or average of a sufficiently large number of independent random variables–whatever their individual distributions–approaches a normal distribution. Much data in business and economics and engineering and the natural sciences is obtained by adding or averaging numerical measurements performed on many different persons or products or locations or time intervals. Insofar as the activities that generate the measurements may occur somewhat randomly and somewhat independently, we might expect the variations in the totals or averages to be somewhat normally distributed.
2. It is (again) mathematically convenient: it implies that the optimal coefficient estimates for a linear model are those that minimize the mean squared error (which are easily calculated), and it justifies the use of a host of statistical tests based on the normal family of distributions. (This family includes the t distribution, the F distribution, and the Chi-square distribution.)
3. Even if the “true” error process is not normal in terms of the original units of the data, it may be possible to transform the data so that your model’s prediction errors are approximately normal.
But here too caution must be exercised. Even if the unexplained variations in the dependent variable are approximately normally distributed, it is not guaranteed that they will also be identically normally distributed for all values of the independent variables. Perhaps the unexplained variations are larger under some conditions than others, a condition known as “heteroscedasticity”. For example, if the dependent variable consists of daily or monthly total sales, there are probably significant day-of-week patterns or seasonal patterns. In such cases the variance of the total will be larger on days or in seasons with greater business activity–another consequence of the central limit theorem.
CORRELATION AND SIMPLE REGRESSION FORMULAS
A variable is, by definition, a quantity that may vary from one measurement to another in situations where different samples are taken from a population or observations are made at different points in time. In fitting statistical models in which some variables are used to predict others, what we hope to find is that the different variables do not vary independently (in a statistical sense), but that they tend to vary together.
In particular, when fitting linear models, we hope to find that one variable (say, Y) is varying as a straight-line function of another variable (say, X). In other words, if all other possibly-relevant variables could be held fixed, we would hope to find the graph of Y versus X to be a straight line (apart from the inevitable random errors or “noise”).
A measure of the absolute amount of variability in a variable is (naturally) its variance, which is defined as its average squared deviation from its own mean. Equivalently, we can measure variability in terms of the standard deviation, which is defined as the square root of the variance. The standard deviation has the advantage that it is measured in the same units as the original variable, rather than squared units.
Our task in predicting Y might be described as that of explaining some or all of its variance–i.e., why, or under what conditions, does it deviate from its mean? Why is it not constant? That is, we would like to be able to improve on the naive predictive model: Ŷt = CONSTANT, in which the best value for the constant is presumably the historical mean of Y. More precisely, we hope to find a model whose prediction errors are smaller, in a mean square sense, than the deviations of the original variable from its mean.
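The comparison between the naive constant model and a fitted line can be sketched directly (the data are invented; the point is the gap between the two mean squared errors):

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up data in which Y varies with X plus noise.
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 200)

# Naive model: predict the historical mean of Y for every observation.
# Its mean squared error equals the variance of Y.
mse_naive = np.mean((y - y.mean()) ** 2)

# Linear model: predict Y from X; its errors should be smaller
# in the mean square sense.
slope, intercept = np.polyfit(x, y, 1)
mse_fit = np.mean((y - (intercept + slope * x)) ** 2)

print(mse_naive, mse_fit)  # the fitted model's MSE is far smaller
```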
USES OF REGRESSION ANALYSIS IN BUSINESS
The following are some of the uses of regression analysis as a statistical tool in business:
1. Predictive Analytics:
Predictive analytics, i.e. forecasting future opportunities and risks, is the most prominent application of regression analysis in business. Demand analysis, for instance, predicts the number of items which a consumer will probably purchase. However, demand is not the only dependent variable when it comes to business. Regression analysis can go far beyond forecasting impact on direct revenue. For example, we can forecast the number of shoppers who will pass in front of a particular billboard and use that data to estimate the maximum to bid for an advertisement. Insurance companies rely heavily on regression analysis to estimate the credit standing of policyholders and the likely number of claims in a given time period.
2. Operation Efficiency:
Regression models can also be used to optimize business processes. A factory manager, for example, can create a statistical model to understand the impact of oven temperature on the shelf life of the cookies baked in those ovens. In a call center, we can analyze the relationship between callers’ wait times and the number of complaints. Data-driven decision making eliminates guesswork and corporate politics from decision making. This improves business performance by highlighting the areas that have the maximum impact on operational efficiency and revenues.
3. Supporting Decisions:
Businesses today are overloaded with data on finances, operations and customer purchases. Increasingly, executives lean on data analytics to make informed business decisions, reducing reliance on intuition and gut feel. Regression analysis can bring a scientific angle to the management of any business. By reducing a tremendous amount of raw data to actionable information, regression analysis leads the way to smarter and more accurate decisions. This does not mean that regression analysis is a substitute for managers’ creative thinking; rather, it acts as a perfect tool to test a hypothesis before diving into execution.
4. Correcting Errors:
Regression is not only great for lending empirical support to management decisions but also for identifying errors in judgment. For example, a retail store manager may believe that extending shopping hours will greatly increase sales. Regression analysis, however, may indicate that the increase in revenue might not be sufficient to support the rise in operating expenses due to longer working hours (such as additional employee labor charges). Hence, regression analysis can provide quantitative support for decisions and prevent mistakes based on a manager’s intuition alone.
5. New Insights:
Over time, businesses gather large volumes of unorganized data that have the potential to yield valuable insights. However, this data is useless without proper analysis. Regression analysis techniques can find a relationship between different variables by uncovering patterns that previously went unnoticed. For example, analysis of data from point-of-sale systems and purchase accounts may highlight market patterns such as increased demand on certain days of the week or at certain times of the year. By acting on these insights, you can maintain optimal stock and staffing levels before a spike in demand arises.
In summary, regression analysis is a statistical tool for the investigation of relationships between variables; investigators use it to ascertain the causal effect of one variable upon another. It estimates the strength and direction of the relationship between two linearly related variables: X, the “independent” variable, and Y, the “dependent” variable. Its use rests on several strong assumptions about the variables being estimated, and several key tests, including hypothesis tests, are used to ensure that the results are valid and not simply due to random chance, but instead indicate an actual relationship between two or more variables. Regression analysis may also be used for forecasting; for example, a regression equation may be used to forecast the future demand for a company’s products and thereby improve productivity and profitability.