By: Aurrel Bhatia

Avatar. Titanic. Jaws. These are just a few of the hundreds of Hollywood movies produced with multi-million dollar budgets. Some of these movies do extremely well at the box office, while others fail completely. Spending that much money on a movie is a gamble: directors and producers cannot be sure that the revenue will even recover the original cost, much less earn a large profit. In this research paper, I attempt to address this problem by creating a mathematical model based on linear regression.

Single-variable and multivariate linear regression are now used extensively in machine learning and artificial intelligence, but the underlying ideas were first developed by the French physicist Auguste Bravais in 1846, who developed what is now known as the correlation coefficient. Sir Francis Galton independently rediscovered and refined these concepts in 1888 and demonstrated their application to the study of anthropology, heredity, and psychology. Galton later conducted a study on the probability of the extinction of surnames, which led to the Galton-Watson stochastic process, a model that is now a quintessential part of modern statistics and regression. Using these principles, Galton invented the use of the regression line (Works Cited #3).

Galton’s concept of the regression line and of regression towards the mean came from his ideas in genetics during the late 19th century. Galton noticed that certain characteristics, such as height, are not always completely passed on from parents to children. Instead, the characteristics of the offspring tend to return toward the mean (or the average). Galton collected height data for hundreds of people, then measured the regression to the mean and estimated the size of the effect.

Figure 1: Francis Galton’s 1875 diagram of the correlation between the heights of adults and their parents.
The observations from this illustration suggested the concept of “regression toward the mean”, giving regression its name (Works Cited #8). Courtesy of: https://en.wikipedia.org/wiki/Linear_regression#/media/File:Galton%27s_correlation_diagram_1875.jpg

To better understand these techniques, it is also necessary to understand the basic principles of linear regression. Linear regression attempts to model the relationship between two variables by fitting a linear model to the observed data. One variable is considered to be the explanatory, or independent, variable, and the other is considered to be the scalar response, or dependent variable (Works Cited #12). In linear regression, the relationship between the variables is modeled using linear predictor functions. These types of models are called linear models (Works Cited #8). The unknown parameters of the model are estimated from the data.

According to pages 1-2 of the book Linear Regression Analysis: Theory and Computing (2009), linear regression is one of the oldest topics in statistics, dating back about two hundred years (Works Cited #1). Linear regression was also the first type of regression to be studied extensively and used in many practical applications, because models that depend linearly on their unknown parameters are easier to fit and the statistical properties of the resulting estimates are easier to calculate.

There are countless ways to use linear regression models in real life, but the practical applications fall into three main categories. The first is identifying the strength of the effect that the explanatory variable has on the dependent variable. Examples of such studies include the strength of the relationship between age and income, amount of water and plant growth, or customer satisfaction and loyalty. The second category is forecasting the effects or impacts of changes.
It helps show how much the dependent variable will change with a change in the independent variable. Some examples of real-life problems using this method are “With X dollars spent on marketing, the sales should be Y” or “With X cigarettes smoked per day, the life expectancy is Y years.”

Finally, the third main category is predicting trends or future values. This method uses regression analysis to estimate future values by studying the existing data. Some real-life examples using this method are “what will be the price of diamonds 6 months from now” or “by how many years does the life expectancy decrease for every additional pound overweight?” (Works Cited #16).

This project applies the third method. The process begins by gathering the data that pertains to the topic. This data was found using StatCrunch (Works Cited #5). Next, it is important to establish all possible independent and dependent variables from the data set. There are three possible independent variables (release day, release year, and budget) and one dependent variable, the worldwide gross, or revenue. Next, the correlation between each independent variable and the dependent variable must be calculated. This can be done using the correlation coefficient.

Figure 2: Graphs demonstrating sample correlation coefficients of 1, 0 and -1. Courtesy of: http://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/

As shown by the graphs above, the correlation coefficient is a value between -1 and +1 that expresses how strongly two variables are linearly related to each other. A correlation coefficient of +1 indicates a perfect positive correlation, meaning that as one variable increases, the other also increases.
On the other hand, a correlation coefficient of -1 means that there is a perfect negative relationship: as one variable increases, the other variable decreases (Works Cited #14). If the correlation coefficient is 0, there is no linear relationship between the variables.

Figure 3: The formula for the Pearson Correlation Coefficient. Courtesy of: http://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/

In the above formula, r is the correlation coefficient, n is the sample size (100 in this case), and the Greek letter sigma (Σ) is another way of saying “sum of.” To find the correlation coefficient, five preliminary calculations on the data set must be made (Works Cited #15). The first is the sum of all of the x values. The second is the sum of all of the y values. The third is the sum of the products of each pair of x and y values. The fourth is the sum of the x² values. The fifth is the sum of the y² values. Once these calculations are made, it is fairly simple to substitute the values into the above formula and find the correlation coefficient.

Since the data set was so large, and most of the values were in the millions, working it out by hand would be extremely time consuming and impractical, so a computer program generated the following values. The correlation coefficient between the worldwide gross and the release day was -0.05111300801. This number is very close to 0, so there is at most an extremely weak linear relationship between these two variables. The correlation coefficient between the worldwide gross and the release year was 0.06084186514. This is slightly larger in magnitude than the previous value but is still close to 0, so there is only a very weak linear relationship between these two variables. Finally, the correlation coefficient between the worldwide gross and the budget was 0.5659294675.
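The five preliminary sums described above can be sketched in a few lines of Python. This is only an illustration, assuming Figure 3 shows the usual computational form of the Pearson formula; the function name `pearson_r` and the small sample lists are hypothetical and are not the StatCrunch data or the program used in this project.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from the five preliminary sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)                               # 1. sum of the x values
    sum_y = sum(ys)                               # 2. sum of the y values
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # 3. sum of the x*y products
    sum_x2 = sum(x * x for x in xs)               # 4. sum of the squared x values
    sum_y2 = sum(y * y for y in ys)               # 5. sum of the squared y values

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator

# Made-up illustration data (not the movie data set).
sample_x = [1, 2, 3, 4, 5]
sample_y = [2, 4, 6, 8, 10]   # rises in step with sample_x
print(pearson_r(sample_x, sample_y))
```

With real data, `xs` would hold one candidate independent variable (for example the budgets) and `ys` the worldwide gross, and the call would be repeated once per candidate variable.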
Although a correlation of about 0.57 is far from a perfect relationship, the movie budget will be used because it has the highest correlation with the worldwide gross of the three candidates. Therefore, in this research topic, the independent variable is the movie budget and the dependent variable is the worldwide gross.

Figure 4: Scatter plots of the worldwide gross against each independent variable, made using Excel. Data courtesy of: https://www.statcrunch.com/app/index.php?dataid=2188684

After choosing and plotting the data, the next step is finding the linear equation that best describes the data set. This linear equation is also known as a regression line. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable, Y is the dependent variable, b is the slope of the line, and a is the y-intercept (the value of Y when X is equal to 0) (Works Cited #12). Infinitely many lines can be drawn through the data points in this two-dimensional space, but the best-fitting one is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. This is also known as minimizing the mean square error (MSE).

Figure 5: Formula for the mean square error. Courtesy of: https://www.researchgate.net/figure/221515860_fig1_Figure-1-Mean-Squared-Error-formula-used-to-evaluate-the-user-model

In the equation above, the variable “n” represents the number of data points, f(i) represents the value returned by the model, or the estimated value, and y(i) represents the actual value for data point i. In this case, the best-fitting regression line is found by trial and error. There are other ways to calculate the equation, but this method worked best for a project like this, where the numbers are very large and the data might be too complex to work out by hand.
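As a rough illustration of the mean square error and of comparing candidate lines by trial and error, the sketch below scores a few hand-picked (a, b) pairs for the line Y = a + bX and keeps the one with the lowest MSE. The function names, the candidate lines, and the tiny data set are all hypothetical; this is not the program used in this project, only a minimal example of the idea.

```python
def predict(a, b, x):
    """Value f(i) returned by the line Y = a + bX for one x."""
    return a + b * x

def mean_squared_error(a, b, xs, ys):
    """Average of the squared differences between predicted and actual values."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")
    return sum((predict(a, b, x) - y) ** 2 for x, y in zip(xs, ys)) / n

# Made-up illustration data lying exactly on Y = 1 + 2X (not the movie data).
sample_x = [1.0, 2.0, 3.0, 4.0]
sample_y = [3.0, 5.0, 7.0, 9.0]

# Hypothetical candidate lines, each given as (a, b).
candidates = [(0.0, 2.0), (1.0, 2.0), (1.0, 3.0)]

# Trial and error: keep the candidate with the smallest MSE.
best = min(candidates, key=lambda ab: mean_squared_error(ab[0], ab[1], sample_x, sample_y))
print(best, mean_squared_error(best[0], best[1], sample_x, sample_y))
```

A lower MSE means the line sits closer to the data points on average, which is why it serves as the score for comparing candidate lines.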
The computer forms the original regression line equation by taking the first two points in the data set and using them to create a line. This is known as the first iteration. The original equation is then used to evaluate existing points in the data set that have not already been used to create or modify the line. The value f(i) is what the equation predicts, and y(i), the existing value in the data set, is subtracted from it. The difference is then squared, and the squared differences are summed and divided by the total number of data points. The result is known as the mean square error. The number of iterations is equal to the total number of observations in the data set. Each time, new values are generated, causing a (the y-intercept) and b (the slope) to adapt accordingly. There are usually several iterations, and the end goal is to reach the lowest MSE possible, that is, to find the minimum value. The process of going through different trial