Dr. Larry Genalo
02-24-2013
Prepared by: Walter Bennette
Statistics: quantities calculated from sample observations (engineers typically deal with samples rather than the entire population)
Examples:
Measures of Average (Central Tendency):
Mode: The most common value in a sample
Median: “Middle” value in the sample
Mean: Arithmetic average = \( \bar X = \frac{\sum_{i=1}^N X_i}N \)
Measures of Variation (scatter, dispersal) about the mean:
\[ Variance = S^2 = \frac{\sum_{i = 1}^N ( X_i - \bar X )^2 }{N - 1} \]
\[ Standard \ Deviation = S = \sqrt{S^2} \]
Note: this equation for variance (dividing by N-1 rather than N) is used when the data are a sample of a population; the N-1 gives an unbiased estimate of the population variance
An alternative equation to find the standard deviation is the following:
\[ Standard \ Deviation = S = \sqrt{\frac{N \sum (X_i^2) - (\sum X_i)^2}{N(N-1)}} \]
| i | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| \( X_i \) | 92 | 63 | 74 | 63 | 78 | 78 |
These calculations are easier if the values are first sorted into ascending order (reindexed as \( i' \)):
| i' | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| \( X_{i'} \) | 63 | 63 | 74 | 78 | 78 | 92 |
\( Mode = 63 \ and \ 78 \) (both values occur twice, so the sample is bimodal)
\( Median = \frac{74+78}{2} = 76 \)
\( Mean=\bar X = \frac{\sum_{i=1}^N X_i}N \)
\( \bar X = \frac {( 63 + 63 + 74 + 78 +78 +92 )}6 = 74.67 \)
\( Standard \ Deviation = S = \sqrt{\frac{N \sum (X_i^2) - (\sum X_i)^2}{N(N-1)}} \)
\( S = \sqrt{\frac{6 (63^2 + ... + 92^2) - (63 + ... + 92)^2}{6(5)}} = 10.91 \)
\( Variance = S^2 \)
\( S^2 = (10.911767)^2 = 119.1 \)
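As a check on the hand calculations above, here is a minimal Python sketch using only the standard library (`statistics.multimode` requires Python 3.8+); the variable names are ours, not from the original notes.

```python
import statistics

X = [63, 63, 74, 78, 78, 92]
N = len(X)

mean = sum(X) / N                    # 448 / 6 = 74.67
median = statistics.median(X)        # (74 + 78) / 2 = 76
modes = statistics.multimode(X)      # [63, 78]

# Sample variance and standard deviation (the N-1 forms above)
variance = sum((x - mean) ** 2 for x in X) / (N - 1)   # ~119.07, rounds to 119.1
std_dev = variance ** 0.5                              # ~10.91

# The alternative computational form of S gives the same result
alt_std = ((N * sum(x * x for x in X) - sum(X) ** 2) / (N * (N - 1))) ** 0.5

print(mean, median, modes, variance, std_dev, alt_std)
```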
Example: a set of \( N = 48 \) exam grades:

| i | 1 | 2 | 3 | … | 48 |
|---|---|---|---|---|---|
| \( X_i \) | 85 | 93 | 72 | … | 5 |
Grouping the 48 grades into classes gives a frequency of occurrence for each grade class:

| Freq. of Occur. | Grade Class |
|---|---|
| 8 | 90-100 |
| 13 | 80-89 |
| 15 | 70-79 |
| 8 | 60-69 |
| 4 | 0-59 |
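A short sketch of how such a frequency table could be built. The full list of 48 grades is not reproduced in these notes, so the `grades` list below is a hypothetical stand-in for illustration only.

```python
# Hypothetical stand-in data; the original 48 grades are not listed here.
grades = [85, 93, 72, 61, 45, 78, 90, 67]

# Grade classes from the table above, as (low, high) bounds
classes = [(90, 100), (80, 89), (70, 79), (60, 69), (0, 59)]

for low, high in classes:
    freq = sum(1 for g in grades if low <= g <= high)
    print(f"{freq:3d} | {low}-{high}")
```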
Smoothing this frequency distribution can lead to a continuous distribution such as the Normal Distribution, a theoretical “bell-shaped” curve. It is used when grading “on a curve” and also applies to many natural phenomena.
Given a set of data points, “model” them with a curve that approximates their behavior.
We will look only at a straight line fit: Linear Regression by the method of Least Squares.
The problem:
We know that a straight line has the form \( Y=mX+b \)
What values of \( m \) and \( b \) are the “best”?
Picture vertical lines drawn from each data point to the fitted line (shown red in the original figure); each indicates how far the fitted line “misses” that data point.
Now calculate the sum of the squares of all the “misses”:
\[ Sum = \sum_{i=1}^N \left( Y_i - (mX_i + b) \right)^2 = (Y_1 - (mX_1 + b))^2 + (Y_2 - (mX_2 + b))^2 + \cdots + (Y_N - (mX_N + b))^2 \]
Note: All \( X_i \) and \( Y_i \) are known.
This becomes a calculus problem: select the \( m \) and \( b \) that minimize the sum.
Take the partial derivatives of the sum with respect to \( m \) and \( b \), set them equal to zero, and solve:
\[ \bbox[5px, border:2px solid black]{m = \frac{N\sum_{i=1}^N X_i Y_i - \sum_{i=1}^N X_i \sum_{i=1}^N Y_i}{N \sum_{i=1}^N (X_i^2)-(\sum_{i=1}^N X_i)^2}} \]
\[ \bbox[5px, border:2px solid black]{b=\frac{\sum_{i=1}^N Y_i - m\sum_{i=1}^N X_i}{N}} \]
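A minimal Python sketch of the boxed formulas; `least_squares` is our own helper name, not anything from the notes.

```python
def least_squares(xs, ys):
    """Return the slope m and intercept b of the best-fit line Y = m*X + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # boxed formula for m
    b = (sy - m * sx) / n                           # boxed formula for b
    return m, b

# Points that lie exactly on Y = 2X + 1
m, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)   # 2.0 1.0
```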
Given ANY set of data points, we can find \( m \) and \( b \). Is it always a good idea to do this? To judge how well the line actually fits the data, compute the correlation coefficient \( R \):
\[ R= \frac{\sum_{i=1}^N \left [(X_i- \bar X)(Y_i- \bar Y) \right ]}{\left [\sum_{i=1}^{N}(X_i- \bar X)^2 \sum_{i=1}^{N}(Y_i- \bar Y)^2 \right ]^ { \frac{1}{2} }} \]
\( -1 \le R \le 1 \)
Note: an equivalent computational form of \( R \) is:
\[ \bbox[5px, border:2px solid black]{ R = \frac{N \left (\sum{X_i Y_i} \right ) - \left (\sum X_i \right) \left (\sum Y_i \right)}{\sqrt {N \left (\sum X_i^2 \right )- \left (\sum X_i \right )^2} \sqrt {N \left (\sum Y_i^2 \right )- \left (\sum Y_i \right)^2}} } \]
Anytime you do a linear regression (or other type of curve fitting), you should check the correlation coefficient (or other goodness-of-fit parameters).
Is \( R=0.9 \) good enough? How about \( R=0.8 \)? \( 0.7 \)?
Deciding if R is “good enough” is an interpretive skill developed through experience.
If \( R=0.8 \), then 64% of the variance in the dependent variable is explained by the variance in the independent variable, since \( R^2 = 0.64 \).
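A sketch of the boxed computational form of \( R \), reusing the same sums as the least-squares fit; again, the function name is our own.

```python
def correlation(xs, ys):
    """Return the correlation coefficient R for paired data."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return num / den

R = correlation([1, 2, 3, 4], [3, 5, 7, 9])
print(R, R ** 2)   # 1.0 1.0 -- a perfect straight-line fit
```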
What if “patterned” data do not follow a straight line? There are two options: model with a curved line, or transform the model to a straight line.
For a power-law model:
\( V=bT^m \quad \Rightarrow \quad \log V = \log b + m \log T \)
i.e., let \( Y = \log V \) and \( X = \log T \).
Now do a linear regression on the variables \( \log T \) and \( \log V \). Note that the fitted intercept is \( \log b \) (NOT \( b \)); recover \( b = 10^{\log b} \).
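A sketch of the power-law fit in Python, repeating the same hypothetical `least_squares` helper so the block stands alone. The example data are generated from a known model so the recovered constants can be checked.

```python
import math

def least_squares(xs, ys):
    """Slope and intercept of the best-fit line (see the sketch above)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sum(x * x for x in xs) - sx ** 2)
    return m, (sy - m * sx) / n

# Data generated from V = 3 * T^2, so we expect m = 2 and b = 3
T = [1, 2, 3, 4]
V = [3 * t ** 2 for t in T]

m, log_b = least_squares([math.log10(t) for t in T],
                         [math.log10(v) for v in V])
b = 10 ** log_b   # undo the log to recover b itself
print(m, b)       # ~2.0 ~3.0
```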
For an exponential model, natural logs must be used so that \( \ln e^{mt} = mt \):
\( V=be^{mt} \quad \Rightarrow \quad \ln V = \ln b + m t \)
i.e., let \( Y = \ln V \) and \( X = t \).
Now do a linear regression on the variables \( t \) and \( \ln V \). Note that the fitted intercept is \( \ln b \) (NOT \( b \)); recover \( b = e^{\ln b} \).
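The same idea as a Python sketch for the exponential model, again with self-generated check data and our own helper name.

```python
import math

def least_squares(xs, ys):
    """Slope and intercept of the best-fit line (see the sketch above)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sum(x * x for x in xs) - sx ** 2)
    return m, (sy - m * sx) / n

# Data generated from V = 2 * e^(0.5 * t), so we expect m = 0.5 and b = 2
t = [0, 1, 2, 3]
V = [2 * math.exp(0.5 * ti) for ti in t]

m, ln_b = least_squares(t, [math.log(v) for v in V])
b = math.exp(ln_b)   # exponentiate the intercept to recover b
print(m, b)          # ~0.5 ~2.0
```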