A judicious man looks on statistics not to get knowledge, but to save himself from having ignorance foisted on him.
Thomas Carlyle

Schedule for Math 10: Introductory Statistics

We will reorder the book somewhat, and I am adding a bit of supplemental material.

Jump to links for class on: April 8  ◊  April 12  ◊  April 14  ◊  April 16  ◊  April 23  ◊  May 5  ◊  May 19/20

Part IV: Probability

The Multiplication Rule for counting (see also Chp 14 Sec 1): Suppose we do an experiment that has parts A and B. If there are n outcomes to part A, and for each of those outcomes there are m outcomes to part B, then the total number of outcomes for the experiment is nm. This extends to more than two parts, as well.

Example: If our experiment is to roll a die and then toss a coin, then there are 6•2 = 12 outcomes to the experiment.
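If you'd like to check the multiplication rule by brute force (Python, not part of the course software), you can simply list the outcomes:

```python
from itertools import product

# One outcome of the experiment = (die face, coin face).
dice = [1, 2, 3, 4, 5, 6]
coin = ["H", "T"]

outcomes = list(product(dice, coin))  # every (roll, toss) pair
print(len(outcomes))  # 6 * 2 = 12
```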

Chapter 13 resources:

Illustration of a full deck of 52 cards (scroll down), for those unfamiliar with them.

Image of dice in case anyone needs it. Our dice will always be six-sided, bearing the values 1 through 6.

»   Quiz 1 (Apr 2) will include a question about the course webpage. It will also cover all of Chapter 13, including supplemental exercises.       Answers (pdf)

This definition and the following rules extend Sec 2. Given two sets A and B, the union A∪B is the set of all objects that belong to at least one of the sets, and the intersection A∩B is the set of all objects that belong to both sets.

Inclusion-Exclusion Principle for counting: the size of A∪B is the size of A plus the size of B minus the size of A∩B.

In probability, we may refer to the event A∪B as the event "A or B" and to A∩B as the event "A and B" or "A&B".
Inclusion-Exclusion Principle for probability: The probability of the event "A or B" is the probability of A plus the probability of B minus the probability of "A and B".

Example: 1. If we have some objects that are all either blue or made of glass, and there are 20 blue items, 30 glass items, and 10 items that are both blue and made of glass, there must be 20+30-10 = 40 total items.
2. (highly unrealistic) If we know the probability of rain tomorrow is 40%, the probability of high winds is 50%, and the probability of getting at least one of those things is 90%, then we know the probability of getting both is 0%: 90 = 40+50-(chance of both), so the chance of both is 0. I.e., we have shown the two events are mutually exclusive.
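A quick Python check of the counting version, using small hypothetical sets (the item labels are made up):

```python
# Hypothetical small universe illustrating inclusion-exclusion.
blue  = {1, 2, 3, 4}        # 4 blue items
glass = {3, 4, 5, 6, 7}     # 5 glass items; items 3 and 4 are both

# |A or B| = |A| + |B| - |A and B|
union_size = len(blue) + len(glass) - len(blue & glass)
assert union_size == len(blue | glass)  # matches counting the union directly
print(union_size)  # 7
```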

This material will not be tested; it is just there to help you understand and remember the formula for combinations.

We know by the multiplication principle that the number of ways to order n objects is n!, or n•(n-1)•(n-2)•...•2•1. If we want to order only k of those n items, we stop multiplying after k terms, since we don't care about ordering the "left over" objects. The terms we use start with n, so we can think of subtracting values from n that start with 0 and go through k-1 (since starting at 0, that's the kth value): n•(n-1)•...•(n-(k-2))•(n-(k-1)). The values from n-k and down are omitted. Factorial notation gives us a compact way to write the truncated product: n!/(n-k)!. As an aside, this counts the number of permutations of k objects out of n objects.

The formula for n choose k is n!/((n-k)!k!), and we've accounted for the terms involving n. The k! on the bottom of the fraction is because we've ordered the k objects we chose out of the set of n, and we only want to know how many ways to pick them, without being concerned for order. In the set being counted by n!/(n-k)!, every selection of k items appears in all its possible orderings. That is, every selection of k items appears k! times, so we must divide out by k! to get rid of the overcount.

As an aside, note you can think of the n!/(n-k)! portion itself as dividing out to eliminate overcount: if we order all n objects, but only care about selecting and ordering k of them, then n! overcounts by a factor of the number of ways to order the remaining n-k objects, or (n-k)!.
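A quick Python check of the two formulas above (the values of n and k are chosen arbitrarily):

```python
from math import factorial, comb

n, k = 10, 3
permutations = factorial(n) // factorial(n - k)  # order matters: n!/(n-k)!
combinations = permutations // factorial(k)      # divide out the k! orderings

assert combinations == comb(n, k)  # agrees with Python's built-in n choose k
print(permutations, combinations)  # 720 120
```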

»   Quiz 2 (Apr 9) will cover all of Chapters 14 and 15, including supplemental exercises, and will include a question about factorial by itself.       Answers (pdf)

You are now equipped to look at a lot of probability problems. For instance, Bradley Efron has shown that if you make 4 dice with particular numbers on their faces, they behave in an odd way (I read about this in Innumeracy, by John Allen Paulos, and here is an article about these sorts of dice):
Die A has 4 on four faces and 0 on two faces
Die B has 3 on all six faces
Die C has 2 on four faces and 6 on two faces
Die D has 5 on three faces and 1 on three faces.
If you roll them against each other, you find that A beats B 2/3 of the time, B beats C 2/3 of the time, C beats D 2/3 of the time, and amazingly D beats A 2/3 of the time - they loop! You can give the probability calculations that show this theoretically.
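You don't have to take my word for it: here is a Python sketch that gets the exact probabilities by enumerating all 36 face pairs for each matchup.

```python
from itertools import product
from fractions import Fraction

# Efron's dice, as listed above.
A = [4, 4, 4, 4, 0, 0]
B = [3, 3, 3, 3, 3, 3]
C = [2, 2, 2, 2, 6, 6]
D = [5, 5, 5, 1, 1, 1]

def p_beats(x, y):
    """Exact probability that die x shows a higher value than die y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

for pair in [(A, B), (B, C), (C, D), (D, A)]:
    print(p_beats(*pair))  # each line prints 2/3
```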

As I mentioned in class, you can also solve the Monty Hall Problem. The setup is that there are three doors, with a new car behind one and goats behind the other two. You pick one, and then Monty opens one of the doors you didn't pick - revealing a goat - and asks you whether you'd like to keep your original door or switch to the remaining unopened door. The better strategy is to switch, and you are now equipped to show both that it is and by how much (before clicking on the link, think about it - you can simplify a bit by deciding the car is always behind door #1 and looking at choosing each of the three doors, or vice-versa, since the full scenario is three essentially identical, mutually exclusive copies of the simplified version).
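Once you have thought it through, the simplified scenario (car always behind door #1) can be checked with a short Python sketch:

```python
from fractions import Fraction

# Simplification from above: the car is behind door 1, and the player's
# initial pick is equally likely to be door 1, 2, or 3.
def win_probability(switch):
    wins = Fraction(0)
    for pick in (1, 2, 3):
        if pick == 1:
            # Picked the car: staying wins, switching loses.
            wins += Fraction(1, 3) * (0 if switch else 1)
        else:
            # Picked a goat: Monty must open the other goat door,
            # so switching necessarily lands on the car.
            wins += Fraction(1, 3) * (1 if switch else 0)
    return wins

print(win_probability(switch=True), win_probability(switch=False))  # 2/3 1/3
```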

Part II: Descriptive Statistics

We will begin using the computer at this point. You may use any program you like, but the "official" program will be Excel. It is far from the best statistics package out there (if you are ambitious, try R), but it is easily obtained at Dartmouth and is used widely in the business world. I've found three reasonable tutorials for doing stats in Excel. The Clemson physics department has an overview-style tutorial, which you may prefer if you are already comfortable with Excel and just want to know the syntax of the appropriate commands. Smart has a user-friendly tutorial that is easy to navigate and more complete than the previous. University of Baltimore has what appears to be the most complete page, but it is a little harder to navigate. We will work with Excel in class as well.

The book does not have a computational component, so we must add it. The first topic is making histograms (in Excel, column charts). We will do an example in class, but here is the short version. To make a histogram, create one column of x-axis values (events or outcomes) and make the next column to the right y-axis values (frequencies or percentages). Leave the cell above the left column blank and put the title of the histogram in the cell above the right column. Highlight both columns and click the "charts" button or option in the "insert" drop-down menu (depending on your version of Excel); follow the instructions. If you want to change the color or separation of the columns, double-click on one of them once the chart has been made.

If you originally have just a list of outcomes and you need to count them, Excel has a frequency command to do that. You must make a column of outcome values (they should be in increasing order but do not need to be consecutive; if y appears just below x, Excel treats y as the range of numbers >x and ≤y) called bins. Highlight a column of blank cells as tall as your column of bins and type
=frequency(data range, bins range)
and do the special array-function version of "enter". On a Mac this is command-enter (⌘-enter) and on Windows it is ctrl-shift-enter. For example, if you have 5 bins in cells B2 to B6, and your data is in A1 to A50, it makes the most sense (for later histogram creation) to highlight C2 to C6. The command is then
=frequency(A1:A50,B2:B6)     (⌘-enter / ctrl-shift-enter).
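If you are curious what frequency is actually doing, here is a sketch of its counting rule in Python (not needed for the course; note that the real Excel function also returns one extra overflow count if you highlight one cell more than you have bins):

```python
# The count for bin y (with x the bin just above it) is the number of
# data values > x and <= y; the first bin has no lower cutoff.
def frequency(data, bins):
    counts = []
    lower = float("-inf")
    for upper in bins:
        counts.append(sum(1 for v in data if lower < v <= upper))
        lower = upper
    return counts

data = [1, 2, 2, 3, 5, 5, 6, 9]  # made-up outcomes
print(frequency(data, [2, 5, 9]))  # [3, 3, 2]
```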

Excel fun fact: If you are making a bins column and the values are evenly spaced, Excel can fill them in for you. Put in the first two, and highlight those two cells. Grab the bottom right corner of the highlighted region and drag down. Excel will fill in the values with the same spacing as your first two values.
To find the range of your data you can use Excel's min and max functions. For the example above they would be =min(A1:A50) and =max(A1:A50) executed in blank cells, with an ordinary "enter".

A quick note on the book: root-mean-square (section 4) is written in that order to match up with function composition: root(mean(square(data points))).

We added a few statistics here; for a uniform presentation I'll go through the ones in the book and the ones we added together. There are three common single-value descriptions of a data set: the mean, median, and mode. The mean is the arithmetic average (=average(data) in Excel). The median (=median(data)) is the value x such that (at most) half the data points have value less than x and (at most) half have value greater than x (some may have value equal to x). By hand, you find it by arranging your data set in increasing order and taking either the value in the center, or the mean of the two values in the center should you have an even number of data points. The mode (=mode(data)) is the value or values that occur most frequently. Unfortunately, Excel reports only the value that first reaches the maximum count, so the data set entered as {2,2,2,3,3,3} will be reported as having mode 2, while the same values entered as {3,3,2,3,2,2} will be reported as having mode 3, when in fact the data set is bimodal with modes 2 and 3.
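For comparison, Python's standard library handles the bimodal case correctly; a quick sketch with the same data set:

```python
import statistics

data = [2, 2, 2, 3, 3, 3]

print(statistics.mean(data))       # 2.5
print(statistics.median(data))     # 2.5 (mean of the two center values)
print(statistics.multimode(data))  # [2, 3] -- both modes, unlike Excel's mode()
```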

Note that although in this class, as in the textbook, we will use "average" as a synonym for "mean", in the outside world it is often used as a nontechnical term that could indicate the mean or the median, as "middles", or even occasionally the mode, as the value you are most likely to get if you pick a data point at random from the set.

If your data's histogram is bell-shaped (approximately normal; see Chapter 5), the mean, median, and mode will be very close in value.

For the mean and median we have associated values that measure spread. The median goes with the range and quartiles: the range of a data set is its least and greatest values (=min(data), =max(data)), and the first and third quartiles are the 25% and 75% points; they generalize the median, which could be called the second quartile (=quartile(data, i) for i=1 or 3). Put together, the minimum, first quartile, median, third quartile, and maximum are called the five-number summary of the data set and may be displayed in a box plot. The word "quartile" should remind you of "percentile", and indeed the quartiles give you the 25th and 75th percentiles of the data set.

The mean goes with the standard deviation, the root-mean-square of the data. There is a subtlety here we will not address until later; right now the ramification is that in Excel you need to use =stdevp(data) (or in other statistical calculators, the population SD rather than the sample SD). The square of the standard deviation is called the variance and is sometimes used itself, though not very often because it does not have the same units of measurement as the data and standard deviation do.
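For those working in Python instead, here is a minimal sketch of the population SD (what =stdevp computes), on a small made-up data set:

```python
from math import sqrt

# Population SD: the root-mean-square of the deviations from the mean.
def stdevp(data):
    m = sum(data) / len(data)
    return sqrt(sum((x - m) ** 2 for x in data) / len(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5; squared deviations sum to 32
print(stdevp(data))  # 2.0
```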

As we will see, the standard deviation does not work as advertised for some data. When the data is bell-shaped, it is a good measure of spread: about 68% of the data will be within one standard deviation of the mean, and about 95% within two standard deviations. If the data is not bell-shaped, it is anyone's guess. Quartiles and percentiles always work as advertised - why do we not use them? If we have the entire data set, there is no reason not to; they give more information. However, it is difficult to draw conclusions with any certainty about the percentiles of a full population given data about only a sample, and easier to do so for standard deviation. Since statistics loses a lot of its utility if we can't draw conclusions from samples, we need to focus on standard deviation.

» Links for lecture April 12: Excel files for standard deviation: calculation example as comma-delimited (csv) or tab-delimited (txt), effect of outliers example as comma-delimited (csv) or tab-delimited (txt). The second file is simply a data set I made up to demonstrate a point. The first file is a data set I created using random.org's coin flipper and dice roller, and needs a little explanation: in the first data set, I treated heads as 2 and tails as 1/2, and multiplied the result of pairs of flips and rolls. In the second data set, using the same flips and rolls, I treated heads as 1 and tails as -1 and added the result of pairs of flips and rolls. Part of the point of the first file is to compare the actual spread of the data to the spread it should have if it is normally distributed - since in these cases we have all the data, we can make an explicit comparison. One can see the first data set (the product) is wildly different from the normal distribution, and the second (the sum) is far closer. Later, we will want to draw conclusions about populations based on data from samples, and the conclusions we draw might only be valid in the case of a normal distribution.

Nothing in particular to add here.

» Links for lecture April 14: Change of scale examples, as a webpage so I know the graphs will work for you. I also have it as an Excel workbook file (xls).

»   Quiz 3 (Apr 16) will cover all of Chapters 3, 4, and 5. You should bring your answers to Supplemental Exercise #1 from both Chapter 4 and Chapter 5; no other outside materials should be included.       Answers (pdf)

»   Midterm 1 (Apr 21) will cover all of Chapters 3, 4, 5, 6, 13, 14, and 15, including both textbook and supplemental exercises. Note that the quizzes are basically spot checks of the homework and as such are not comprehensive of the material that will be on the exam.       Extra review problems       Comments on quizzes       Answers (pdf)

Part III: Correlation and Regression

Making a scatterplot in Excel is very much like making a histogram. This time your data can go all the way up to the top - no need for that blank upper left cell. You highlight the two columns of data, with corresponding values in side-by-side cells, and click the "charts" button or option in the "insert" menu (depending on how your Excel looks). There will be a scatter plot option (in my version it is labeled "XY (Scatter)").

A few notes on working in Excel: I do not know any easier way to swap which variable is on the x-axis and which on the y-axis than to swap your data columns. If you have variables A and B and want to be able to plot with each one playing the role of x, it might be easiest to make three columns, A, B, and A again, and then highlight either the first two (putting A on the x-axis) or the second two (putting B on the x-axis) to make the scatter plot.
If you want to put all the data into standard units, there are two ways to take advantage of Excel's highlight-and-drag feature. If your mean is in B1, your SD in B2, and your data in column A, and you simply put "=(A1-B1)/B2", then when you drag, Excel will change not only A1 but also B1 and B2. Instead, you can either replace B1 and B2 with the numerical values of the mean and SD, or, if those values are complicated, you can put "=(A1-$B$1)/$B$2". The dollar signs tell Excel not to change those cell references. In either case, when you highlight that cell and drag the bottom right corner downward, A1 will change but the rest will not.

To compute the correlation coefficient in Excel, your data must be in columns (or rows) and the points must be in the correct corresponding order for variables x and y (e.g., in the average tiger weight example from class, the same species must be first, second, third, etc for male and female). The command is =CORREL(x-data, y-data). The data need not be in standard units, but it will not change the outcome if it is.
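If you'd like to see the computation itself, here is a Python sketch of r as the book defines it - convert each variable to standard units (using the population SD) and average the products - on made-up data:

```python
from math import sqrt

def correl(xs, ys):
    """r = average of the products of the standard units (population SDs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) / (sx * sy) for x, y in zip(xs, ys)) / n

print(correl([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0 (perfect positive relation)
print(correl([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0
```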

» Links for lecture April 16: A nice page with lots of correlation examples, including those that show you the risk of reading too much into the single number.

» Links for lecture April 23: Excel files for correlation of different estimates of body fat: Original file (csv or txt), averaged for ecological correlation (csv or txt), and put in standard units (csv or txt). The original data came from StatLib. The BMI estimate for body fat was computed using Deurenberg formula #1 as in this paper, which we will look at for other reasons as well. The lean body mass estimate was computed as instructed on this page, using the US Navy's circumference formula. A take on causality.

Excel will draw a regression line for you. Once you have created a scatter plot, click the dots to highlight them, and go to the "charts" drop-down menu. There will be an option reading "Add trendline", and you want a linear trendline/regression.

The method of least squares, which is a way to verify that the regression line really does minimize the root-mean-square of the prediction errors, is beyond the scope of this class (in particular it requires calculus), but if you are curious here is a description.

»   Quiz 4 (Apr 30) will cover all of Chapters 8, 9, and 10. Calculators will be allowed and you do not need any Excel printouts. There will be a computational component - how is r computed, how do you use the summary statistics to predict one variable from another - but also more of an interpretive component than previous quizzes. See for example chapter 9 review (p 153) #7, 8, 10, 11, and chapter 10 review (p 176) #4, 5, 7, 8.       Answers (pdf)

Ecological correlation and the regression fallacy do not appear in every statistics book, though they are important concerns in interpreting statistical data. Here are some additional resources: a paper by David Freedman, one of the authors of your textbook, on ecological inference (it goes well beyond just ecological correlation); an excerpt from the book Interactive Statistics that has an example related to ecological correlation and a section on the regression effect; and a webpage that extensively discusses the regression effect and in particular its connection to public health decisions.

» Link for lecture May 5: our graph of averages example is available in xls format, tab-delimited plain text, or as a webpage (without the full data set in that last case, but with the averages data set and the scatterplot with regression lines).

Part V: Chance Variability

An alternative to box models is the probability distribution; when the number of tickets is large, this can be an easier way to organize the information. Box models are always legitimate, of course, but by looking at an example we'll see how probability distributions may be used in their stead.

Suppose you have a game where you spin a wheel which is divided up into sections that have different colors corresponding to different prizes (or lack of prizes). You pay $1 to play. If the wheel stops with a pointer on blue, nothing happens. If it stops with a pointer on yellow, you get your dollar back. If it stops with a pointer on green or red, you get $5 or $10, respectively. If 1/3 of the wheel is blue, 1/3 is yellow, and 1/6 each are green and red, what are the expected value and standard error for one round of the game?

The box model here is not very difficult: one ticket corresponding to each of green and red, and two corresponding to each of blue and yellow. What are the outcomes? Remembering to take the price of the game into account, red goes with $9, green with $4, yellow with $0 and blue with -$1. If you want to make a probability distribution, you can take the proportions from the wheel directly:

 outcome   probability
    9         1/6
    4         1/6
    0         1/3
   -1         1/3

To average, multiply across and add the products: 9/6 + 4/6 + 0 - 1/3 = 11/6 = $1.83. You can confirm that this is the same average you get from the box model (and see this is not a game you would find at an actual carnival).

For the standard error it's not as straightforward (since the SD itself is more complicated to compute), but the distribution is still useful. Now you take the squared differences between the outcomes and the mean, multiply those by the probabilities, add, and finally take the square root. The quantity being square-rooted here is (7.17)^2/6 + (2.17)^2/6 + (-1.83)^2/3 + (-2.83)^2/3. The SE is 3.62.
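The same arithmetic in a short Python sketch, if you'd like to check it:

```python
from math import sqrt

# The wheel game above: net outcomes paired with their probabilities.
dist = [(9, 1/6), (4, 1/6), (0, 1/3), (-1, 1/3)]

ev = sum(x * p for x, p in dist)                    # expected value
se = sqrt(sum(p * (x - ev) ** 2 for x, p in dist))  # standard error
print(round(ev, 2), round(se, 2))  # 1.83 3.62
```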

The other extra example is determining what the outcomes should be in order to get a specific expected value, such as to make a price fair. One of the most significant applications of this is pricing insurance. Suppose a man is thinking about a service plan for a delicate piece of electronics, of the kind that tends not to be repairable when it breaks. If the device has a probability of 0.1 of breaking in the next five years, and the replacement cost is $1500, what is the fair price? Well, thinking in terms of box models, we have one "breaks" ticket and nine "doesn't break" tickets. The outcome associated with "breaks" is 1500 minus the cost of the service plan, and the other nine tickets have outcome negative the cost of the service plan. The average is [(1500-x) + 9(-x)]/10 = 150 - x, where x is the price. A fair price makes the average $0, so the fair price is $150.

»   Quiz 5 (May 7) will cover Chapters 11, 12, and 16, and the first two sections of Chapter 17. Calculators will be allowed and you do not need any Excel printouts.       Answers (pdf)

The remaining portion of Chapter 17 is bits and pieces: a shortcut to calculate the SD, another class of examples of box models, and some notes on the normal curve. We have already seen probability histograms, which treat the number of ways to get a given outcome as its frequency and make a histogram in the same way as if it were a data set obtained experimentally. "Officially" probability histograms do not appear until Chapter 18, but we have incorporated them here. The other thing we will pull from Chapter 18 is the idea of the Central Limit Theorem, which for us basically says that although the probability histogram for a particular box may be far from normal, as we make more and more draws and sum the results, the histogram approaches normal. As a consequence, for large numbers of draws it is valid to estimate percentages of outcomes in given ranges using the normal curve, just as in Chapter 5. We will leave the exact meaning of "large" unspecified; there is an extensive body of statistical research devoted to pinning that down in various settings.
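A quick simulation (Python, not required for the course) illustrating the Central Limit Theorem for a decidedly non-normal box; the box and the number of draws are made up for the demonstration:

```python
import random

random.seed(0)

# A lopsided box: three 0 tickets and one 1 ticket.
box = [0, 0, 0, 1]
draws, trials = 100, 10_000

# The sum of 100 draws, repeated many times.
sums = [sum(random.choice(box) for _ in range(draws)) for _ in range(trials)]

ev = draws * 0.25                       # expected value of the sum
se = (draws * 0.25 * 0.75) ** 0.5       # SE of the sum
inside = sum(1 for s in sums if abs(s - ev) <= se) / trials
print(inside)  # roughly 0.68, as the normal curve predicts
```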

»   Midterm 2 (May 12) will cover all of Chapters 8, 9, 10, 11, 12, 16, and 17, plus the bits pulled from Chapter 18 and specified above. This includes both textbook and supplemental exercises. Note that the quizzes are basically spot checks of the homework and as such are not comprehensive of the material that will be on the exam. Calculators are allowed on this exam.       Extra review problems       Comments on quizzes       Answers (pdf)       illustration for #10 (pdf)

Part VI: Sampling

General Notes:

We skipped Chapters 19, 22, and 23: 19 and 22 are essentially examples, so you may want to read them for extra understanding, and 23 is how to apply what we covered in Chapters 20 and 21 to averages instead of sums/percentages. The only thing we added is a little newspaper literacy: when a paper reports a candidate had 53% support with a margin of error of ± 4%, they (generally) mean the 95% confidence interval for the support for that candidate is 49-57%.

Part VIII: Tests of Significance

General Notes:

We're skipping around a bit here to see what all we can fit in before the quarter ends. The total: Chapter 26 Sections 1-5; Chapter 28; Chapter 27 Sections 1-2.

Excel commands: The z-test does not require Excel. Excel has a very nice built-in function for the chi-squared test, called (appropriately enough) CHITEST(observed frequencies, expected frequencies). Your data needs to be all in a row or a column if you are testing a distribution, with the expected frequency in another row or column in the same outcome order. If you are testing for independence, you need the data in the same format as the tables we've been looking at in class, and a second table with the expected frequencies. For example, if you had observed data in the first 5 cells of the first column, and expected frequencies next to them, the command would be =chitest(A1:A5,B1:B5). If you had a 3x3 table where the observed values were in the second, third, and fourth rows of columns B-D and the expected values were in the same rows of F-H, the command would be =chitest(B2:D4,F2:H4). The value Excel gives you is the P-value, the probability, rather than the chi-squared value itself; also, it gives it as a decimal, not a percent. It counts the degrees of freedom automatically (which is why the format of your data is important: a 9-outcome distribution test has 8 degrees of freedom, whereas a test for independence between two sets of three values each has only 4, though each of them has 9 observed and expected frequencies).
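If you want to see the arithmetic behind CHITEST, here is the chi-squared statistic itself in Python (the observed counts below are made up; CHITEST goes one step further and converts the statistic to a P-value using the degrees of freedom):

```python
# The chi-squared statistic: sum of (observed - expected)^2 / expected
# over all cells. For a table, feed in the cells as flat lists.
def chi_squared(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical die-fairness check: 60 rolls, expected 10 of each face.
observed = [5, 8, 9, 8, 10, 20]
expected = [10] * 6

print(chi_squared(observed, expected))  # ≈ 13.4
# Degrees of freedom for this 6-outcome distribution test: 6 - 1 = 5.
```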

» Links for lecture May 19/20: Excel file for chi-squared test examples: Australian national lottery as txt or csv; cremation temperatures as txt or csv.

»   Quiz 6 (May 21) will cover Chapters 26 and 28, though the chi-squared test only for distribution, not independence. Calculators will be allowed and you do not need any Excel printouts.       Answers (pdf)

»   The Final Exam (June 4) will cover Parts II, III, IV, V, VI, and VIII, except for Chapters 7, 19, 22, and 29. Also, Chapter 18 will be covered only as much as it was for Midterm 2 (central limit theorem), and the Chapter 27 problems will only cover the two-sample z-test for averages (Sections 1 and 2). This includes both textbook and supplemental exercises. Note that the quizzes are basically spot checks of the homework and as such are not comprehensive of the material that will be on the exam. Calculators are allowed on this exam.       Extra review problems       Comments on quizzes       Topics list (pdf)

Back to the main Math 10 page