Collecting, organizing, displaying, and interpreting data, and using it to make decisions or predictions, has become an important skill in today's society. People are inundated through the media with statistical data from opinion polls, advertisements, surveys, and medical research. Understanding the uses and misuses of data can help you and your students make informed decisions about health care, the environment, education, financial matters, and political issues. Understanding what happens to various statistics when new data are added or existing data are deleted is an important aspect of data analysis. Knowing about sampling techniques and being able to recognize biased data are also essential for judging whether conclusions drawn from the data are valid.
The three common measures of central tendency (mean, median, and mode) each give a single value that is representative of all the values in the data set. Each of the three measures has particular strengths and weaknesses, depending on the data set.
The mean is found by adding up all the items in the data set and dividing the result by the number of items. This is what most people have in mind when they refer to the "average." However, as we shall see, the mean can be greatly affected by outliers, or values which are much greater or much less than most of the data.
The median is the middle item when the data are arranged in order from least to greatest. If the number of items is even, there are two middle items, and the median is their average. The median is not greatly affected by outliers.
The mode is the value that appears most frequently. It is affected little, if at all, by outliers. It is probably the least useful measure when the data are numerical. However, it is very useful, and usually the only measure that makes sense, when the data are categorical (that is, non-numerical), as in "What is the most common color for an automobile?"
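The three definitions above translate almost directly into code. The following is a minimal Python sketch, not a definitive implementation: the function names and the sample values are made up for illustration and are not taken from the data sets discussed here. For an even number of items, the median function averages the two middle items, which is the usual convention.

```python
from collections import Counter

def mean(values):
    """Add up all the items and divide by the number of items."""
    return sum(values) / len(values)

def median(values):
    """Middle item of the sorted data (average of the two middle items if the count is even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """Every value that appears most frequently (there may be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical numerical data with one outlier (40).
scores = [2, 4, 4, 5, 40]
print(mean(scores))    # 11.0 (pulled upward by the outlier)
print(median(scores))  # 4
print(modes(scores))   # [4]

# Hypothetical categorical data: only the mode makes sense here.
colors = ["white", "black", "white", "gray", "red", "white"]
print(modes(colors))   # ['white']
```

Returning a list from `modes` is a design choice: it lets the same function report a data set that has two or more modes.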
Let's take a look at some particular data sets and the effects that adding and deleting data and outliers have on the mean, median and mode. Consider the data set below:
The mean for this data set is about 5.222, the median is 5, and the mode is 3. If we include another data item such as 23, the mean increases to 7, the median increases to 5.5, and the mode remains the same. An increase in the mean to 7 is substantial for a single added item. The median changed, but not by much. The mode did not change at all, but it was not very representative of the data set to begin with.
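The effect can be reproduced with Python's standard `statistics` module. The list below is a stand-in, not the original data set: it is a hypothetical set of nine values chosen only because it yields the same statistics quoted above (mean about 5.222, median 5, mode 3). The helper name `summarize` is likewise made up for this sketch.

```python
import statistics

# Hypothetical stand-in data: NOT the original data set, just nine values
# whose mean, median, and mode match the figures quoted in the text.
data = [1, 3, 3, 3, 5, 6, 7, 9, 10]

def summarize(values):
    """Return (mean rounded to 3 places, median, list of modes)."""
    return (
        round(statistics.fmean(values), 3),
        statistics.median(values),
        statistics.multimode(values),
    )

before = summarize(data)
after = summarize(data + [23])  # add one value far above the rest

print(before)  # (5.222, 5, [3])
print(after)   # (7.0, 5.5, [3])
```

With ten items, the median is the average of the fifth and sixth values (5 and 6 in this stand-in), which is why it moves only from 5 to 5.5 while the mean jumps to 7.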
Consider another data set:
In this particular data set, the mean, median, and mode are all 22. The value of 8 is an outlier, since it appears to be substantially less than the other values. If we add 17, 18, 50, and 54 to the data set, we get this data set:
The median does not change for the new data set, since two of the new data items are greater than the median and two are less than the median. The mode also does not change because 22 is still the data item most frequently repeated. However, the mean increases by 3 points to 25. The outliers of 50 and 54 have increased the mean substantially in this case.
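The change in the mean can be checked without the full data set, because a mean and an item count together determine the sum. The sketch below assumes the original set has 13 items. That count is inferred rather than stated: it is the only one for which a mean of 22 rises to exactly 25 when 17, 18, 50, and 54 are added, since 22n + 139 = 25(n + 4) gives n = 13. The function name is hypothetical.

```python
def mean_after_adding(old_mean, old_count, new_values):
    """Mean of a data set after new values are added, given its old mean and size."""
    total = old_mean * old_count + sum(new_values)
    return total / (old_count + len(new_values))

# Assumption: 13 original items (inferred from the arithmetic, not stated in the text).
new_mean = mean_after_adding(22, 13, [17, 18, 50, 54])
print(new_mean)  # 25.0

# Adding only the two small values would lower the mean instead,
# which shows that 50 and 54 are what drive the increase.
print(mean_after_adding(22, 13, [17, 18]))
```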
Let's look at what would change if, instead of the four data items added above, we added 29, 29, and 30 to the original data set.
The new mean increases to about 23.4, since all three new data items are greater than the previous mean: the data set now has 16 items, so the original 13 items sum to 13 × 22 = 286, and (286 + 29 + 29 + 30) / 16 = 23.375. The median also increases, to 23.5, since it is now the average of the eighth and ninth data items. The data set now has two modes, since 22 and 29 each appear three times.
Data are collected from a population, or group. When a population is so large that collecting data from every member is impractical, data are collected from a sample, or a part of the population. One of the most important tasks a statistician has when collecting data from a sample is to make sure that the sample is a random sample and not a biased sample. In a random sample, all members of the population have an equal chance of being selected. In a biased sample, not all members of the population have an equal chance of being selected, so the data collected might not truly reflect the population. A common mistake occurs when the investigator collects a random sample from a subset of the population that does not reflect the whole population. An example is collecting data about political candidates only from readers of a certain newspaper. Since most readers of that newspaper may have the same political leaning, the results would not reflect the voter population in general.
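The newspaper example can be illustrated with a small simulation. Everything in the sketch below is synthetic: the population size, the share of newspaper readers, and the support levels are invented numbers chosen only to make the effect visible, and the names are hypothetical.

```python
import random

# Synthetic population of 10,000 voters (all numbers are invented).
# Each voter is a pair: (reads_paper, supports_candidate).
readers = [(True, True)] * 1600 + [(True, False)] * 400         # 80% support
non_readers = [(False, True)] * 3400 + [(False, False)] * 4600  # 42.5% support
population = readers + non_readers

def support_share(voters):
    """Fraction of the given voters who support the candidate."""
    return sum(supports for _, supports in voters) / len(voters)

rng = random.Random(0)  # fixed seed so the run is repeatable

random_sample = rng.sample(population, 500)  # every voter equally likely
reader_sample = rng.sample(readers, 500)     # random, but only among readers

print(support_share(population))     # 0.5, the true value
print(support_share(random_sample))  # close to the true value
print(support_share(reader_sample))  # close to 0.8, far from the true value
```

Both samples are drawn at random, but only the first gives every member of the population an equal chance of selection. The second is a random sample of the wrong group, so its estimate reflects the readers rather than the voters as a whole.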
Assume we want to find out what the three most popular flavors of ice cream are among 300 sixth grade students at a school, without having to ask every single student. Let's discuss the pros and cons of the following samples.
The first sample, consisting of only 10 students, is too small and may not reflect the whole population of students. The second sample, 30 students getting off a bus, is biased: not all students come to school by bus, so not all students would have an equal chance of being selected. The third sample seems large enough to give a good idea of the views of the 300 students, and since the students are selected randomly, everyone has an equal chance of being chosen.
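Drawing a sample in which every student has an equal chance of selection is straightforward when a full roster is available. The following is a minimal sketch using `random.sample` from Python's standard library. The roster of numbered IDs, the function name, and the sample size of 50 are all assumptions made for illustration.

```python
import random

# Hypothetical roster: IDs 1..300 stand in for the 300 sixth graders.
roster = list(range(1, 301))

def draw_random_sample(students, size, seed=None):
    """Pick `size` distinct students, each equally likely to be chosen."""
    return random.Random(seed).sample(students, size)

# The sample size of 50 is an arbitrary choice for illustration.
sample = draw_random_sample(roster, 50, seed=1)
print(len(sample), len(set(sample)))  # 50 50 (no student is picked twice)
```

By contrast, surveying only the students getting off a bus amounts to sampling from a filtered roster, which leaves out some students no matter how the selection within that group is made.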