## Analyzing and Representing Data: Overview

The process of collecting, organizing, representing, describing, and analyzing data and using data to make decisions or predictions has become a more important skill than ever. Through the media, people are exposed to statistical data from opinion polls, advertisements, surveys, the stock market, and medical research. Understanding the uses and misuses of data can help you and your students make informed decisions about education, financial matters, health care, politics, and the environment. Understanding what happens to various statistics when new data are added or existing data are deleted is an important aspect of data analysis. Looking for **clusters** (intervals containing many data points) and **gaps** (intervals that contain no data) may provide useful information so that decisions can be made. For example, if a teacher finds that the test scores on a math test cluster around 75, he or she might decide to add several easier problems and one or two more difficult problems to provide a greater range of scores. Understanding, representing, and analyzing data and making reasonable conclusions based on the data are skills all students need to learn.

The three common measures of central tendency, **mean**, **median**, and **mode**, each name one number that is in some way representative of all the numbers in a data set. Each of the three measures can be more or less representative of a given data set.

The mean is found by adding all the numbers in a data set and dividing that sum by the number of data points. As we shall see, the mean can be greatly affected by **outliers**, values that are much greater or much less than most of the data.

The median is the middle number of a data set arranged in order from least to greatest. When a data set has an even number of values, the median is the mean of the two middle numbers. Since the median is not greatly affected by outliers, it may be a better measure to use than the mean when outliers are present. For example, if the hourly salaries for five restaurant employees are $8.75, $9.12, $9.30, $10.20, and $18.40, a new employee with little experience is more likely to be offered $9.30/hour (the median) than about $11.15/hour (the mean). In this data set the outlier of $18.40 has increased the mean substantially.
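The salary comparison can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are chosen here for illustration only.

```python
from statistics import mean, median

# Hourly salaries from the restaurant example, in dollars.
salaries = [8.75, 9.12, 9.30, 10.20, 18.40]

salary_mean = mean(salaries)      # sum of the values divided by the number of values
salary_median = median(salaries)  # middle value once the data are in order

print(f"mean:   ${salary_mean:.2f}")    # mean:   $11.15
print(f"median: ${salary_median:.2f}")  # median: $9.30
```

The single salary of $18.40 pulls the mean almost two dollars above the median, while the median stays at a value that four of the five employees earn or come close to.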

The mode is the number or numbers that appear most often in a data set. It is usually not affected by outliers. It is probably the least useful measure if the data are numerical. However, it is very useful, and usually the only measure that makes sense, if the data are categorical (that is, nonnumerical), such as data collected to answer the question, “What is the most common household pet among students in this class?”
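As a small illustration of the mode for categorical data, the sketch below uses `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for most frequent. The list of pets is hypothetical, invented only to show the idea; it is not survey data from this text.

```python
from statistics import multimode

# Hypothetical class survey answers, made up for illustration only.
pets = ["dog", "cat", "dog", "fish", "cat", "dog", "hamster"]

# multimode returns a list because a data set can have more than one mode.
most_common_pets = multimode(pets)
print(most_common_pets)  # ['dog']
```

A mean or median of these answers would be meaningless, since pet names cannot be added or put in numerical order.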

The **range** of a data set is a measure of how widely dispersed the data are. It is found by subtracting the least value from the greatest value in a set of numerical data.

Let's examine some data sets and the effects that adding and deleting data and outliers have on the mean, median, mode, and range. Consider this data set: 2, 4, 5, 5, 5, 6, 6, 7, 8, 9.

The mean of these data is 5.7, the median is 5.5, the mode is 5, and the range is 7. If we include another data point, such as 42, the mean would increase to 9, the median would increase to 6, the mode would remain the same, and the range would become 40. Adding this single outlier increases both the mean and the range substantially. However, the median increases by only 0.5, and the mode does not change at all.
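One way to verify these figures is sketched below. The helper `summarize` is a hypothetical name introduced here, not something defined elsewhere in this text; it relies only on Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

def summarize(data):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "modes": multimode(data),         # list, since ties are possible
        "range": max(data) - min(data),   # greatest value minus least value
    }

original = [2, 4, 5, 5, 5, 6, 6, 7, 8, 9]
with_outlier = original + [42]  # same data plus the single outlier

print(summarize(original))
print(summarize(with_outlier))
```

Comparing the two printed summaries side by side shows which statistics move when the outlier is added and which stay put.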

Consider another data set: 8, 17, 19, 20, 21, 22, 22, 22, 25, 26, 27, 28, 29. In this particular data set, the mean, median, and mode are all 22 and the range is 21. The value 8 is an outlier, since it is substantially less than the other values. If we take away the value of 8, we get this data set: 17, 19, 20, 21, 22, 22, 22, 25, 26, 27, 28, 29. The median is the same for the new data set. The mode also does not change, because 22 is still the number that occurs most frequently. However, the mean increases by about 1.2 to approximately 23.2, and the range decreases substantially to 12.
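The effect of deleting the outlier can be checked the same way. The following is a minimal sketch; the list comprehension simply drops the value 8, and the labels are chosen here for readability.

```python
from statistics import mean, median, multimode

data = [8, 17, 19, 20, 21, 22, 22, 22, 25, 26, 27, 28, 29]

# Build a second list without the outlier, leaving the original untouched.
trimmed = [x for x in data if x != 8]

for label, values in (("with 8", data), ("without 8", trimmed)):
    spread = max(values) - min(values)
    print(f"{label:>9}: mean={mean(values):.2f}  median={median(values)}  "
          f"modes={multimode(values)}  range={spread}")
```

The unrounded mean of the trimmed set is 278 ÷ 12, or about 23.17, which is why the increase is best described as roughly 1.2.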

There are many ways to present data. **Bar graphs** are used to represent discrete data, with the height or length of each bar showing the value for one category. **Double bar graphs** are often used to compare two data sets. For example, the double bar graph shown below compares the percent of women workers in four different occupations in 1975 and in 1990. By placing the bars for 1975 and 1990 side by side for each occupation, it is possible to conclude that the percent of women increased between 1975 and 1990 in occupations that require strong mathematics and science backgrounds.

Source of data: *The Universal Almanac*, 1992

**Line graphs** are used to represent continuous data, such as trends over time. **Double line graphs** are used to compare two continuous quantities, such as the tuition and fees for four-year public and private colleges over a span of time as shown below. The double line graph shows that the difference between the tuition and fees for private colleges and public colleges has increased over the 14-year span studied. A smooth curve is drawn between points to indicate a trend.

Source of data: *The Universal Almanac*, 1996

A **stem-and-leaf plot** is a type of graph that preserves the original values by using the digits of the data themselves to build the graph.

Let's take a look at the number of shots taken by the Evans School basketball team in their last nine games: 58, 67, 57, 74, 63, 59, 72, 65, and 64. First, we arrange the data in order (57, 58, 59, 63, 64, 65, 67, 72, and 74). Then, using 5, 6, and 7 as the stems (the tens digits), we arrange the leaves (the ones digits) in the form of a horizontal bar graph.
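The construction just described can be sketched in Python as a text version of the plot. The function name `stem_and_leaf` is a hypothetical helper introduced here, and the sketch assumes two-digit whole numbers, as in the basketball data.

```python
from collections import defaultdict

shots = [58, 67, 57, 74, 63, 59, 72, 65, 64]

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), with ones digits (leaves) in order."""
    plot = defaultdict(list)
    for value in sorted(values):          # sort first so leaves come out in order
        stem, leaf = divmod(value, 10)    # e.g. 58 -> stem 5, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(shots)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
# 5 | 7 8 9
# 6 | 3 4 5 7
# 7 | 2 4
```

Each row works like a bar whose length shows how many games fall in that range, yet every original score can still be read back from its stem and leaf.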

A **box-and-whisker plot** is a relatively new method of representing data. It gives a visual display of the median and it provides information about the range, the upper and lower quartiles, and the distribution of the data. The box contains the middle half of the data with the ends of the box being the lower and upper **quartiles**. The median splits the data set in half, and the lower quartile is the median of the lower half of the data. Similarly, the upper quartile is the median of the upper half of the data. The line drawn in the box is the median. The whiskers extend from the lower quartile to the lowest number in the data set and from the upper quartile to the greatest number in the data set.

Using the data from the stem-and-leaf plot shown above, the median is 64. The lower half of the data is 57, 58, 59, 63, so the lower quartile is the mean of 58 and 59, or 58.5. The upper half is 65, 67, 72, 74, so the upper quartile is the mean of 67 and 72, or 69.5. To draw the box-and-whisker plot, we need a number line that includes the values from 57 to 74, as shown below.
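The five values needed to draw the plot can be computed with the sketch below. The function `five_number_summary` is a hypothetical helper written for this illustration. It follows the convention used in this section: each quartile is the median of one half of the data, and when the number of values is odd, the overall median is left out of both halves. Software packages may use other quartile conventions and give slightly different results.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    if len(values) < 2:
        raise ValueError("need at least two values to form two halves")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # values below the middle position
    upper_half = ordered[-half:]   # values above the middle position
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )

shots = [58, 67, 57, 74, 63, 59, 72, 65, 64]
print(five_number_summary(shots))  # (57, 58.5, 64, 69.5, 74)
```

The box runs from 58.5 to 69.5 with a line at 64, and the whiskers reach out to 57 and 74.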