Unit 1: Exploring One-Variable Data
1.1 Introducing Statistics: What Can We Learn from Data?
1.2 The Language of Variation: Variables
1.3 Representing a Categorical Variable with Tables
1.4 Representing a Categorical Variable with Graphs
1.5 Representing a Quantitative Variable with Graphs
1.6 Describing the Distribution of a Quantitative Variable
1.7 Summary Statistics for a Quantitative Variable
1.8 Graphical Representations of Summary Statistics
1.9 Comparing Distributions of a Quantitative Variable
1.10 The Normal Distribution
Topic 1.1 Introducing Statistics: What Can We Learn from Data?

What Can We Learn from Data?

Unit 1 introduces the concept of data and describes how data can vary in the real world. In some circumstances, this variation may suggest conclusions about the data. However, not all variation is meaningful. The study of statistics will help us understand and make sense of uncertainty and variation.

In this unit, we will learn about categorical and quantitative variables and how to represent them appropriately. We will also learn to describe and compare distributions of data that consist of a single variable (one-variable datasets) using tables, graphical representations, summary statistics, or a combination of these methods. These statistical methods will help us assess claims about a particular data point or about a group of data points, such as a sample.

Later in the unit, we will be introduced to the normal distribution. Applying the normal distribution will be our first step in understanding how some distributions of sample data can be described using theoretical models for populations. This brief introduction to the normal distribution is a building block for later units, where we learn to use probabilistic modeling to make statistical inferences.

To prepare for the AP exam, we suggest focusing on developing the following skills:
• For categorical variables, be able to:
  o Represent and describe variables using frequency tables, relative frequency tables, and bar charts.
• For quantitative variables, be able to:
  o Represent and describe variables using histograms, stem-and-leaf plots, dotplots, and boxplots (including the five-number summary).
  o Describe the distribution in terms of shape, center, and variability (spread), as well as any unusual features such as outliers, gaps, clusters, or multiple peaks.
  o Recognize skewness (left and right) and how the mean relates to the median in skewed distributions.
  o Calculate the main measures of center, position, and spread.
  o Understand and apply the empirical rule for normal distributions.
  o Determine proportions and percentiles from a normal distribution (using z-scores).
  o Use percentiles and z-scores to compare the relative positions of points within a dataset.
Topic 1.2 The Language of Variation: Variables

You Will Learn To:
• Differentiate between individuals and variables in a dataset.
• Identify the basic types of variables.

Variables and Data

Statistics deals with the collection, analysis, interpretation, and presentation of data. One essential piece of vocabulary when working with data is a dataset. A dataset is a collection of data.

Example 1.2.1
The table below is a dataset that contains data from students at Northside High School who participated in an out-of-state math competition.

Every piece of data in a dataset reflects a characteristic of an individual. In the example dataset, the individuals are students. Typically, individuals in a dataset are represented by rows.
A characteristic that changes from one individual to another is known as a variable. When a characteristic does not change, it is called a constant.

Example 1.2.2
In the dataset from Example 1.2.1, there are 3 variables (student name, class level, and GPA). The value of each of these characteristics can differ from one student to another. However, the high school characteristic is a constant, not a variable. Its value stays the same because all the individuals come from the same high school (Northside).

Basic Variable Types

Some variables, like hair color and blood type, use labels (categories) to describe individuals. Other variables, like height and test score, use numbers to quantify a characteristic of individuals.

A categorical variable is a variable that takes on values that are category names or group labels.

A quantitative variable is a variable that takes on numerical values for a measured or counted quantity.
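The definitions above can be illustrated with a minimal Python sketch. The rows below are hypothetical (the student names, class levels, and GPAs are invented for illustration and are not the values from Example 1.2.1), and the helper name `describe_characteristics` is an assumption of this sketch. It treats a characteristic with a single distinct value as a constant and uses a deliberately simple rule for the variable type.

```python
# Hypothetical dataset: one dict per individual (student). All values are invented.
students = [
    {"name": "Student A", "class_level": "Junior", "gpa": 3.6, "school": "Northside"},
    {"name": "Student B", "class_level": "Senior", "gpa": 3.9, "school": "Northside"},
    {"name": "Student C", "class_level": "Junior", "gpa": 3.2, "school": "Northside"},
]


def describe_characteristics(rows):
    """Label each characteristic as a constant or as a type of variable."""
    summary = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        if len(set(values)) == 1:
            # The value never changes from one individual to another.
            summary[key] = "constant"
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            # Simplification: numbers are assumed to be measured or counted quantities.
            # Numbers used only as labels would still be categorical, and this
            # check cannot detect that; it takes knowledge of the context.
            summary[key] = "quantitative variable"
        else:
            summary[key] = "categorical variable"
    return summary


summary = describe_characteristics(students)
for key, label in summary.items():
    print(f"{key}: {label}")
```

With these invented rows, only `school` has a single distinct value, so it is the only characteristic reported as a constant.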
Example 1.2.3
The dataset from Example 1.2.1 contains 3 variables.
• Two of the variables (student name and class level) include labels rather than numbers, so they are categorical.
• One variable (GPA) takes on numerical values for a measured quantity (grade point average), so it is quantitative.

A categorical variable can also take numbers as values if those numbers have no numerical meaning. For example, if the student names from Example 1.2.1 were replaced with an identification number from 1 to 6 that has no meaning other than labeling the students, the identification number would still be a categorical variable.
• Non-numerical variables are always categorical variables.
• A variable with numerical values is categorical if arithmetic on the numbers is not meaningful and the numbers are not themselves the result of arithmetic.

Before you test your skills, check your understanding of identifying and classifying variables.
1.2 Check for Understanding

1. Below is a dataset from a large database of plants that includes the plant's common name, mature height measured in feet (ft), growth rate, and minimum root depth measured in inches (in). Which of the variables recorded in the dataset are categorical variables, and which of the variables are quantitative variables?
A. Common name, mature height, and growth rate are categorical variables. Minimum root depth is a quantitative variable.
B. Common name and mature height are categorical variables. Growth rate and minimum root depth are quantitative variables.
C. Common name and growth rate are categorical variables. Mature height and minimum root depth are quantitative variables.

2. The National Football League (NFL) is a professional football league consisting of 32 teams split among 8 divisions (e.g., AFC North, NFC East). Suppose that each of the 8 divisions were labeled with a different number from 1 to 8. Which of the following statements about the variable representing division would be true?
A. The division variable would be quantitative because the labels from 1 to 8 are numerical.
B. The division variable would be categorical because the labels from 1 to 8 only represent groups.
Topic 1.3 Representing a Categorical Variable with Tables

You Will Learn To:
• Identify an appropriate way to represent data for a single categorical variable.
• Construct a frequency table from data for a single categorical variable.
• Create a relative frequency table from data for a single categorical variable.

Frequency Tables and Relative Frequency Tables

It is difficult to understand data when looking at a long list of raw values. This topic and the following topics cover tables, graphs, and statistics that are commonly used to summarize and represent data.

For categorical variables, a frequency table gives the number (frequency) of observations falling into each category. A frequency table is appropriate for a single categorical variable.

Example 1.3.1
A researcher has compiled data about 328 films produced by a film studio since it was founded. The researcher recorded the genre of each film. The data are summarized in the frequency table below.

Notice that the frequency table is much more compact than a table with 328 rows, one for each film.
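As a rough illustration of how a frequency table condenses raw data, the following Python sketch counts category labels with the standard library's `collections.Counter`. The short list of genres is hypothetical (it is not the researcher's data from Example 1.3.1); it only stands in for a long list of raw observations.

```python
from collections import Counter

# Hypothetical raw data: one genre label per film (invented for illustration).
genres = ["Drama", "Comedy", "Drama", "Action", "Comedy", "Drama"]

# Counter tallies how many times each category appears.
frequency_table = Counter(genres)

# Print one row per category, most frequent first.
for genre, count in frequency_table.most_common():
    print(f"{genre:<10}{count}")
```

The six raw observations collapse into three rows, one per category, and the frequencies add up to the number of observations.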
When the number of observations in each category is of interest, a frequency table is useful. However, the proportion (relative frequency) of observations in each category is often easier to interpret. A proportion in fraction form represents a part (numerator) of a whole (denominator).

To create a relative frequency table, divide each frequency in a frequency table by the total number of observations (sum of the frequencies).

Example 1.3.2
To create a relative frequency table from the frequency table in Example 1.3.1, consider that there are 328 films in the list. Divide each frequency (part) by 328 (whole) to calculate the relative frequencies.

Note: The relative frequencies have been rounded to two decimal places.

Sometimes the total number of observations is not given. In that case, add up the frequency of each category (add up all parts) of the categorical variable to find the total number of observations (whole).

A frequency table can also be appropriate for quantitative variables if the number of possible values is small.
Example 1.3.3
A pediatric dermatologist is investigating the types of skin conditions she diagnoses most often in the emergency room. Over one week, she diagnosed 30 cases of eczema, 25 cases of hives, 18 insect bites, 28 cases of non-atopic dermatitis, 15 cases of chickenpox, and 60 cases of other conditions.

To construct a relative frequency table for these data, notice that the dermatologist diagnosed 30 + 25 + 18 + 28 + 15 + 60 = 176 skin conditions over the week. Divide each frequency (part) by 176 (whole) to calculate the relative frequencies.

It is also possible to construct a frequency table from a relative frequency table if the total number of observations is known. Simply multiply each relative frequency by the total number of observations.
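Both conversions can be sketched in a few lines of Python using the counts stated in Example 1.3.3. The variable names are assumptions of this sketch. The relative frequencies are kept unrounded internally and rounded to two decimal places only when printed, so that multiplying by the total returns the original counts.

```python
# Weekly diagnosis counts taken from Example 1.3.3.
frequencies = {
    "Eczema": 30,
    "Hives": 25,
    "Insect bites": 18,
    "Non-atopic dermatitis": 28,
    "Chickenpox": 15,
    "Other": 60,
}

# The whole: sum of all frequencies.
total = sum(frequencies.values())

# Relative frequency = frequency (part) / total (whole).
relative = {condition: n / total for condition, n in frequencies.items()}

for condition, value in relative.items():
    print(f"{condition:<24}{value:.2f}")

# Going the other way: relative frequency * total gives back the frequency.
recovered = {condition: round(r * total) for condition, r in relative.items()}
print(recovered == frequencies)
```

The relative frequencies add up to 1, which is a quick check that no category was left out.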
Example 1.3.4
A teacher recorded the manufacturer of each of the 25 laptops used by her students and then calculated the relative frequency of each manufacturer. The results are shown in the table below.

To construct a frequency table from this relative frequency table, multiply each relative frequency by 25 (the total number of laptops).

If the total number of observations is not given, it is not possible to construct a frequency table from a relative frequency table.

The Usefulness of Frequency Tables and Relative Frequency Tables

Relative frequencies are useful when information about percentages or rates is important. However, both frequencies and relative frequencies reveal information that provides insight into data. In general, the context and the conclusions to be drawn from the data determine which table is most useful.
Example 1.3.5
A retailer stocks different grades of edible olive oil, with premium extra virgin olive oil being the highest quality. A manager believes that higher-grade olive oils (premium extra virgin olive oil and extra virgin olive oil) account for most of the olive oil sold by the retailer. The manager recorded how many bottles of each type of edible olive oil were sold over the last month.

To check the manager's belief, determine the relative frequency of bottle sales for higher-grade olive oils. To find the whole (denominator), add all numbers in the frequency column. To find the part (numerator), add the frequencies for premium extra virgin and extra virgin olive oils. Finally, to calculate the relative frequency, divide the resulting frequency (part) by the total number of bottles sold (whole).

The relative frequency of higher-grade olive oils sold is approximately 0.55, which is greater than 0.50. Therefore, the data support the manager's belief for sales over the last month.

A frequency (or relative frequency) table shows a list of the possible values of categorical data along with how often those values occur.
This is known in statistics as the distribution of a variable. In general, every variable (categorical or quantitative) has a distribution. It will be very important to analyze distributions of variables to understand the unique features of the data those variables represent.
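The part-over-whole reasoning from Example 1.3.5 can be sketched in Python as follows. The bottle counts here are hypothetical: they are invented for illustration and are not the values from the example's table, and the label "Other grades" is an assumption that lumps together every grade below extra virgin.

```python
# Hypothetical monthly bottle counts (invented; not the example's actual table).
bottles_sold = {
    "Premium extra virgin": 120,
    "Extra virgin": 100,
    "Other grades": 180,
}
higher_grades = ["Premium extra virgin", "Extra virgin"]

# Whole (denominator): all bottles sold.
whole = sum(bottles_sold.values())

# Part (numerator): bottles of the higher grades only.
part = sum(bottles_sold[grade] for grade in higher_grades)

relative_frequency = part / whole

print(f"{part} / {whole} = {relative_frequency:.2f}")
print("More than half of sales:", relative_frequency > 0.5)
```

Comparing the combined relative frequency with 0.50 is what decides whether the higher grades make up most of the sales.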
1.3 Check for Understanding

1. The frequency table shown below summarizes the distribution of final letter grades for a chemistry class. Which of the following is the relative frequency of students with a final grade of B or higher in the chemistry class?
A. 0.21
B. 0.30
C. 0.40
D. 0.70
E. 0.90

2. The manager of a new soccer academy polled its 200 players to select the color for their uniforms. The results of the poll are shown in the relative frequency table below. Which of the following is the number of soccer players who selected white for their uniform?
A. 8
B. 16
C. 32
D. 168
3. Ayana surveyed a group of students about a new uniform proposal for her school. The students were asked if they were in favor, not in favor, or undecided about the uniform proposal. She wants to describe the frequency of students surveyed who are not in favor relative to the total number of students surveyed. Which of the following would readily provide the information Ayana seeks?
A. A frequency table
B. A relative frequency table

4. A teacher asked the 20 students in a classroom which season of the year they liked the most. The results of the survey are shown in the relative frequency table below. Notice that the information about students who liked fall the most is missing. Which of the following is the frequency of students who liked fall the most?
A. 8
B. 10
C. 12
D. 20
E. 40