1.2: How Data are Classified: Categorical Data

JoVE Core
Statistics

Data, the collected observations and measurements in a study, form the basis for all statistical analyses and inferences.

Data can be classified based on whether or not they can be measured. For example, consider different hair colors. One cannot measure hair color in liters or kilometers, but one can group hair colors into categories such as black, brunette, or red.

Such data are called categorical data or qualitative data; the values themselves cannot be measured numerically, but they can be labeled or sorted into different categories.

Another example is human blood, which is grouped into four different types: A, B, AB, and O.

In certain cases, categorical data can be ordered in a particular fashion; such data are called ordinal data. For example, the sizes of coffee cups (small, medium, large) or the heights of trees in a forest (short, medium, tall) can be arranged in order of increasing size.

A variable, usually denoted by a capital letter such as X or Y, is a characteristic or measurement that can be determined for each member of a population. Data are the actual values of variables. They may be numbers, or they may be words. A datum is a single value.

Data are classified based on whether or not they are measurable. Categorical data cannot be measured; instead, they can be divided into categories. For example, if Y denotes a person's party affiliation, possible values of Y include Republican, Democrat, and Independent, so Y is a categorical variable. Categorizing a population based on hair color, age group, sex, or blood group also produces categorical data.
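Because category labels cannot be measured, the usual way to summarize them is to count how many members fall into each category. The following is a minimal Python sketch of that idea; the sample values and the name `affiliations` are invented for illustration and do not come from any real survey.

```python
from collections import Counter

# Hypothetical sample: party affiliation (the variable Y) for six people.
affiliations = [
    "Democrat", "Republican", "Independent",
    "Democrat", "Independent", "Democrat",
]

# Categorical values cannot be averaged or subtracted, but they can be counted.
counts = Counter(affiliations)

# Print each category with its frequency, most frequent first.
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

The frequencies are numbers, but the data themselves remain labels: there is no meaningful "average" of Democrat and Republican.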

In some cases, categorical data can be ordered in a particular fashion; such data are called ordinal data. Consider a list of the top five national parks in the United States. The parks can be ranked from one to five, but the differences between the ranks are not measurable. Another example is a cruise survey where the responses to questions about the cruise are "excellent," "good," "satisfactory," and "unsatisfactory." These responses are ordered from the most desired response to the least desired. However, the difference between any two responses cannot be measured.
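The cruise survey example can be sketched in Python by attaching an explicit order to the labels. This is only an illustration: the list `responses` is invented, and the names `ORDER` and `rank` are hypothetical helpers, not part of any library.

```python
# Survey responses from the text, listed from most to least desired.
ORDER = ["excellent", "good", "satisfactory", "unsatisfactory"]

# Map each label to its position in the ordering (0 = most desired).
rank = {label: position for position, label in enumerate(ORDER)}

# Hypothetical responses from five passengers.
responses = ["good", "unsatisfactory", "excellent", "satisfactory", "good"]

# Sorting is meaningful because ordinal data have a defined order.
ordered = sorted(responses, key=rank.__getitem__)
print(ordered)

# The middle response is also meaningful, since it depends only on order.
median_response = ordered[len(ordered) // 2]
print(median_response)
```

The positions in `rank` record order only. Subtracting them (for example, treating "good" as exactly one unit worse than "excellent") would assume a measurable distance that ordinal data do not have.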

This text is adapted from Openstax, Introductory Statistics, Section 1.1 Definitions of Statistics, Probability, and Key Terms