The following glossary should help decode terms used in IPUMS documentation and source code, as well as discussions to follow on this blog.
These terms describe important concepts in IPUMS, most of which parallel more commonly used terms in software development. IPUMS uses concepts and vocabulary from the demographics and survey data domains.
Data Product, or “product”
A product consists of a family of more-or-less comparable datasets typically drawn from one original source. For instance , the USA product contains samples of the U.S. census data from 1850 to present. Additionally, to make it “IPUMS” data, specialist researchers have “harmonized” the data to maximize comparability between datasets within the product.
A collection of records of one or more type. For instance the U.S. 1990 census 1% sample dataset contains a one-percent sampling of all census records from the U.S. 1990 census. If the dataset contains more than one record type the dataset retains a logically hierarchical structure, as all records, if not the “root” record type, belong to other records. An IPUMS record holds responses to questions from a survey or census, as well as many computed values based on those responses.
Record, Record Type
All records have a type. Think of the type as a unit of analysis – person records store information pertaining to an individual person, household records hold information about a single household which typically has many person records linked to it, and person records in most datasets may belong to a household record. Some data products have datasets with nothing but person records, though most have more complex data. Health data may have datasets with many types of records, such as injury records recording information about specific injury events experienced by a specific person, doctor visit records, and many more detailed types of records. These records would belong to one person record. In relational database terms a record corresponds to a row in a table and a record type corresponds to a table type or definition.
The set of possible responses to a given question or computed value and all cases of those responses; essentially a column from a record set in a dataset. Variables possess many additional attributes such as a coding structure (codes for responses and labels for the codes), short labels, names, lengthy descriptions and more, including a “universe”.
Criteria under which a variable has a possible valid value within a given dataset. For instance, the universe of the SCHOOL variable depends on the value of the AGE variable in the U.S. census. People not of the valid school age (5 to 18 approximately) don’t get their school attendance measured.
Variables with a set number of values have “categories,” – a set of values and labels for those values. For instance SCHOOL has three categories: 1: Yes, in school, 2: No, not in school, 3: Not applicable / out of universe.
Within a data product, comparability means the degree of similarity between two or more datasets in terms of variable coding, presence of the same variables in two or more datasets, and similarity of variable universes between datasets. Differences in data collection practices can contribute to degrees of comparability: For instance, do respondents fill out a questionnaire, do they talk to a census worker in person, or do they respond over the phone or internet?
From an RDBMS point of view, think of products as loosely analogous to a schema, a dataset to a set of connected tables, a record set to a table and variables to columns in those tables.