Using IPUMS Data and SQL to Constructively Speculate on Alternate History

I start off with a little history and some speculation, then use IPUMS data for a reality check. In this post I assume you have some familiarity with SQL, but not necessarily IPUMS. If you know SAS, SPSS or Stata or R you can get the same results.

I hope I have explained enough about my IPUMS data sources to allow you to make sense of the results. What’s IPUMS? Finally, I consider a couple more sophisticated applications of IPUMS data to the topic.


Recently I heard a story on the radio “Losing Ground” (from the excellent “Reveal” investigative journalism podcast) about an African-American career military veteran and farmer in North Carolina and his struggle to keep his farm after the local Farm Service Agency treated him extremely unfairly. Basic assistance that was routinely given out to white North Carolina farmers was denied to him.

In his case a couple of employees seemed to have intentionally tried to withhold loans as he was starting up the farm, dragging out the process terribly. Then, after he got approved, the same officials delayed releasing the money and didn’t warn him of the delay.

Without the money the farmer had planned on – based on the loan approval – he couldn’t keep a lot of his hogs alive over the winter. After that blow it was nearly impossible to recover. Nevertheless, a few years later, he was actually doing all right… Until the FSA changed his loan terms without his knowledge. He didn’t find this out until it was too late.

In 2016 his farm was foreclosed and taken along with his house and even his pet dogs. The property was later auctioned off to a neighbor. The case is extremely well documented; if you have any doubts, listen to the podcast linked above.

This sort of experience mirrors the complaints raised in the Pigford class action lawsuit, settled in 1999 with USDA. Even with this settlement several years in the past, the FSA in North Carolina didn’t quit its discrimination, behaving as if the settlement never existed.

Origin of “Forty Acres and a Mule”

The experience in the story strikes a nerve, as black farmers have been contending with broken promises from the U.S. government ever since the end of the Civil War. The story referenced the “Forty Acres and a mule” concept circulating at the end of the war. There was discussion on how to help newly freed slaves in general, but the “Forty Acres” was never a policy put into practice, though it could have been, had Lincoln lived. The idea likely spread from a specific military order by General Sherman.

In Georgia in the Winter of 1864-1865, General Sherman met with community leaders of freed slaves to find out what help they needed and what they wanted the army to do in the aftermath of the war. (Short answer: provide land.) Immediately following the meeting Sherman signed Field Order 15 giving some forty-thousand freed slaves land along the southeast coast, each plot forty acres per household. No mules were given, but army mules were loaned out.

The newly freed people held onto the land only briefly: Andrew Johnson reversed the order in automn of 1865. Had Lincoln not been assasinated with Johnson replacing him and interfering with the Freedmen’s Bureau, , substantial reparations in the form of land reform might actually have been made. Before the reversal of the order the new Freedman’s Bureau had authority to give out forty acres to freed slaves, not limited to those covered under Sherman’s field order. So it’s reasonable that people were expecting “forty acres and a mule” to be a fair approximation of what the government would do as part of reconstruction.


How would the U.S. look without the legacy of Booth and Johnson? How much smaller would the wealth gap between African-American and white households be today?

Thinking about this question quantitatively is interesting. What are forty acres worth today? How much could have realistically have been distributed? How many households would have qualified for such a policy? By answering these questions with data from the Census we could establish a baseline amount for a reparations program.

In the next section I show how to get some answers with IPUMS data.

Probably a lot of the land would have come from new homesteads on the frontier, not the established farms in the south, since granting frontier land rather than confiscation would have been the expedient political choice.

Though, if reconstruction were actually carried out the way it started, who knows. Part of the purpose of the original field order was to punish Southern planters. As such, redistribution of land from plantations might have been extensively carried out.

Either way, land ownership creates some political power: If large scale land grants had been made the ability of whites to curtail new black political power would have been reduced. If these land grants were primarily in the plains states, settlement patterns would have developed differently. Migration from the South would have occurred somewhat earlier than it did, and now we’re talking alternative history.

Using IPUMS Data

To get at the total number of possible forty acre land grants, I used data from IPUMS. The system allows users to request specific census questions from census micro-data (individual records.) Users then download that data and perform statistical analysis on it, often as simple as cross tabulating two variables. This can be extremely powerful when you are just exploring the data, you don’t know which published Census tables you want, or no such tables were ever published.

I wanted to establish how many people could have qualified for the grants of land if they had been universal for freed people. Then I tried to estimate a lower and upper bound for how many formerly enslaved men might have been in positions to have taken advantage. (From my reading, any such policy would have applied mostly to men heading households of four people.) For context I included the white male population from the same states.

Data comes from the 1%U.S. IPUMS sample for 1870 and the 2013 American Community Survey, (“US1870a”, and “US2013a”) I used the IPUMS variables PERWTREG, STATEFIP, RACE, HISPAN, SEX,, RELATE, IMPREL and AGE.

To produce the tables in this post I queried a MySql database full of IPUMS data. I made extracts from the IPUMS data extraction system (go to the IPUMS-USA site, select “browse and select data.”) After selecting variables and datasets you will need to log in to download data. If you don’t have an account you will need to create an account first and fill out a short registration form. Registration is completely free.

After downloading the data extracts I loaded them into MySql. Here’s one way to do this with Ruby

If you’re comfortable with stats programs like Stata, SPSS or SAS just make an extract and request data in one of those formats. Also you may find the IPUMS-R useful for R. In my case I know SQL so that’s what I used.

You might also request data in “CSV” format and load it into a database or manipulate it with Python and Pandas.

SQL and Data Details

A quick word about my schema: There are two variable types in USA data, “person” and “household”. Think of these types as belonging to two different tables. One household can have some people and each person belongs to one household. In my queries you see the use of implicit inner joins between these two tables.

The data is actually stored hierarchically inside the extract system. Typically when you make a data extract you ignore this detail and repeat household variables for every person in the data, (this is the default) though users do have an option to download data in the hierarchical format.

I chose to make a hierarchical extract and use the Ruby ‘hflr’ gem to load the data into MySql, putting the household records in the ‘households’ table and the person records into the ‘people’ table. The data tables are prefixed with the name of the IPUMS dataset they came from, so there is one data table per IPUMS dataset and record type.

The “sum(PERWTREG)” column in the select clause gives the number of people matching our queries. The PERWT and PERWTREG variables hold the person’s “weight” value, that is, the number of actual people they represent. The data in these tables is from one percent samples of census data, so each person represents approximately one hundred people. We don’t simply return “count(*) *100” because some people may be under or overweighted, depending on how the sample of the data was constructed. Certain small communities may be “over sampled” so that too many individuals aren’t completely lost; we might sample every twenty people and then give them a weight of twenty. This technique allows for representation of small groups without skewing the totals.


To start with, for some context, here’s the over all population in 1870 broken down by race, for the former confederacy and all other states:

Union States:

	select  round(sum(perwt/100)), round(race/100) as race_gen 
	from  us1870a_people,us1870a_households 
	where = us1870a_people.household_id and	
		STATEFIP not in (1,5,28,37,12,13,45,47,21,48,22,51,29) 
	group by  round(race/100);
  • (I’ve omitted the code to replace the RACE_GEN codes with their labels) RACE is represented in IPUMS as numbers, as are most other variables. See the variable documentation for mappings from codes to their labels.
  • The STATEFIP is a coded variable for states numbered alphabetically. The list in the query is the set of codes for former Confederate states, so here I’m excluding them.

|              25220127 |        WHITE 		|
|                 61383 |        CHINESE 	|
|                605322 |        BLACK 		|
|                 23518 |        NATIVE 	|
|                   929 |       OTHER	 	|

Former slave states:

	select round(sum(perwt/100)), round(race/100) as race_gen 
	from us1870a_people,us1870a_households 
	where = us1870a_people.household_id and	
		STATEFIP in (1,5,28,37,12,13,45,47,21,48,22,51,29) 
	group by round(race/100);

Same as the last query except include only ex-Confederate states.

|               4262382 |        BLACK 		|
|               8229786 |        WHITE 		|
|                   748 |       OTHER 		|
|                  3012 |        CHINESE 	|

Notice the abssence of a Hispanic / Latino category. This is captured by the HISPAN IPUMS variable; most Native Americans aren’t counted here either. At the time (1870) the census enumerator recorded race of the person as he judged it. These numbers categorize people by the original race question; for comparison to modern race responses – which are self reported and recently allow multiple responses – IPUMS has other variables. See the full RACE documentation.

Now to narrow down the population to answer the original question, considering all male heads of household between ages 22 and 60 in 1870, subtracting ones from non-slave states and removing a few too young or old to benifit:

	select round(sum(perwt/100)), round(race/100) as race_gen 
	from us1870a_people,us1870a_households 
	where = us1870a_people.household_id and	
		AGE > 22 and SEX = 1 and	 AGE < 60 and
		IMPREL = 1 and
		STATEFIP in (1,5,28,37,12,13,45,47,21,48,22,51,29) 
	group by round(race/100);

The IMPREL variable is an imputed version of the IPUMS RELATE variable. The ‘1’ value means the person is the head of the household. In 1870, 1860 and 1850 we must impute the person’s relationship to head of household since the question wasn’t asked until 1880. (This imputation is an interesting machine learning problem.) For determining head vversus non-head status of people it’s extremely accurate.

|    557178 |        Black |
|   1138195 |        White |

Some new households would have been formed had the land re-distribution been implemented; so we may wish to account for those living as workers on someone else’s land and siblings and grown children of African-American heads of household. And if we expand the upper age range – other relatives could have worked the land and helped the older owner – we get some more. So, we’ll just drop the head of household (IMPREL=1) restriction. For all black men between ages 22 to 70 in 1870 – meaning they would have been 18 to 66 in 1866 – we find:

	select round(sum(perwt/100)), round(race/100) as race_gen 
	from us1870a_people,us1870a_households 
	where = us1870a_people.household_id and	
		AGE > 22 and SEX = 1 and  AGE < 70 and
		STATEFIP in (1,5,28,37,12,13,45,47,21,48,22,51,29) 
	group by round(race/100);
|    750699 |        Black |
|   1589105 |        White |

Now imagine some number of people between the lower to upper number got that land. Assuming that wealth had been held as land what would it be worth today? This isn’t too unreasonable; even if many farmers sold out early the same wealth could have been used to start small businesses, which well might have appreciated much more than farm land.

We first need to find how many African-American households there are today. To get a reasonable breakdown of household by race we need to include the HISPAN variable mentioned earlier.

	select sum(perwtreg),round(race/100) as race_gen, if(hispan>0,'Yes', 'No') as HISP  
	from us2013a_people 
	where relate=101 group by race_gen,HISP order by race_gen asc;

Keep in mind these are head of household people only,(RELATE=101) we’re ignoring the rest of the population. That’s so that we can get counts of households, which is how a lot of income and wealth data is measured. We want to know how many households to project the original hypothetical land grants onto. Hispanic origin isn’t in the RACE variable as it spans many races and is asked as a separate question in the ACS. HISP of ‘Yes’ means any hispanic origin; there is more detail in the original HISPAN variable.

| sum(perwtreg) | race_gen | HISP |
|      80609169 |        1 | No   |
|       9597922 |        1 | Yes  |
|      13855470 |        2 | No   |
|        308541 |        2 | Yes  |
|        430170 |        3 | No   |
|         30269 |        3 | Yes  |
|       1514725 |        4 | No   |
|        107113 |        4 | Yes  |
|          5592 |        5 | Yes  |
|        336263 |        5 | No   |
|         33582 |        6 | Yes  |
|       2834102 |        6 | No   |
|       3700287 |        7 | Yes  |
|        796096 |        7 | No   |
|        373202 |        8 | Yes  |
|       1399291 |        8 | No   |
|        212008 |        9 | No   |
|         70274 |        9 | Yes  |
20 rows in set (1.01 sec)

Key to the RACE_GEN categories:

1	White	
2	Black/African American/Negro
3	American Indian or Alaska Native
4	Chinese
5	Japanese
6	Other Asian or Pacific Islander	
7	Other race, nec
8	Two major races
9	Three or more major races

Valuing the Land Today

Looking at average prices of farmland including equipment per acre we get an average of $4100 for 2016, with some land in the cornbelt up to $8000 and mountain states as low as $1800. At $4100 forty acres are worth $164,000. The top number of possible recipient households, based on the tables from 1870 would be 750,000, giving us a total of 123 bilion dollars worth of land. This assumes all the land remained in agricultural use and had an average value; a good portion may have gotten developed, in the process getting bought at higher than farm land values.

spread out over all households with descendants of enslaved people (estimate 13,800,000 from the race by household table) from the 2013 American Community Survey , we get an addition to net worth of $8913.

For the median black household in 2016 this would substantially enhance their net-worth which stood at $17,600; but as the median white household net worth was over 171,000 they would still lag far behind whites even when controlling for education, spending habits, employment. Here’s the data source for the report with slightly newer but similar results from the Fed.

More Speculation

Of course, the land would have had a more important effect than acting as an investment; and farmland – not to mention real estate in general until the latter half of the twentieth century – is a poor investment. Data going back to 1890 shows land doesn’t appreciate more than the general rate of inflation. However, owning land gives you leverage; you can borrow against it; it gives you a safety net to subsist on in lean times. Fundamentally it gave people in the nineteenth century a base to begin accumulating wealth, rather like owning a home does today, but more effectively than modern home ownership, since the land generated income.

Modern home ownership is more like a mediocre savings account you can live inside of. Nevertheless, at the median, home ownership makes up about two-thirds of wealth for whites, so land ownership is very relevant.

To get an idea of how much the forty acres might have helped, beyond simple contributions to net worth, we’d like to look at the effect net worth has on following generations. At the time the per-household effect of the land would have been quite large, presumably. Unfortunately it’s hard to isolate the effect of wealth on future generations

It might be possible to simulate the fortunes of people, given a financial starting point. Also, a brute-force method of examining all census records and following social mobility of individuals and families could be used. Some work has been done in this area both in estimating general rates of social mobility and looking in particular at the U.S. during the time we’re considering.

Linking Census Data Across Decades

Datasets of people taken from the U.S. and British census since 1850 have been linked together across decades to produce longitudinal datasets that allow researchers to track mobility across several generations. From 1850 to 1880 mobility was particularly high in the U.S. Given a better starting point African-Americans could have shared in that mobility. By sampling black families from the linked data we could adjust for the effects of discrimination, outright theft and violence perpitrated against African-Americans (and perhaps better quantify it to begin with.)

IPUMS has linked census data based around the full 1880 Census. The longitudinal IPUMS data is actively being added to and improved. Later releases will include many more people because full versions (not samples) of all decades from 1850 to 1940 will be used to search for links. And since there will be many more people, there will be enough to produce datasets linked across many decades, allowing one to trace children and grandchildren and calculate mobility rates.


Ssocial status turns out to be about 75% inherited. While wealth isn’t exactly status, an injection of wealth after the war would have provided a huge head start to ex-slaves to raise their status , which had been artificially low to start with.

A monte Carlo method would be worth a try. If we inject this wealth (40 acres per household) in 1865 across 750,000 households and apply the same trends as were seen in black households with that level of wealth to the present we may get a realistic lower bound value of what was lost with Johnson’s revocation of Field Order 15 and his interference with the Freedmen’s Bureau.