Save your data, even if your project is half-baked and you think you’re done with it. Back in 2014 I got access to an interesting dataset and used it to come up with some potentially useful conclusions about valuing homes for sale. For whatever reason, today I remembered the work and went to dig up the data and project files to refine the queries and create a model… only to discover the data is permanently gone.

Last week I recycled several old laptops. I’d long since backed up anything on them – or so I thought. But now I recall I’d done this little project with MySQL and left the work in the database on one of the laptops. The hard drives have been sledge-hammered. Before I forget all the details I need to write them down.

The Data

This was listing data with final asking price, recorded sales price, and property information typically found in a listing. Something like:

ZIP | Listing Price | Sale Price | Beds | Baths | Sq Ft | Lot Size | Heat | Air | Time on Market | Storeys

There were about 1450 records, all from Minneapolis.

You could now get similar data from Zillow (though not under their terms of service) or from MLS data (but not the public variety, as far as I can tell). You could reconstruct a lot of it by getting sales prices from the county property information and combining them with public MLS data. Lining up the last listing price with the matching sales event and price might be tricky in some cases. Some final sales prices aren’t recorded correctly, and sometimes you see refinancing or foreclosure sales in the data.

The Project

At the time I got this data, I was thinking of how to flag under-priced homes – typically foreclosures – to order a search for the best deals. There were more fields than I have shown above. Think of everything available in an MLS or Zillow listing. One type of “good deal” might be a house that has some hidden potential not captured in the asking price: Could some changes be made cheaply to make it more valuable? And the reverse would be important too. Does the price on paper go way over what it ought to be worth to an investor?

Anyway, I wanted to discover any interesting correlations between the house price and features of the house. For instance, does one bathroom vs. two bathrooms really affect the price of an otherwise similar house – same neighborhood, same finished square feet and so on?

One notion I had was to find properties that, but for a missing bedroom or bathroom or something else, would be much more valuable. Like, could I find a two-bed, one-bath place, add a third bedroom, and add tons of value to the property in the process? (That’s assuming the addition was easy – such cases do exist.)

After importing the data into MySQL, I grouped homes by zip code, weighted price by square feet, then looked for average price differences between, say, one- and two-bedroom houses. I checked bedrooms, bathrooms, storeys and (I think) central air. It’s too bad I lost the tables. You could see very noticeable differences between one- vs. two-bedroom homes, and two- vs. three-bedroom houses. Other things didn’t make much difference. The largest difference came from a second or third bathroom: around $30,000 in price for a home of the same square footage.
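Here is a minimal sketch of that kind of tabulation, in Python rather than the original MySQL queries (which are gone). It assumes "weighted by square feet" means price per square foot. The field names and the handful of records are made up for illustration; they are not the 2014 data.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only; NOT the lost 2014 dataset.
listings = [
    {"zip": "55406", "beds": 2, "sqft": 1000, "list_price": 200_000},
    {"zip": "55406", "beds": 2, "sqft": 1200, "list_price": 228_000},
    {"zip": "55406", "beds": 3, "sqft": 1100, "list_price": 242_000},
    {"zip": "55407", "beds": 2, "sqft": 900, "list_price": 153_000},
    {"zip": "55407", "beds": 3, "sqft": 1000, "list_price": 190_000},
]

def avg_price_per_sqft(rows, feature, price_field):
    """Average price per square foot for each (zip, feature value) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["zip"], row[feature])].append(row[price_field] / row["sqft"])
    return {key: mean(values) for key, values in groups.items()}

by_beds = avg_price_per_sqft(listings, "beds", "list_price")
for (zip_code, beds), ppsf in sorted(by_beds.items()):
    print(f"{zip_code}  {beds} bed  ${ppsf:,.0f}/sq ft")
```

Swapping "beds" for another field (baths, storeys) gives the other comparisons; comparing rows within the same zip code is what keeps neighborhood effects out of the difference.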

But why is this interesting? Isn’t this kind of what you’d expect? Well, the question you need to ask is: which price am I talking about? These price differences were all in the asking / listing price, not the actual sales price. When you do the same tabulations on sales price, you notice that after weighting by square feet there’s nearly no correlation between house features and price in the dataset. Naturally, larger homes will have more bedrooms and bathrooms. But a similarly sized house with two rather than three bathrooms won’t on average sell for less.
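To make the listing-versus-sale comparison concrete, here is a small Python sketch that computes the same per-square-foot gap twice, once on each price. The rows are invented and deliberately rigged to mimic the pattern described above, so they demonstrate the calculation and prove nothing about real homes.

```python
from statistics import mean

# Made-up rows for illustration only, all in one zip code:
# (beds, sqft, list_price, sale_price)
rows = [
    (2, 1000, 200_000, 205_000),
    (2, 1200, 240_000, 246_000),
    (3, 1000, 230_000, 204_000),
    (3, 1200, 276_000, 247_200),
]

def premium_per_sqft(rows, price_index, low=2, high=3):
    """Mean $/sq ft of `high`-bedroom homes minus that of `low`-bedroom homes."""
    def avg(beds):
        return mean(r[price_index] / r[1] for r in rows if r[0] == beds)
    return avg(high) - avg(low)

list_gap = premium_per_sqft(rows, price_index=2)  # gap in asking price
sale_gap = premium_per_sqft(rows, price_index=3)  # gap in recorded sale price
print(f"listing-price gap: ${list_gap:.2f}/sq ft")
print(f"sale-price gap:    ${sale_gap:.2f}/sq ft")
```

A gap that shows up in the first number but not the second is the signature described here: sellers price the extra room in, and buyers don’t pay for it.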

So, it may not pay to add an extra bedroom or bathroom if you’re looking to increase your selling price. I didn’t have good data on kitchens; maybe a remodeled kitchen can matter, but being a very expensive type of improvement it probably doesn’t pay off. And $30K for a bedroom or bathroom may be on the low side of what it would cost to add one, so those additions probably wouldn’t pay either, even if sales prices matched asking prices, which they don’t. I have seen cases where converting a room that wasn’t listed as a bedroom into a legal extra bedroom would cost under $5,000. But given my findings this wouldn’t be worthwhile, and certainly not a money-maker. I’d love to have the data so I can double check this conclusion.

I wasn’t primarily interested in advising house-flippers though. I wanted to predict sales prices to find where the asking price was way too low or way too high, or to assess whether it was fair. If the listing price is too low you might offer the asking price immediately and not go a dollar under, even if that wouldn’t be typical in the current market. If it’s way too high you could feel confident in making a low-ball offer. And, if there were tons of listings to sift through, the sales price predictions could help you focus on the best potential deals first.
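The triage described above can be sketched in a few lines of Python. Everything here is hypothetical: the property IDs, the prices, and the 10% threshold are arbitrary assumptions, and the predicted prices would come from whatever model you trust.

```python
# Hypothetical listings: (property_id, asking_price, predicted_sale_price).
# All values are made up for illustration.
candidates = [
    ("A", 150_000, 180_000),
    ("B", 210_000, 200_000),
    ("C", 300_000, 240_000),
    ("D", 120_000, 125_000),
]

def rank_deals(candidates):
    """Sort listings so the biggest discount to predicted price comes first."""
    scored = [
        (pid, (predicted - asking) / predicted)
        for pid, asking, predicted in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def label(discount, threshold=0.10):
    """Rough triage; the 10% threshold is an arbitrary assumption."""
    if discount >= threshold:
        return "under-priced: offer fast"
    if discount <= -threshold:
        return "over-priced: low-ball"
    return "about fair"

for pid, discount in rank_deals(candidates):
    print(f"{pid}: {discount:+.1%}  {label(discount)}")
```

Sorting by discount is what lets you work a long list from the top down instead of reading every listing.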

The model I came up with pretty much only required square feet of finished space and zip code to spit out a predicted sales price. The rest of the data turned out to be mostly useless. However, this wasn’t satisfying because the variance was quite high in the sales price data to begin with. I could get within +/-15% of the actual price or so (I wish I still had the data to check this).
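One simple model of that shape, sketched in Python as an assumption rather than a reconstruction of the original: take the median sale price per square foot in each zip code and multiply by the finished square feet. The sales below are made up, and the error is measured on the same rows used for fitting, so a real check would need held-out sales.

```python
from collections import defaultdict
from statistics import median

# Made-up past sales for illustration only: (zip, finished_sqft, sale_price).
sales = [
    ("55406", 1000, 210_000),
    ("55406", 1200, 240_000),
    ("55406", 1500, 330_000),
    ("55407", 900, 153_000),
    ("55407", 1100, 198_000),
    ("55407", 1300, 247_000),
]

def fit(sales):
    """Median sale price per square foot for each zip code."""
    ppsf = defaultdict(list)
    for zip_code, sqft, price in sales:
        ppsf[zip_code].append(price / sqft)
    return {zip_code: median(values) for zip_code, values in ppsf.items()}

def predict(model, zip_code, sqft):
    """Predicted sale price: the zip's median $/sq ft times finished sq ft."""
    return model[zip_code] * sqft

def median_abs_pct_error(model, sales):
    """Median of |predicted - actual| / actual over the given sales."""
    errors = [abs(predict(model, z, s) - p) / p for z, s, p in sales]
    return median(errors)

model = fit(sales)
print(f"predicted: ${predict(model, '55406', 1400):,.0f}")
print(f"median error: {median_abs_pct_error(model, sales):.1%}")
```

The median is used instead of the mean so that a few odd sales (foreclosures, mis-recorded prices) don’t drag the per-zip figure around.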

Finally I realized there was some important information the model lacked that might explain the seemingly random variance in prices: The condition of the house. Pretty obvious when you think about it. If the place needs extensive repairs – especially if the city is going to require them – that will bring down the price. If the house is pristine on the other hand it will show well and get a higher than average price.

Well, I was familiar with a source for that information. Most cities have housing inspection programs. In Minneapolis it is the “Truth in Sale of Housing” (TISH) report system. When a house is listed for sale a TISH report must be available online. An inspector – hired by the seller – visits the property before listing and fills out the report. These filled-out forms are posted online. Having read through thousands of these I can say they give you a good feel for the state of the home. Sometimes you have to read between the lines – not that the inspector is attempting to conceal issues, more that you learn the tell-tale signs of a messed up house.

To make a useful predictive model for finding bargains you would need to incorporate the seller’s inspection. To do that you’d need to develop a scoring system. I did this by hand. Certain issues will scare off the typical buyer but are really not expensive to correct, so you don’t deduct many points for those. For others, like cracked foundations, you subtract a lot. From experience you know whether a problem with the plumbing or electricity will require replacement or not.
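A hand-built scoring system like that can be as simple as a table of deductions. This Python sketch shows the idea; the issue names and point values are assumptions made up for illustration, not the weights actually used in 2014.

```python
# Hypothetical deductions; issue names and point values are assumptions
# for illustration, not the weights used in the original project.
DEDUCTIONS = {
    "peeling_paint": 1,           # scares buyers, cheap to fix
    "missing_smoke_detector": 1,  # cheap to fix
    "knob_and_tube_wiring": 8,    # likely needs replacement
    "galvanized_plumbing": 6,     # likely needs replacement
    "cracked_foundation": 15,     # expensive structural repair
}

def condition_score(issues, start=100, unknown_penalty=2):
    """Start from a perfect score and subtract points per flagged issue."""
    score = start
    for issue in issues:
        # Issues missing from the table get a small default deduction.
        score -= DEDUCTIONS.get(issue, unknown_penalty)
    return max(score, 0)

cosmetic = ["peeling_paint", "missing_smoke_detector"]
serious = ["cracked_foundation", "knob_and_tube_wiring", "peeling_paint"]
print(condition_score(cosmetic))  # 98
print(condition_score(serious))   # 76
```

The judgment lives entirely in the table: two houses with the same number of flagged items can land far apart depending on which items they are.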

By examining dozens of reports and comparing them to the sales prices of the properties back in 2014 when I did this work, I decided the property’s condition explained around eighty percent of the seemingly random price differences in the sales data. My manual scoring model was likely not optimal – a better model might predict the prices even more accurately. My system downloaded the reports (PDFs), parsed them into the most structured form I could, then applied my crude scoring system. I did that in Ruby.

The next step, and what I wanted my old data for, was to try to build a simple machine learning model with the inspection data using modern tools. Given the sales prices – I could link them to inspections – I could get the computer to decide how to score items in the inspection instead of making educated guesses. I had downloaded the inspections that were current for the sales data I had (these seem to be lost too). The idea was: plug in a zip code, listing price and property ID, and the software would download the report, read it and give an adjusted valuation. If you wanted to get really fancy it could connect to the county property information and check for back taxes owed. And I could get a good measurement of the model’s accuracy.
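As a rough sketch of "letting the computer decide the scores", the Python below fits one weight per inspection item by ordinary least squares. It assumes each report has been reduced to 0/1 flags and that the target is the gap between the actual sale price and a baseline prediction, in thousands of dollars. The issue names are hypothetical, and the data is synthetic, built from known weights so you can see them recovered. A regression library would do the fitting for you; plain Python is used here only to keep the example self-contained.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept term
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical issue flags per report, in this column order.
ISSUES = ["cracked_foundation", "knob_and_tube_wiring", "galvanized_plumbing"]
reports = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0)]
# Synthetic targets ($k): actual sale price minus baseline prediction,
# generated from weights of -20, -8 and -5 with no noise.
residuals_k = [-20.0, -8.0, -5.0, -28.0, -13.0, 0.0]

weights = least_squares(reports, residuals_k)
intercept, learned = weights[0], dict(zip(ISSUES, weights[1:]))
for issue, w in learned.items():
    print(f"{issue}: {w:+.1f}k")
```

With real data the fit would be noisy and would need many more reports than issue types; the solver here also does no checking for issues that never vary across reports, which would make the system unsolvable.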

Start-Up / Software As a Service Idea

Could this software, if built, make a profitable service?

  • How much would it be worth to users?
  • How hard would it be to customize for each city / region?
  • Would the information be useful in all market conditions?
  • What types of property would this software help you find?

In today’s market with very few listings, there’s little value in eliminating bad deals. It just doesn’t take long to see everything that’s for sale and figure it out for yourself. So, this system would be much better in a large market that’s active.

The TISH reports flag possible problems, required repairs and code compliance issues. You’ll see the most problems at the bottom end of the market or on foreclosed properties where the seller didn’t bother to fix things up before selling. So the system is probably most useful in a 2009-2013 type of real estate environment.

A big limitation for this idea is that you’d have to customize for cities and counties. Every government will do things slightly differently and may not have the same rigorous inspections program Minneapolis does.

Ideally you’d have access to an MLS feed and customize for every large city but that wouldn’t be cheap.

I do think some semi-automated inspection review system could be valuable for an individual buyer. Large real estate investment companies may not want to bother with discovering one extra good deal here or there but a small investor will.

I’d love to know if the Zillow “Zestimate” uses any city inspection data. I doubt it and they don’t claim to. They claim to have a two percent median error rate on predicting final sales prices in the Minneapolis / St. Paul area. If you look at only foreclosed properties I’d guess their error rate would be a lot higher. They say they use assessor records and direct MLS feeds.