Beating another dead horse.
Although I live in Copenhagen and mostly walk or ride my bicycle in order to get around town, I do own an old car for getting around the rest of the country. In Denmark, cars go through mandatory official inspection every other year, and I've been through a few of these in my life. A few years ago, the mechanic doing the inspection informed me that my car's chassis number was incorrect.
This did make me a bit nervous, because I'd bought the car used, and I was suddenly concerned that things weren't really as I thought. Had I unwittingly bought a stolen car?
But the mechanic just walked over to his computer in order to correct the error. That's when a different kind of unease hit me. When you've programmed for some decades, you learn to foresee various typical failure modes. Since a chassis number is an obvious candidate for a natural key, I already predicted that changing the number would prove to be either impossible, or have all sorts of cascading effects, ultimately terminating in official records no longer recognizing that the car is mine.
As it turned out, though, whoever made that piece of software knew what they were doing, because the mechanic just changed the chassis number, and that was that. This is now five or six years ago, and I still own the same car, and I've never had any problems with the official ownership records.
The reason I related this story is that I'm currently following an undergraduate course in databases and information systems. Since this course is aimed at students with no real-world experience, it wisely moves forward in a pedagogical progression. In order to teach database keys, it starts with natural keys. From a didactic perspective, this makes sense, but the result, so far, is that the young people I work with now propose database designs with natural keys.
I'm not blaming anyone. You have to learn to crawl before you can walk.
Still, this situation made me reflect on the following question: Are natural keys ever a good idea?
Let's consider an example. For a little project we're doing, we've created a database of the World's 50 best restaurants. My fellow students suggest a table design like this:
CREATE TABLE Restaurants ( year TEXT NOT NULL, rank TEXT NOT NULL, restaurantName TEXT NOT NULL, cityName TEXT NOT NULL);
Granted, at this point, this table definition defines no key at all. I'm not complaining about that. After all, a month ago, the students probably hadn't seen a database table.
From following the course curriculum, it'd be natural, however, to define a key for the Restaurants
table as the combination of restaurantName
, cityName
, and year
. The assumption is that name and city uniquely identifies a restaurant.
In this particular example, this assumption may actually turn out to hold. So far. After all, the data set isn't that big, and it's important for restaurants in that league to have recognizable names. If I had to guess, I'd say that there's probably only one Nobelhart & Schmutzig in the world.
Still, a good software architect should challenge the underlying assumptions. Is name and city a natural key? It's easy to imagine that it's not. What if we expand the key to include the country as well? Okay, but what if we had a restaurant named China Wok in Springfield, USA? Hardly unique. Add the state, you say? Probably still not unique.
Ensuring uniqueness is only the first of many problems with natural keys. You may quickly reach the conclusion that for a restaurant database, a synthetic key is probably the best choice.
But what about 'natural' natural keys, so to speak? An example may be a car's chassis number. This is already an opaque number, and it probably originates from a database somewhere. Or how about a personal identification number? In Denmark we have the CPR number, and I understand that the US Social Security Number is vaguely analogous.
If you're designing a database that already includes such a personal identification number, you might be tempted to use it as a natural key. After all, it's already a key somewhere else, so it's guaranteed to be unique, right?
Yes, the number may uniquely identify a person, but the converse may not be true. A person may have more than one identification number. At least when time is a factor.
As an example, for technical-historical reasons, the Danish CPR number carries information (which keys shouldn't do), such as a person's date of birth and sex. Since 2014 a new law enables transsexual citizens to get a new CPR number that reflects their perceived gender. The consequence is that the same person may have more than one CPR number. Perhaps not more than one at the same time, but definitely two during a lifetime.
Even if existing keys are guaranteed to be unique, you can't assume that the uniqueness gives rise to a bijection. If you use an external unique key, you may lose track of the entities that you're trying to keep track of.
This is true not only for people, but cars, bicycles (which also have chassis numbers), network cards, etc.
Finally, even if you've found a natural key that is guaranteed to be unique and track the actual entity that you want to keep track of, there's a final argument against using an externally defined key in your system: Data-entry errors.
Take the story about my car's chassis number. The mechanic who spotted the discrepancy clearly interpreted it as a clerical error.
After a few decades of programming, I've learned that sooner or later, there will be errors in your data. Either it's a clerical error, or the end-user mistyped, or there was a data conversion error when importing from an external system. Or even data conversion errors within the same system, as it goes through upgrades and migrations.
Your system should be designed to allow corrections to data. This includes corrections of external keys, such as chassis numbers, government IDs, etc. This means that you can't use such keys as database keys in your own system.
Many were the times, earlier in my career, when I decided to use a 'natural key' as a key in my own database. As far as I recall, I've regretted it every single time.
These days I follow a hard heuristic: Always use synthetic keys for database tables.
Is it ever a good idea to use natural keys in a database design? My experience tells me that it's not. Ultimately, regardless of how certain you can be that the natural key is stable and correctly tracks the entity that it's supposed to keep track of, data errors will occur. This includes errors in those natural keys.
You should be able to correct such errors without losing track of the involved entities. You'll regret using natural keys. Use synthetic keys.
https://bxt.org/ajfur