Today it happened again… Why don’t we have a better way of handling this?
My project (as I probably mentioned before) is a data warehouse solution to calculate market risk for a bank. As with all data warehouse solutions, much of what we do deals with data in files and data in tables. The tables themselves are represented as files too (we are using SAS solutions). Over time things change: names change, definitions change, machines change, etc. But in many cases the old files that are no longer relevant do not get deleted, and they become garbage that lies around the various directories of our system. Everything works fine and nobody is worried.
Then one day we move to a new machine, or to a new environment, or to something else. All of a sudden, the garbage is not there anymore. And we get a crash. It turns out that everything was not fine: the system had not been reading the files we thought it read, but the garbage instead. With no garbage, there is nothing to read, and therefore a crash. This has happened to us at least four or five times over the past three weeks (we are in deployment mode right now).
The agile movement talks about failing fast. You should design your systems so that any bugs make the system crash and crash early. This way, the bugs are found early, and can be fixed quickly and cheaply.
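To make the idea concrete, here is a minimal sketch (in Python, since SAS is what we actually use but not everyone reads it) of a fail-fast guard on input files. The function name and the staleness threshold are hypothetical, not anything from our system; the point is simply that a loader should refuse to run against a missing, empty, or suspiciously old file instead of silently picking up whatever happens to be lying in the directory.

```python
import time
from pathlib import Path


def load_required_input(path_str, max_age_days=None):
    """Fail fast: refuse to start if the expected input file is
    missing, empty, or older than we are willing to accept.

    `max_age_days` is a hypothetical staleness guard: if set, a file
    last modified more than that many days ago is treated as garbage.
    """
    path = Path(path_str)
    if not path.exists():
        raise FileNotFoundError(f"Required input missing: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Required input is empty: {path}")
    if max_age_days is not None:
        age_days = (time.time() - path.stat().st_mtime) / 86400
        if age_days > max_age_days:
            raise ValueError(
                f"Required input is {age_days:.0f} days old, "
                f"older than the allowed {max_age_days}: {path}"
            )
    return path.read_text()
```

With a guard like this, a move to a new machine where the stale file is absent fails at the first load with an explicit message, rather than weeks later with a mysterious crash deep inside the calculation.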
Our data warehouse system, based on SAS, has evidently not heard about failing fast yet. Nor do I see that mindset in the community of data warehouse developers I have met on this project. Could this be a big opportunity for improvement? Time for me to start some convincing…