
Master data and local data

Lately I've been involved in a project on charging of banking fees at a major Swedish bank. My role has been architect and system analyst. The project is about introducing a new charging solution into the system landscape as well as sunsetting an old one.

As with all large corporations, the system landscape is complex and contains lots of systems. One system deals with customer data, and our system has to interface with it to get customer information. That system is the master for customer data across the whole business.

Now there is a need to enrich the customer information available in the central system. One example is that we need to store an invoicing address, which the central system does not hold today; there are other examples as well.

So, now we have two options on how to do this:

First, we could augment the central system with the invoicing address information; second, we could let our own system keep the invoicing address along with the other new pieces of information.

How would you do it?

There are, of course, pros and cons to either approach. The first is beneficial since it makes the invoicing address easily available to other systems should they need it, and it keeps the customer information together, giving a fuller picture of the customer in one place. The second, however, is more in line with the current solution and requires much less analysis and development work; it makes the solution for this project simpler and introduces fewer dependencies.

On a theoretical level, one could say that information that is clearly global across the business should be mastered in the global system and information that is clearly specific and local to one system should live in the local system.  Also, one needs a mechanism to link records in the global system with records in the local one.
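
To make the linking mechanism concrete, here is a minimal sketch of the second option in Python. All names and record layouts are hypothetical illustrations, not the actual bank systems; the point is simply that the local system masters its own enrichment and links back to the global record via the shared customer ID.

    from dataclasses import dataclass

    # Customer record as mastered in the central (global) system.
    @dataclass
    class GlobalCustomer:
        customer_id: str       # global key, owned by the central system
        name: str
        legal_address: str

    # Charging-specific enrichment, mastered locally in our system.
    # The link back to the global record is the global customer_id.
    @dataclass
    class LocalCustomerExtension:
        customer_id: str       # foreign key into the global system
        invoicing_address: str

    def full_customer_view(glob: GlobalCustomer,
                           loc: LocalCustomerExtension) -> dict:
        """Join global master data with local enrichment at read time."""
        assert glob.customer_id == loc.customer_id, "records are not linked"
        return {
            "customer_id": glob.customer_id,
            "name": glob.name,
            "legal_address": glob.legal_address,
            "invoicing_address": loc.invoicing_address,
        }

With this split, the central system remains the single master of the shared attributes, our system is the single master of the charging-specific ones, and the join happens only at read time.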

But this case is not so clear… Is the invoicing address local to charging, or is it interesting from other points of view as well? The answer is not obvious; it is more a matter of opinion…

Datensparsamkeit and quality of data

I just read Martin Fowler's blog post on the concept of Datensparsamkeit, which you can find at martinfowler.com/bliki/Datensparsamkeit.html.

My summary of his post is that he argues that organizations should only store the data they really need and not, in this day and age of big data, store all data they can get their hands on. His primary concern is that of privacy.

I fully support his concern for privacy, but reading the blog post got me thinking about another reason for Datensparsamkeit.

In my current project we have been collecting data from various sources, internal to the organization as well as external. After all, the project is about a data warehouse, and those are fundamentally about collecting data. In some situations the data was not needed right away, but for various reasons we built the interfaces to collect it anyway. The interfaces were then put into production and data collection started.

In every situation where we collected data that was not needed right away, we ran into serious data quality issues. In one case an interface was broken for months: no data was collected, and nobody realized it. In another case a calculation that generated derived data was seriously flawed for months, and again nobody realized it.

Since the data was not needed right away, nobody was ensuring its quality. Collecting it was therefore not only a big waste; it also led to serious rework to fix the problems.
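
A couple of cheap, automated checks could have caught both failures within days instead of months. Here is a minimal sketch in Python; our actual stack was SAS-based, and the directory layout, file pattern, and thresholds below are illustrative assumptions, not the real system:

    import datetime as dt
    from pathlib import Path

    def check_feed_freshness(feed_dir: Path, max_age_days: int = 2) -> None:
        """Fail loudly if a feed has silently stopped delivering files."""
        files = list(feed_dir.glob("*.csv"))
        if not files:
            raise RuntimeError(f"no files at all in feed {feed_dir}")
        newest = max(f.stat().st_mtime for f in files)
        age = dt.datetime.now() - dt.datetime.fromtimestamp(newest)
        if age.days > max_age_days:
            raise RuntimeError(
                f"feed {feed_dir} is stale: newest file is {age.days} days old"
            )

    def check_derived_values(values: list[float],
                             lower: float, upper: float) -> None:
        """Sanity-bound derived data; a grossly flawed calculation
        tends to produce values outside any plausible range."""
        bad = [v for v in values if not lower <= v <= upper]
        if bad:
            raise RuntimeError(
                f"{len(bad)} derived values out of range, e.g. {bad[:3]}"
            )

The first check would have flagged the broken interface; the second would have flagged the flawed calculation. Neither requires anyone to actually use the data, only that someone wrote down what "healthy" looks like when the collection was built.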

So, in addition to the privacy concerns, I would add the risk of serious data quality issues as another argument for Datensparsamkeit.

Fail-fast and garbage files

Today it happened again… Why don’t we have a better way of handling this?

My project (as I have probably mentioned before) is a data warehouse solution that calculates market risk for a bank. As in all data warehouse solutions, much of what we do deals with data in files and data in tables, and the tables themselves are represented as files too (we are using SAS). Over time things change: names change, definitions change, machines change, and so on. But in many cases the old files that are no longer relevant do not get deleted, and they become garbage lying around the various directories of our system. Everything seems to work fine and nobody is worried.

Then one day we move to a new machine, or a new environment, or something else changes. All of a sudden the garbage is not there anymore, and we get a crash. It turns out that everything was not fine after all: the system was not reading the files we thought it read, it was reading the garbage. With no garbage there is nothing to read, and therefore a crash. This has happened to us at least four or five times over the past three weeks (we are in deployment mode right now).

The agile movement talks about failing fast: you should design your systems so that bugs make the system crash, and crash early. This way bugs are found early and can be fixed quickly and cheaply.
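
Applied to our garbage-file problem, failing fast could be as simple as validating the inputs against an explicit manifest at startup, so that stale or unexpected files cause an immediate, loud failure instead of being silently read. A sketch in Python (our stack is SAS, and the manifest and file names below are hypothetical):

    from pathlib import Path

    # The files the job is *supposed* to read, kept under version
    # control alongside the code. (Hypothetical names.)
    EXPECTED_INPUTS = [
        "positions_current.sas7bdat",
        "market_data_current.sas7bdat",
    ]

    def resolve_inputs(data_dir: Path) -> list[Path]:
        """Fail fast: refuse to start unless exactly the expected files exist."""
        missing = [n for n in EXPECTED_INPUTS if not (data_dir / n).exists()]
        if missing:
            raise FileNotFoundError(f"missing expected inputs: {missing}")
        unknown = [p.name for p in data_dir.iterdir()
                   if p.name not in EXPECTED_INPUTS]
        if unknown:
            # Garbage detected: complain now, rather than crash
            # mysteriously after the next migration.
            raise RuntimeError(f"unexpected files in {data_dir}: {unknown}")
        return [data_dir / n for n in EXPECTED_INPUTS]

A check like this turns "the garbage happened to be there" into an error on day one, which is exactly when it is cheap to fix.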

Our data warehouse system, based on SAS, has evidently not heard about failing fast yet. Nor do I see that mindset among the data warehouse developers I have met in this project. Could this be a big opportunity for improvement? Time for me to start some convincing…