Profiling your data
Posted on June 22, 2008 by tom
Filed Under Analysis |
Profiling the data can be described as getting an overview of the data. This overview gives you an indication on the maximum, the average, lowest figure, the highest, number of distincts etc. From such an overview, one is able to know if the data comply to certain business rules. One also gets a hunch of the data: if we have amounts, is it expressed in cents or euros. Or if weights are involved, do we have a weight that is expressed in gramms, kilos or tons.
A mathematician considers profiling the data as an easy job: only key statistics (frequencies, averages) are derived and no high level mathematics is involved. I remember the face of a mathematician, who after having explained what data profiling was, asked the question: what the hell are talking about? Data profiling is of more interest to the analyst, who wants to know if the data comply to buisiness rules and who would like to get more acquintance with the data.
Data profiling is mostly done during the analysis of an ETL project when data are to derived from sourcing systems and to be put in a data warehouse. Before building starts, we need to know the likely content of the data. Based on this knowledge, one could establish a dataflow that takes into account the particuliarities of the data. Example: if we know from the analysis that dataes of birth are subject to human error, we might include a check to see if the date of birth is within an acceptable range.
Previously such analysis was done thru a series of queries, in order to calculate the key statistics for every attribute in the database. Nowadays alternatives haven arisen that make life less complicated. I will mention 3 such alternatives:
In the Oracle Warehouse Builder, a wizard exists that creates a nice overview for an external table that is imported in OWB.
The SQL Server 2008 also includes a wizard to do the data profiling.
Lastly freeware tool exists that can be downloaded from “arrah technologies”. It has a fully functional version that can be used for a limited period. After that, it should beĀ bought. To install, one should have the exact version of java that is required (1.5.0_09). Also, the way to get it started is somewhat tricky: it will start with “java Profiler”, which is not documented. In their tutorial, they showed a different command line, which did not get it running. Finally the data base connection should be set manually, which can also be tedious. But after all that, we have an intuive profiling tool that does what it should do: give insight into the data by means of key statistics.
In this article, I used much information from Henk Jan te Brake. Thanks to him; errors in this short overview are mine!
Comments
Leave a Reply






