The ETL Generator - A Holy Grail (or is it?)

Posted on June 26, 2008 by Thomas
Filed Under Data Warehousing, Super Fancy Sexy, Trends |

Auto GeneratedFor some reason I can’t quite fathom myself either, I spend considerable amounts of time in all manner of datawarehouse / ETL related product development groups. (A substantial bit of that time, I should add, being my own; major system integrators put great value in R&D - just as long as it doesn’t eat into the billable hours.) And whether they called themselves Focus Groups, or Special Interest Groups, or Knowledge Groups, what they basically are all looking for is the answer to one simple question: how can we do what we do already, only better? And of course when we say better mostly we mean faster and, above all, cheaper. If your project leaders are anywhere near as sharky as mine, you’ll find that sooner or later you have to come up with a good reason why this ETL stuff is taking so bloody long.  

Not that it is unfair of them to ask. I mean, let’s face it: the Business Intelligence projects that we work on generally come with price tags that would enable you or me to spend the rest of our lives on pearly white beaches, were the money to be deposited on our bank accounts directly. And while we’re at it, let’s double face it: a substantial amount of that money - up to 75 % I’ve heard mention - is spent not on the reports that are the end result, but on building the data warehouse and the ETL that is used to fill it with juicy numbers. And to top it all off, let’s triple face it: a whole lot of those data mappings we were building in that time really weren’t all that complicated. So the question, at least, is a valid one.

One easy answer (and in fact, one you’ll frequently see mentioned on BI expert blogs) might seem to just outsource part or even all of your ETL development to a place where labor costs are lower. India, you hear say for instance, offers the same quality techies at half the price. Sounds like a good deal. I can tell you right now though that it’s easier said than done. Never mind the language and cultural differences - suddenly you’ve got your team spread out over two continents and at least eight time zones. If making a group of people all sitting in the same room work in the same direction is already such a challenge (and it is), imagine how something like that will impact your progress. Outsourcing BI development takes time, it takes experience, and it takes a whole new approach to how we get things done.

So while investing time and effort in outsourcing might still be a good thing to think about in the long run, it definitely is not the easy way (or the cheap way) out. We’ll have to make do with the resources we have at hand already. The solution will have to come from the inside.

With that in mind, we return to how we can streamline our own ETL work. I find that the discussion often quickly leads to industrialization; the  mechanization of the process through standards, templates and automation. In less fancy words: to spend less time on creating the same mapping. And the way to do it, it is often agreed, is to build ourselves a mapping generator.

For those who are not familiar with the term: mapping generators are pieces of software that can generate ETL out of your technical design, usually based on a standard design. The idea is that if you can reduce your ETL to an X number of general templates, you can combine those templates with your source-to-target dataflow documents and build the necessary ETL through some automated process.

The idea is not new. Informatica for instance has loads template mappings that can save you work, and has been offering a plugin that you can use to create mappings in Visio for ages. And even in that distant past when datawarehouses were maintained by nothing more than packages of SQL statements, there were people that automatically generated ‘insert into as select from’ statements based on lists of columns. But mapping generators pur sang - programs that create ETL for Informatica, SSIS or DataStage, are rare. Your average data logistical tool offers a whole range of bells and whistles to play with, so building something that generates ETL on the fly for it is challenging to say the least.

Challenging, but also very cool indeed. 

Just imagine it… Say you take your average datawarehouse: something snowflaked plus possibly a staging layer for good practice. Say you got one fact mapping plus ten dimensional tables in there. Two layers, eleven tables each… that’s twenty-two ETL flows. Some would be a bit more complicated and a lot of others (like in the staging layer) definitely not, but give it an average time of four hours development per flow. That’s a good two weeks of project time right there, and we haven’t even talked about testing it yet. Even if you’d just be able to delegate the easy ones to some unix script or java program, that’s a smack load of work you don’t have to do. Better yet: if the generator creates output that can be trusted, there is no chance of technical errors in the ETL, meaning that the quality will actually go up. Leaving more time for you to focus on the difficult parts, which are more interesting to work on anyway. Goodness all around! Making the mapping generator is any ETL developer’s wet dream.

Or is it? Here’s a few reasons why you might want to wait with proclaiming this the solution to all your problems just yet.

Limited By Templates.

The reason why mapping generators aren’t common practice already is also the reason why they might not be as useful as they might seem at first. ETL tools and the mappings they create are too complex to be easily copied into a homegrown program. Or rather: the things that we need the ETL to do are too complex. For any generator to work, most of your ETL needs to fit in the straitjacket of a few standard mapping templates. If you need to spend two hours retrofitting each generated mapping to build in all the things that weren’t a part of the mapping template, you’d have been better off not generating it in the first place.

Super Fancy Sexy, But No Advantage.

Making your generator so advanced that it does handle those exceptions outside of the templates as well might not be the answer either however. You can gold plate your generator if you want, but the more options you build in, the more things you will have to set up when you run it to generate an ETL mapping. You’ll quickly find that you’re spending almost as much time on specifying which of all those bells and whistles to use as you would be spending on actually building the damn mapping yourself.

Rebuilds Not Included.

And lastly, but most importantly: generators might not be saving you as much time as you initially thought they would, simply because you might not be spending as much of your project time on actually building things. Generally speaking, the creation of any ETL dataflow is only one part building and three parts revising it. Either the source data isn’t what you were expecting it to be or there was a mistake or the business requirements change, but whatever the reason, only rarely will the original mapping design also be the final one. And no amount of automatic generation will ever help you reduce that process…

Now… Having said all that, I am still convinced that mapping generators are worth the time that we spend on them. Experience shows that at least a considerable portion of your ETL will abide to template standards, particularly when you’re working in the kind of mature data warehouse environment that is most suited for large ETL tools in the first place. Using generators will seriously reduce the risk of technical errors, because of the uniformity of the output. You’re ensured that large parts of your documentation (another one of those perpetual headaches) will always be up to date, because ETL is generated directly out of the source-to-target dataflow designs. Because you need templates to make the generator work you force yourself to think about standardization, which will greatly benefit the lives of  your data warehouse administrators. And even if you only save time on a portion of the work, you are still saving time. ETL is sucking up a considerable portion of your BI project time, so any improvement in that area should be welcomed with open arms.

And on top of all that, there is the fact that building your very own mapping generator has a severely high coolness factor.

In conclusion, we can say that ETL mapping generators will not reduce the time spent on building ETL to zero; far from it. You will still be breaking your head over why the data doesn’t look like you expected it to, and you will continue to curse those horrible business users and their shifty ideas of what it is they want. Using an ETL mapping generator, it is highly likely that you will have to rebuild the stuff that is there at least as often as you are doing now. But while mapping generators might not be the holy grail some people seem to make of it, it can at least be part of the solution.

Because by taking over that which is tedious and repetitive, what they will do is help you shift your attention to those things that actually deserve it.

 

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Fark
  • Technorati
  • del.icio.us
  • Google
  • StumbleUpon
Email This Post Email This Post

Comments

One Response to “The ETL Generator - A Holy Grail (or is it?)”

  1. Bram on June 27th, 2008 11:38 pm

    How much should an ETL generator be able to do? While trying to develop an ETL generator I quickly came to the conclusion that the actual building of the generator would not be the problem, but how do I tell the generator what I want it to generate?

    After a couple of short nights thinking about ETL generators I came up with the following:
    Most ETL tools available (Microsofts SSIS, Informatica, the ever-interesting-sounding-but-never-seen-it Ab Initio and others) ARE ETL generators already. You tell them what to do in their own specific ways and they will handle it. They generate some fancy XML-code and do ETL-stuff for you.

    So, does that mean we don’t need ETL generators since we already have them? I don’t think so. A fair slice of our work consists of doing the same all over again. Using generators to .. well generate a ‘part’ of your package/mapping, this worktime can be cut by quite a lot. For the moment using templates will have to do, but I am looking forward to more advanced implementations.

    Currently we create a design document which is next turned into a mapping/package by an ETL-guy or -girl. In a very optimistic scenario I can see this design document being parsed by an ETL generator and then passed on to the ETL-guy/girl. He/she removes the last few errors caused by ‘dirty’ data or errors on the designers behalf.

    That’s where I (hope to) see ETL generators in the quite-near future.

Leave a Reply