There are too many different ETL engines to mention them all, but I have some opinions and some information about these products to share. If you have questions, this is the place to ask, although any technical answers requiring code will be shared on my blogging site: http://danLinstedt.com – here I will speak about the technology itself.
Basics about these tools and where they came from…
Informatica grew up as a metadata-driven, GUI development environment. The engineers and founders decided it was a good idea to model ANSI-SQL style commands in a visually driven drag & drop style, and to keep this information in a metadata repository. So they set off on their way. Foundationally strong in metadata, they also made a key decision to put their metadata into table structures in standard database systems (hence the repository). However, they wanted to protect their metadata repository as “intellectual property” owned by the company, so they removed indexes, removed primary and foreign keys, and gave some of the elements confusingly similar names.
This had its upsides and downsides – being metadata driven, the focus was not performance or parallelism until they reached version 5. In version 5, they listened to their user base (which was growing by leaps and bounds because of ease of use) and added parallelism and partitioning, and thus was born the PowerCenter product (PowerCenter actually started in version 3 as an “enterprise class” offering) – it spawned from the PowerMart line of code. They went on to mix, match, and re-merge PowerMart back into one line of product called PowerCenter (which we know and love today). They added HA, fail-over, clustering, and grid. Last year they added PowerCenter to the cloud computing space.
Their focus today is still very much metadata (and very strong), but they are now extremely fast and can perform extremely well under the right circumstances.
** NOTE: I’ve never signed an NDA with the company, never had to – the information I retrieved was from reading their publicly filed patents, and talking to their folks (seeing their product demonstrated) at trade shows **
Ab Initio grew up as a speed monger. They did not start with metadata first; their whole purpose in life was bent on making the base OS a) fault-tolerant, b) fast, c) capable of partitionable flat-file access – that’s right, it’s all 100% flat-file driven. So what they did was create something called the Co-Operating system. The Co-Operating system is basically a parasite: it sits on top of your existing OS and plugs in at the I/O interrupt level – it assumes and overrides 100% of all file operations on the machine. Even files that aren’t “running” through Ab Initio go through the Ab Initio interrupts at the file system level.
They then constructed their own “data description and manipulation language,” known as Ab Initio scripts (much the way C-shell or K-shell acts in Unix). They interpret (really, compile) these scripts at the start of a run-request, and then figure out mathematically which operations can be split across which resources. Their only notion of a shared repository (at that time) was the systems-performance metadata. They gather this metadata from each registered system upon install to build a profile of the machine’s capabilities. This makes heterogeneous activity possible, and load-balancing mathematically easy. It’s also why, when you upgrade your OS or your hardware, you’re asked to re-run the machine profile and re-register it with the “central brain”.
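The profile-driven load balancing described above can be sketched in a few lines. This is only an illustration of the idea – assigning partitions of work in proportion to a per-machine capability score – not Ab Initio's actual algorithm; the machine names and scores below are hypothetical:

```python
# Rough sketch of profile-weighted partition assignment. This illustrates
# the concept of using per-machine capability profiles for load balancing;
# it is NOT Ab Initio's real algorithm, and the scores are made up.

def assign_partitions(partitions, profiles):
    """Greedily place each partition on the machine with the most
    remaining weighted capacity (profile score / work already assigned)."""
    load = {m: 0 for m in profiles}
    placement = {}
    for p in partitions:
        # pick the machine whose remaining weighted capacity is highest
        best = max(profiles, key=lambda m: profiles[m] / (load[m] + 1))
        placement[p] = best
        load[best] += 1
    return placement

# Hypothetical capability scores gathered at "registration" time
profiles = {"node_a": 8.0, "node_b": 4.0, "node_c": 2.0}
placement = assign_partitions([f"part{i}" for i in range(14)], profiles)
```

With these scores, the 14 partitions land on the machines in proportion to their capability (8 : 4 : 2), which is the effect the machine-profile metadata is after.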
But as I said, the Co-Operating system is parasitic… It renders the rest of the OS (for all intents and purposes) useless, even though they’d like you to believe it’s seamless and harmless. Yes, some things still work properly – but others give spurious errors (unknown and untraceable). Anyhow, I digress. They then developed a GUI that allowed designers to build “data flows” – graphs, they call them… These graphs were then “exported” to the Co-Operating system (produced, if you will). What that process really does is generate Co-Operating system scripting code.
The unfortunate part of this is that programmers decided they could take the generated code and make it better/faster… so they tweak it. Once they tweak it, it’s disconnected from the metadata graph in the GUI… and bang, your maintenance costs rise again – and you’ve got uncontrollable code sprawling across your enterprise. Now, let’s talk about speed… BLINDINGLY FAST – IF all your operations are flat-file based. They do a decent job (like everyone else) of making use of database connectivity, but their flat-file execution rules; it’s a whole different animal. It’s because of this one trait that Ab Initio can survive and actually compete against Tibco and MQ Series (they are a better fit for messaging systems than for the ETL batch world) – so long as you like, or can manage, the code sprawl.
They built their own metadata database (an internal binary format – they told me once it was a PIC UNIVERSE… sound familiar?), but the source that told me that soon denied it. Anyhow, binary or proprietary metadata is a serious problem in the industry.
SSIS, from Microsoft, stands for SQL Server Integration Services (formerly known as DTS – Data Transformation Services)… SSIS has come a LONG, LONG way since its beginnings, but it too started life as a “tool to call other tools” with, or to execute procedural code in the database. It was not quite a code generator, but not quite a metadata-driven tool either. Since then, however, it’s been implemented in Visual Studio (awesome idea), hooked up with COM objects, and given a full-on GUI development environment. It even has metadata – but where is this metadata? Alas, the file system. It’s stored, like nearly every other VS project (without Team Edition development, that is…), in the file system. An SSIS project is a series of XML documents – seriously complicated and hard to understand (the XML, anyhow), but nonetheless well formed.
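Because a package is just well-formed XML, you can walk it with any standard XML parser. The sketch below uses a deliberately simplified, hypothetical fragment in the spirit of a .dtsx file (real packages are far more verbose), so the element and attribute names here are assumptions for illustration only:

```python
# Sketch: listing the tasks in an SSIS-style package. The XML fragment is
# a simplified, hypothetical stand-in for a real .dtsx file -- real
# packages are much more complicated, but the parsing technique is the same.
import xml.etree.ElementTree as ET

DTS_NS = "www.microsoft.com/SqlServer/Dts"  # DTS-style namespace (assumed)

sample_dtsx = f"""<?xml version="1.0"?>
<DTS:Executable xmlns:DTS="{DTS_NS}" DTS:ObjectName="LoadWarehouse">
  <DTS:Executables>
    <DTS:Executable DTS:ObjectName="Truncate Staging"/>
    <DTS:Executable DTS:ObjectName="Load Dim Customer"/>
  </DTS:Executables>
</DTS:Executable>"""

root = ET.fromstring(sample_dtsx)
# ElementTree exposes namespaced tags/attributes as "{namespace}Name"
tasks = [el.attrib.get(f"{{{DTS_NS}}}ObjectName", "?")
         for el in root.iter(f"{{{DTS_NS}}}Executable")]
```

Since the metadata lives in the file system rather than a repository, this kind of ad-hoc parsing is about the only way to report across packages.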
SSIS is not quite the powerhouse that Ab Initio or PowerCenter is, but it’s coming up fast. Now, if you are a 100% Microsoft shop, SSIS is THE way to go (if you don’t have additional money to spend or can’t justify the cost of another tool). SSIS is REALLY good at working with SQL Server instances, especially SQL Server 2008 R2. SSIS is lacking a metadata repository, and it is lacking the ability to hook into a data model design tool. If Microsoft brings these together, it would be the foundation for a great product – especially if the metadata repository is in SQL Server, AND if it is an open, documented metadata model.
OWB (Oracle Warehouse Builder) grew up haphazardly… One gets the feeling that this started as a pet project somewhere inside Oracle and grew into the first edition of OWB nearly by accident. It was almost unusable in the early phases of the product, and even today has serious flaws and quirks that need fixing. Then along came Sunopsis (whose exit strategy was clear from the get-go: sell to Oracle). They sold, and bam – Sunopsis transformations became the backbone of OWB, and some of the Sunopsis GUI got worked into the nearly 900MB Java GUI distributable that eats up client workstations.
OWB is good for one thing (in my opinion): generating Oracle-to-Oracle stored procedure code. Once the code is “deployed” to Oracle as a stored procedure, it’s great – it runs fast, it’s (mostly) efficient, and it can be monitored. It has an in-Oracle metadata repository, which is very positive, but trying to understand where and how the metadata is stored is another question… All puns aside, I like working with TCL and scripting my mappings and objects in OWB – but I CANNOT stand their GUI. It’s like working in your back yard with your boots stuck in wet concrete, trying to move a heavy steel pole out of your way. Again, the upside is: once you get the steel pole where you want it, it works, it runs, and it runs well.
IF, however, you are not an ALL-ORACLE shop, then perhaps OWB is not for you. Their flat-file support, on the other hand, is good, as is their ability to deal with XML. And if you are using Oracle Message Queues, OWB is really good at that. I like the features that the database offers – but don’t be fooled: it’s not OWB that offers these features, it IS the database (just as I said previously).
Pentaho: a great product, a great team of engineers, a strong background in metadata. They too (like Informatica) started with their feet in metadata. They focused on it, so much so that they managed to hook their BI solution together with their ETL solution for data lineage. Ingenious (though they weren’t the first to try this either)… Over the past 3 years, they (like Informatica) have come a long way toward fixing and improving the performance of their engine. However, nothing beats the speed of a compiled language, and while Pentaho is good, it still can’t do the heavy lifting in the environments I work in (without additional products like Hadoop and cloud computing at the Java level)…
Cloud computing changes everything, and the jury is still out on this one. They are definitely a team to watch, with a product to watch… Now if only they had a data modeling tool…. hmmm…
Talend: a great product, another really good team of engineers. However, the way Jasper (the parent company) does business with their open source is more of a community approach. Pentaho likes to treat the repository as primary (all tools need to work together), whereas Talend ETL has its own repository and will communicate with other tools where applicable – but that’s not the primary focus. Talend ETL (as I understand it) grew up as a high-performance-minded engine where metadata comes second; and it shows.
I’m not saying this is a bad thing; I’m just saying this is where it started. Note: I see more adoption of Pentaho ETL than of Talend ETL – not sure why that is, but I’ve heard from customers that the GUI is hard to work with and the metadata interconnectivity is hard to deal with. Maybe there’s something in the learning curve on this one?
How many of you even knew that Embarcadero had an ETL tool? DT/Studio. Yup… well, I don’t know if they still do, but they did for many years. Embarcadero’s ETL is a Java-based solution (not open source, to my knowledge). It too grew up as a metadata-based solution, and it is connected to ER/Studio (their data modeling tool). From that perspective it is very powerful. Unfortunately, since I haven’t seen it in the major accounts I visit, I haven’t had enough exposure to the tool to comment further (sorry).
Ahh, DataStage… The name is what the product does: it “stages” data at each transformation before passing it downstream. It was built as an ETL GUI (because there was money in the product niche Informatica was in)… I am not particularly fond of the “old” DataStage, but the new products IBM puts out continue to get better, so I must applaud IBM for working at this product and improving it tremendously over the years. DataStage still stages data at each transform, but they’ve gotten wise (it used to do this 100% to disk caches, causing MASSIVE performance problems). They introduced memory caching and have improved the performance dramatically.
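The difference between staging every transform and streaming rows through can be sketched with a couple of hypothetical transforms – Python lists standing in for disk/memory stage files, generators standing in for a streaming pipeline. This illustrates the concept, not DataStage internals:

```python
# Sketch contrasting staged vs. streamed execution of the same pipeline.
# The transforms are hypothetical; lists stand in for materialized stages
# (old DataStage wrote these to disk), generators for a streaming flow.

def clean(rows):
    for r in rows:
        yield r.strip().lower()

def dedupe(rows):
    seen = set()
    for r in rows:
        if r not in seen:
            seen.add(r)
            yield r

def run_staged(rows, transforms):
    """Each transform fully materializes its output before the next starts."""
    for t in transforms:
        rows = list(t(rows))   # the full intermediate "stage" lands here
    return rows

def run_streamed(rows, transforms):
    """Rows flow through the whole chain one at a time."""
    for t in transforms:
        rows = t(rows)         # just composes generators; nothing runs yet
    return list(rows)

data = [" A ", "b", "B", " a "]
```

Both styles produce the same rows; the staged version simply pays for a full intermediate copy at every step, which is exactly where the old disk-cache behavior hurt.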
DataStage was a good middle-of-the-road product; now it’s ingrained in DB2 UDB – its connectivity to IBM database systems is OUTSTANDING. Its ability to run and generate COBOL and mainframe jobs is unbelievably strong. Its work with (other) relational databases leaves quite a bit to be desired. If you are using mainframe or mid-range DB2, AS/400 systems, ADABAS, or other unusual products, then DataStage might be the product you want. Of course, I think IBM now calls it “WebSphere” – a much larger product bundle. A great resource for this information is my friend Vincent McBurney on IT Toolbox – he’s got a great blog about WebSphere and DataStage.
One of the problems I used to have with DataStage is that every time I wanted to “re-use” a mapping, there was no copy-paste available for the GUI transformation objects. I think they’ve fixed this by now. Another issue used to be deployment of mappings from development to test – getting them to a new environment required re-creation by hand. Then a 3rd-party tool came around to assist with this (still not elegant, but I think it worked).
At the heart of the metadata repository was MetaStage – it used to run only on a PIC UNIVERSE database, a proprietary binary format. I’m not sure if it’s still limited to this, or if IBM corrected it by implementing MetaStage and the repository on DB2 UDB… But if it’s still PIC UNIVERSE, then I would have to complain. The problems with PIC UNIVERSE are well known: if the single metadata file gets corrupted, then ALL your metadata is done for, not just the “one mapping that went bad”. This is a horrible prospect. I also never liked the fact that PIC UNIVERSE didn’t have an ANSI-SQL standard driver. Anyhow, those are most likely old problems.
If you have other experience, or if you wish to correct me – please do so by offering comments. If you wish to discuss other tools that you have used/seen, those comments are welcome as well. Please feel free to contribute.
I am an independent data warehousing consultant who created the Data Vault Model and Methodology. You can find out more about Data Vault by clicking: http://danLinstedt.com – on this site you can discover all the information pertaining to my consulting practice.