
Enterprise Data Integration: The State of the Art

Recently I had to get up to speed on Cast Iron data integration solutions, now part of IBM. I have to admit, I went into it a bit pessimistic about the very notion of “point-and-click programming.” The basic value proposition of Cast Iron is to make it easy to move enterprise data from one place to another, for example, synchronizing an Oracle database of accounts and customers with your Salesforce.com database.

cast-iron-diagram

You do this by creating Projects with Orchestrations, which are made up of Activities that talk to and listen to Endpoints such as HTTP, FTP, SMTP, and databases. Most Activities can map and transform data before passing control to the next Activity. There are also control flow Activities like If/Then, Do/While and Try/Catch. This kind of visual activity building isn’t unique to Cast Iron, of course, and it makes pretty pictures. The diagram component used in Cast Iron Studio, a Java desktop app, is quite nice to work with.
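
To make that structure concrete, here is a minimal Python sketch of an Orchestration as an ordered chain of Activities, each of which transforms a record and hands it to the next. This is not Cast Iron's API: the class names, the `rename_fields` and `normalize_phone` steps, and the column names are all hypothetical, chosen only to illustrate the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Activity:
    """One step: takes a record and returns the (possibly transformed) record."""
    name: str
    run: Callable[[Record], Record]


@dataclass
class Orchestration:
    """An ordered chain of activities; control passes from one to the next."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, record: Record) -> Record:
        for activity in self.activities:
            record = activity.run(record)
        return record


def rename_fields(record: Record) -> Record:
    # Hypothetical mapping from source column names to target field names.
    return {"Name": record["ACCOUNT_NAME"], "Phone": record["PHONE_NO"]}


def normalize_phone(record: Record) -> Record:
    # Keep only the digits of the phone number.
    digits = "".join(ch for ch in str(record["Phone"]) if ch.isdigit())
    return {**record, "Phone": digits}


sync_accounts = Orchestration(
    name="sync_accounts",
    activities=[Activity("map", rename_fields), Activity("clean", normalize_phone)],
)

result = sync_accounts.execute({"ACCOUNT_NAME": "Acme", "PHONE_NO": "(555) 010-2000"})
print(result)  # {'Name': 'Acme', 'Phone': '5550102000'}
```

A real tool adds endpoints, branching and error handling on top of this, but the core is still a pipeline of small transforming steps.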

Within Activity blocks you can map data by dragging and dropping:

cast-iron-map

I learned that the mapping interface is essentially an XSLT generator. When you peek a bit under the covers of Cast Iron, you’ll discover that it’s all about converting data to XML representations, building XSLTs and converting XML back to database calls or strings or whatever your endpoint needs. But Cast Iron hides all that from you, mostly. Once you’ve built your Orchestration and debugged it to your satisfaction, there’s a server (“appliance”) that you can push the Project to, and it’ll host the whole thing. Or you can have Cast Iron host it for you in “the cloud.”
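
The round trip described above (record to XML, XML to XML, XML back to a record) can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's standard library has no XSLT processor, so a plain `transform` function stands in for the generated stylesheet, and `FIELD_MAP` and the element names are invented for the example rather than taken from Cast Iron.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping, standing in for what a drag-and-drop map would produce.
FIELD_MAP = {"ACCOUNT_NAME": "Name", "PHONE_NO": "Phone"}


def row_to_xml(row, tag="row"):
    """Convert a flat record into an XML element, one child per column."""
    root = ET.Element(tag)
    for key, value in row.items():
        ET.SubElement(root, key).text = str(value)
    return root


def transform(source, field_map, tag="Account"):
    """Stand-in for the XSLT step: copy mapped elements under new names."""
    target = ET.Element(tag)
    for src_name, dst_name in field_map.items():
        node = source.find(src_name)
        if node is not None:
            ET.SubElement(target, dst_name).text = node.text
    return target


def xml_to_row(element):
    """Convert the target XML back into a flat record for the endpoint."""
    return {child.tag: child.text for child in element}


source_xml = row_to_xml(
    {"ACCOUNT_NAME": "Acme", "PHONE_NO": "5550102000", "INTERNAL_ID": 42}
)
target_xml = transform(source_xml, FIELD_MAP)
target_row = xml_to_row(target_xml)

print(ET.tostring(target_xml, encoding="unicode"))
# <Account><Name>Acme</Name><Phone>5550102000</Phone></Account>
```

Unmapped fields such as `INTERNAL_ID` simply drop out, which is the same behavior you would expect from a stylesheet that only selects the mapped elements.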

It’s all actually very neat once it comes together and works. And it’s true, for basic data integrations no programming is required, and you can just point-and-click your way to data integration nirvana. Sounds easy, right?

Reflecting on this technology, I’m struck by one thing: data integration is still hard.

We had a small but diverse group in the training session. Most participants introduced themselves as developers or systems analysts. But even with a lot of hand-holding and a very knowledgeable and effective instructor, these fairly basic exercises often proved challenging. At various points, each person got stuck and needed help to get unstuck. Plus, these exercises seemed nowhere near the complexity of real-world data integrations.

Making changes required lots of hunting, clicking, dragging, typing little bits of text, waiting and repeating. Even putting aside numerous UI annoyances and glitches, the experience of building even a fairly basic Orchestration with a small handful of Activities can be pretty frustrating. Looking around the room, I felt that while people were pleased when they finally got a lab exercise working, they were concerned that they needed so much help from the instructor, and were often stymied trying to work things out on their own. Although my background made the exercises pretty straightforward for me, it was clear this stuff still isn’t quite within easy reach.

Makes me wonder: can data integration really be easy one day?

What if you could run software that works like this:

  • Asks you: Where’s your data?
  • Asks you: Where do you want to move your data?
  • Gathers and verifies all authentication information, then gets to work:
    • Performs inspection of data sources and targets, including random sampling,
    • Algorithmically generates ETL Orchestration, highlighting areas of greatest uncertainty,
    • Includes logging, email/SMS alerts, error handling, and intelligent upserts, all heuristically and algorithmically determined,
    • Automatically creates “staging/sandbox” environments based on the data targets, then shows you previews of the data integrations without you having to build your own staging environments,
  • If needed, lets you make deeper customizations by hand,
  • Once you’re satisfied, allows one-click deployment and activation
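
As a toy illustration of the “algorithmically generates, highlighting areas of greatest uncertainty” step in the list above, here is a rough Python sketch that proposes a field mapping by name similarity and flags weak matches for human review. It is nowhere near the magical algorithm, just a hint of its shape. The function names, the 0.6 threshold and the sample field names are all assumptions made up for the example; a serious version would also look at sampled data, types and keys.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and strip punctuation so ACCOUNT_NAME and AccountName compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def propose_mapping(source_fields, target_fields, threshold=0.6):
    """Pair each source field with its most similar target field.

    Returns proposals sorted with the least certain first, so the areas
    of greatest uncertainty are the first thing a human sees.
    """
    proposals = []
    for src in source_fields:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target_fields
        ]
        score, best = max(scored)
        proposals.append(
            {
                "source": src,
                "target": best,
                "score": round(score, 2),
                "needs_review": score < threshold,  # arbitrary cutoff
            }
        )
    return sorted(proposals, key=lambda p: p["score"])


proposals = propose_mapping(
    ["ACCOUNT_NAME", "PHONE_NO", "CUST_SEGMENT"],
    ["AccountName", "Phone", "Industry"],
)
for p in proposals:
    flag = "REVIEW" if p["needs_review"] else "ok"
    print(f'{p["source"]:>14} -> {p["target"]:<12} {p["score"]:.2f} {flag}')
```

Name matching alone gets the easy fields right and has no idea what to do with `CUST_SEGMENT`, which is exactly the kind of case the tool should surface rather than guess at silently.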

Perhaps a very naive vision, but this is what Usable Data Integration would be to me. Although tools like Cast Iron are nice and pay a lot of lip service to simplicity, data integration with them is still definitely not “simple” except in the simplest scenarios. Yes, the “secret sauce” would be that magical algorithm in the middle step.

Maybe integration has innate complexity and can never be made simple?

I’d like to think and dream that integration can be simple. It’s just a hard problem.
