
Enterprise Data Integration: The State of the Art

Recently I had to get up to speed on Cast Iron data integration solutions, now part of IBM. I have to admit, I went into it a bit pessimistic about the very notion of “point-and-click programming.” The basic value proposition of Cast Iron is to make it easy to move enterprise data from one place to another, for example, synchronizing an Oracle database of accounts and customers with your Salesforce.com database.

[Image: Cast Iron diagram]

You do this by creating Projects with Orchestrations, which are made up of Activities that talk to and listen to Endpoints such as HTTP, FTP, SMTP and Database. Most Activities can map and transform data before passing control to the next Activity. There are also control flow Activities like If/Then, Do/While and Try/Catch. This kind of visual activity building isn’t unique to Cast Iron, of course, and it makes pretty pictures. The diagram component used in Cast Iron Studio, a Java desktop app, is quite nice to work with.

Within Activity blocks you can map data by dragging and dropping:

[Image: Cast Iron map]

I learned that the mapping interface is essentially an XSLT generator. When you peek a bit under the covers of Cast Iron, you’ll discover that it’s all about converting data to XML representations, building XSLTs and converting XML back to database calls or strings or whatever your endpoint needs. But Cast Iron hides all that from you, mostly. Once you’ve built your Orchestration and debugged it to your satisfaction, there’s a server (“appliance”) that you can push the Project to, and it’ll host the whole thing. Or you can have Cast Iron host it for you in “the cloud.”
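To make the XML-in, XSLT, XML-out idea concrete, here is a minimal sketch in C# of what a single generated field mapping might boil down to. This is not Cast Iron's actual output: the record layout, the field names (ACCT_NAME, ACCT_PHONE, Name, Phone) and the MappingSketch class are all invented for illustration. It only shows the general technique of renaming fields with a stylesheet.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class MappingSketch
{
    // Hypothetical source record, as it might look once converted to XML.
    const string SourceXml =
        "<account><ACCT_NAME>Acme</ACCT_NAME><ACCT_PHONE>555-0100</ACCT_PHONE></account>";

    // Hypothetical stylesheet of the kind a drag-and-drop mapper could generate:
    // each target field pulls its value from one source field.
    const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output omit-xml-declaration='yes'/>
  <xsl:template match='/account'>
    <Account>
      <Name><xsl:value-of select='ACCT_NAME'/></Name>
      <Phone><xsl:value-of select='ACCT_PHONE'/></Phone>
    </Account>
  </xsl:template>
</xsl:stylesheet>";

    public static string Map()
    {
        // Compile the stylesheet.
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        // Run the source record through it and collect the result as text.
        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(SourceXml)))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Map());
    }
}
```

A real integration would have many more fields plus type conversions and lookups, which is presumably where a visual mapper earns its keep.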

It’s all actually very neat once it comes together and works. And it’s true, for basic data integrations no programming is required, and you can just point-and-click your way to data integration nirvana. Sounds easy, right?

Reflecting on this technology, I’m struck by one thing: data integration is still hard.

We had a small but diverse group in the training session. Most participants introduced themselves as developers or systems analysts. But even with a lot of hand-holding and a very knowledgeable and effective instructor, these fairly basic exercises often proved challenging. At various points, each person got stuck and needed help to get unstuck. And these exercises were nowhere near as complex as real-world data integrations.

Making changes required lots of hunting, clicking, dragging, typing little bits of text, waiting and repeating. Even putting aside numerous UI annoyances and glitches, the experience of building even a fairly basic Orchestration with a small handful of Activities can be pretty frustrating. Looking around the room, I felt that while people were pleased when they finally got a lab exercise working, they were concerned that they needed so much help from the instructor, and were often stymied trying to work things out on their own. Although my background made the exercises pretty straightforward for me, it was clear this stuff still isn’t quite within easy reach.

Makes me wonder: can data integration really be easy one day?

What if you could run software that works like this:

  • Asks you: Where’s your data?
  • Asks you: Where do you want to move your data?
  • Gathers and verifies all authentication information, then gets to work:
    • Performs inspection of data sources and targets, including random sampling,
    • Algorithmically generates an ETL Orchestration, highlighting areas of greatest uncertainty,
    • Includes logging, email/SMS alerts, error handling, intelligent upserts, etc., all heuristically and algorithmically determined,
    • Automatically creates “staging/sandbox” environments based on the data targets, then shows you previews of the data integrations without your having to make your own staging environments
  • If needed, lets you make deeper customizations by hand,
  • Once you’re satisfied, allows one-click deployment and activation

Perhaps a very naive vision, but this is what Usable Data Integration would be to me. Although tools like Cast Iron are nice, and they pay a lot of lip service to simplicity, integration is still definitely not “simple” except in the simplest scenarios. Yes, the “secret sauce” would be that magical algorithm in the middle step.

Maybe integration has innate complexity and can never be made simple?

I’d like to think and dream that integration can be simple. It’s just a hard problem.


Getting the raw SOAP XML sent via SoapHttpClientProtocol

Suppose you’re using the .NET SoapHttpClientProtocol to invoke a web API.  This is what you get when you use Visual Studio to add a Web Reference and it automatically builds a proxy for you.  Now suppose you want to programmatically access the raw SOAP XML that you’re sending to the web API.  Sounds straightforward, right?

Turns out, it isn’t straightforward at all.

Looking online for some help, I found a few proposed solutions.  Some people suggest using a network sniffer or HTTP proxy to get the raw SOAP XML.  That can work, but it’s not a good solution for programmatically getting the XML.  It’s also somewhat labor intensive to set up initially and then use on a regular basis.

It’s probably possible to create a SOAP Extension to do this.  But that’s a bit heavyweight for my purposes.

One guy dug into the guts of the .NET assembly in the debugger to find the spot in memory where the XML document can be found.  Impressive, but again, not entirely practical for programmatic access.

After some trial-and-error, I decided upon a strategy that allows us to get what we need pretty reliably.  Hopefully this helps you.

Let’s say your WSDL has a service called HelloService (from the WSDL’s <service> tag).  When you add it as a Web Reference, Visual Studio automatically creates a nice class called HelloService derived from SoapHttpClientProtocol.  What we want to do is this:

HelloService svc = new HelloService();
svc.doSomething();
string rawXml = svc.Xml;

To add the Xml property, we could just add directly to the auto-generated class.  But that’s not ideal because if you get a new WSDL, for example, and re-generate the class, you’ll destroy changes you’ve made.  So let’s create our own subclass of HelloService.  That’s easy enough, something like this should do it:

namespace MyProject
{
   public class MyHelloService : HelloService
   {
      public MyHelloService() : base() { }
      public string Xml { get { return null; } } // Placeholder for now
   }
}

Now we should change our code to use this new subclass:

MyHelloService svc = new MyHelloService();
svc.doSomething();
string rawXml = svc.Xml;

So far so good, but what now?

Now we need to intercept the XML that gets created during SoapHttpClientProtocol.Invoke().  There’s a convenient point for doing that: GetWriterForMessage().  It’s responsible for returning an XmlWriter that gets used to build the SOAP XML message.

To do that, we’ll need our own XmlWriter that wraps another XmlWriter, the original one returned by the HelloService class.  Our strategy is to intercept all calls to the original XmlWriter and write those to our StringWriter.  It’s an XmlWriterSpy.  Here’s how it looks (some methods omitted for brevity):

namespace MyProject
{
    using System.IO;
    using System.Xml;
    public class XmlWriterSpy : XmlWriter
    {
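        private XmlWriter _me;      // The original writer; every call is forwarded to it
        private XmlTextWriter _bu;  // Buffer writer that mirrors every call into _sw
        private StringWriter _sw;   // Accumulates the mirrored XML as text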

        public XmlWriterSpy(XmlWriter implementation)
        {
            _me = implementation;
            _sw = new StringWriter();
            _bu = new XmlTextWriter(_sw);
            _bu.Formatting = Formatting.Indented;
        }
        public override void Flush()
        {
            _me.Flush();
            _bu.Flush();
            _sw.Flush();
        }
        public string Xml { get { return (_sw == null ? null : _sw.ToString()); } }

        public override void Close() { _me.Close(); _bu.Close(); }
        public override string LookupPrefix(string ns) { return _me.LookupPrefix(ns); }
        public override void WriteBase64(byte[] buffer, int index, int count) { _me.WriteBase64(buffer, index, count); _bu.WriteBase64(buffer, index, count); }

        // ...more overrides omitted, you get the idea...

        public override void WriteSurrogateCharEntity(char lowChar, char highChar) { _me.WriteSurrogateCharEntity(lowChar, highChar); _bu.WriteSurrogateCharEntity(lowChar, highChar); }
        public override void WriteWhitespace(string ws) { _me.WriteWhitespace(ws); _bu.WriteWhitespace(ws); }

    }
}
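Since the listing above leaves out most of the overrides, here is a self-contained sketch of the same tee idea that compiles and runs on its own, without a web service. It is not the downloadable file mentioned in the update; TeeXmlWriter and Demo are hypothetical names, and the set of overrides assumes the abstract members of XmlWriter in current .NET (older framework versions may require a few more, such as WriteName or WriteNmToken).

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical complete tee-style writer: every call is forwarded to the
// wrapped writer and mirrored into an in-memory text copy.
public class TeeXmlWriter : XmlWriter
{
    private readonly XmlWriter _inner;                         // writer the caller really wants
    private readonly StringWriter _text = new StringWriter();  // holds the mirrored XML
    private readonly XmlTextWriter _copy;                      // writes the mirror into _text

    public TeeXmlWriter(XmlWriter inner)
    {
        _inner = inner;
        _copy = new XmlTextWriter(_text);
        _copy.Formatting = Formatting.Indented;
    }

    public string Xml { get { return _text.ToString(); } }

    // State and prefix lookups are answered by the wrapped writer only.
    public override WriteState WriteState { get { return _inner.WriteState; } }
    public override string LookupPrefix(string ns) { return _inner.LookupPrefix(ns); }
    public override void Flush() { _inner.Flush(); _copy.Flush(); }
    public override void Close() { _inner.Close(); _copy.Close(); }

    // Everything below is forwarded to both writers.
    public override void WriteStartDocument() { _inner.WriteStartDocument(); _copy.WriteStartDocument(); }
    public override void WriteStartDocument(bool standalone) { _inner.WriteStartDocument(standalone); _copy.WriteStartDocument(standalone); }
    public override void WriteEndDocument() { _inner.WriteEndDocument(); _copy.WriteEndDocument(); }
    public override void WriteDocType(string name, string pubid, string sysid, string subset) { _inner.WriteDocType(name, pubid, sysid, subset); _copy.WriteDocType(name, pubid, sysid, subset); }
    public override void WriteStartElement(string prefix, string localName, string ns) { _inner.WriteStartElement(prefix, localName, ns); _copy.WriteStartElement(prefix, localName, ns); }
    public override void WriteEndElement() { _inner.WriteEndElement(); _copy.WriteEndElement(); }
    public override void WriteFullEndElement() { _inner.WriteFullEndElement(); _copy.WriteFullEndElement(); }
    public override void WriteStartAttribute(string prefix, string localName, string ns) { _inner.WriteStartAttribute(prefix, localName, ns); _copy.WriteStartAttribute(prefix, localName, ns); }
    public override void WriteEndAttribute() { _inner.WriteEndAttribute(); _copy.WriteEndAttribute(); }
    public override void WriteCData(string text) { _inner.WriteCData(text); _copy.WriteCData(text); }
    public override void WriteComment(string text) { _inner.WriteComment(text); _copy.WriteComment(text); }
    public override void WriteProcessingInstruction(string name, string text) { _inner.WriteProcessingInstruction(name, text); _copy.WriteProcessingInstruction(name, text); }
    public override void WriteEntityRef(string name) { _inner.WriteEntityRef(name); _copy.WriteEntityRef(name); }
    public override void WriteCharEntity(char ch) { _inner.WriteCharEntity(ch); _copy.WriteCharEntity(ch); }
    public override void WriteWhitespace(string ws) { _inner.WriteWhitespace(ws); _copy.WriteWhitespace(ws); }
    public override void WriteString(string text) { _inner.WriteString(text); _copy.WriteString(text); }
    public override void WriteSurrogateCharEntity(char lowChar, char highChar) { _inner.WriteSurrogateCharEntity(lowChar, highChar); _copy.WriteSurrogateCharEntity(lowChar, highChar); }
    public override void WriteChars(char[] buffer, int index, int count) { _inner.WriteChars(buffer, index, count); _copy.WriteChars(buffer, index, count); }
    public override void WriteRaw(char[] buffer, int index, int count) { _inner.WriteRaw(buffer, index, count); _copy.WriteRaw(buffer, index, count); }
    public override void WriteRaw(string data) { _inner.WriteRaw(data); _copy.WriteRaw(data); }
    public override void WriteBase64(byte[] buffer, int index, int count) { _inner.WriteBase64(buffer, index, count); _copy.WriteBase64(buffer, index, count); }
}

public static class Demo
{
    // Writes a tiny document through the tee and returns the mirrored copy.
    public static string Run()
    {
        var sent = new StringWriter(); // stands in for the real request stream
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var inner = XmlWriter.Create(sent, settings))
        {
            var tee = new TeeXmlWriter(inner);
            tee.WriteStartElement("greeting");      // convenience overload, routed through the override above
            tee.WriteAttributeString("lang", "en");
            tee.WriteString("hello");
            tee.WriteEndElement();
            tee.Flush();
            return tee.Xml;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

The convenience methods on XmlWriter (WriteElementString, WriteAttributeString and friends) are implemented in terms of the abstract members, so overriding just those is enough to capture everything written through the wrapper.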

Lastly, we just need to use this new XmlWriterSpy class in our MyHelloService class.  Here’s how:

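namespace MyProject
{
   using System.Web.Services.Protocols;
   using System.Xml;

   public class MyHelloService : HelloService
   {
      private XmlWriterSpy writer; // Spy for the most recent request; null before the first call
      public MyHelloService() : base() { }

      // Called during Invoke() to get the writer that serializes the outgoing SOAP message.
      protected override XmlWriter GetWriterForMessage(SoapClientMessage message, int bufferSize)
      {
         // Wrap the writer the base class would have used so we see every write.
         writer = new XmlWriterSpy(base.GetWriterForMessage(message, bufferSize));
         return writer;
      }

      // Raw SOAP XML of the most recent request, or null if nothing has been sent yet.
      public string Xml { get { return (writer == null ? null : writer.Xml); } }
   }
}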

There you have it.

Update (April 2, 2010): As Robert pointed out, it would be nice to have the XmlWriterSpy easily downloadable in its entirety.  True!  Here it is.


Plumbing in This Old House

This passage resonates with me some days:

Programming starts out like it’s going to be architecture–all black lines on white paper, theoretical and abstract and spatial and up-in-the-head. Then, right around the time you have to get something fucking working, it has this nasty tendency to turn into plumbing.

It’s more like you’re hired as a plumber to work in an old house full of ancient, leaky pipes laid out by some long-gone plumbers who were even weirder than you are. Most of the time you spend scratching your head and thinking: Why the fuck did they do that?

From the novel "The Bug" by Ellen Ullman.  I haven’t read the book (yet).
