Command Line Tool As An Integration Point

Here is a software design pattern that you won’t find in Martin Fowler’s design pattern list. I don’t think this pattern is new or unique, but I have not seen it described or used anywhere else. At the right time and place it can be very handy, if a little bit triggering to software purists. I have needed this pattern three times in the past 10 years, so I thought I’d share them as examples. For context - all my projects were Python projects.

Imagine yourself in any of these situations:

  1. You need to write some code, but the language and library support for what you want to do is lacking, e.g. the most popular library is outdated.
  2. You need to make a tool that is written in a different programming language work with your infrastructure, but one feature the tool relies on is missing.
  3. You just cannot be bothered to shave the yak and write an efficient solution from scratch using the libraries available.

The approach I’ve seen others take is to try and find a compiled binary library that has the required features, and integrate it via ctypes or similar. Or dig into a foreign codebase and implement the feature. Or spend days understanding the tooling to implement the solution. These are good approaches, but sometimes I felt like I was solving a problem that had already been solved. Or I was spending time doing what others could probably do much better than I could.

Luckily for me, in each of the three situations mentioned above, I found that a command line tool existed that did what I wanted. So without much glamour, I wrote Python wrappers around those tools. One of them is open-source (not useful anymore, but you can admire the gory code). The other two were built in a private setting and cannot be shared, but I hope to give a flavour of how they worked.

poor-smime-sign

This one hails from 2015, when Apple released their Apple Wallet and I worked at a company that was selling event tickets. A perfect use-case for Wallet Passes. Apple designed the pass security very well - it required cryptographic signing. Unfortunately for us, Python 3 cryptography support was lacking at that time. As we all know, implementing your own crypto is a bad idea. Thus, I found myself cornered. Luckily for me, there was a way to sign the passes using the openssl command line tool. Lo and behold, I had a prototype, open-sourced here, that did the signing, and we shipped the feature soon after. The name was inspired by another open-source tool, whose name I sadly cannot recall.
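To give a flavour of the idea, here is a minimal sketch of wrapping openssl for S/MIME signing. It is not the code from poor-smime-sign; the function names and file paths are hypothetical, and it assumes an `openssl` binary on the PATH with PEM-encoded certificate and key files.

```python
import subprocess
from typing import List, Optional


def build_sign_command(
    signer_cert: str,
    signer_key: str,
    in_path: str,
    out_path: str,
    extra_certs: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a detached S/MIME signature in DER form."""
    cmd = [
        "openssl", "smime", "-sign", "-binary",
        "-signer", signer_cert,   # certificate used for signing
        "-inkey", signer_key,     # matching private key
        "-in", in_path,           # file to sign
        "-out", out_path,         # where the signature is written
        "-outform", "DER",
    ]
    if extra_certs:
        # Optional intermediate certificates to bundle into the signature.
        cmd += ["-certfile", extra_certs]
    return cmd


def sign_file(signer_cert: str, signer_key: str, in_path: str, out_path: str,
              extra_certs: Optional[str] = None) -> None:
    """Run openssl and raise CalledProcessError if it exits non-zero."""
    cmd = build_sign_command(signer_cert, signer_key, in_path, out_path, extra_certs)
    # Passing a list (no shell=True) avoids shell quoting problems.
    subprocess.run(cmd, check=True, capture_output=True)
```

Keeping the command construction separate from the subprocess call makes the wrapper easy to unit test without openssl installed.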

The tool served its purpose well, but eventually we moved to a different library that had the required functionality. I always thought this was an amusing hack that I’d never have to do again. Until…

H2oCluster

Fast forward to 2019, I was working at a company that was using H2O.ai for automated machine learning. The software was written in Java and had a thin Python wrapper around its REST API. You could launch a machine, start the software, throw data at it, tune some knobs, drink a cup of coffee, and get multiple different models to evaluate. The fun part was that our datasets were large, like 800GB uncompressed large, and the software was keen on making a copy of the data after each operation on the dataframe. Add a new feature column? Copy. Drop a column? Copy. Split the data? Copy. Maybe it did try to be efficient, but it wasn’t doing great. Incidentally, I learned how to read and analyse Java garbage collection logs, and I ran the largest EC2 instances available at the time (x1e.32xlarge if I recall), which we had to get approved for. But I digress.

H2O.ai had a nice feature, called cluster mode. In theory, you would spin a few machines up, launch the software, and say “you’re now part of the same cluster named foo.” The machines would then use multicast networking to discover each other, and connect. From a user’s point of view, it doesn’t matter if it’s a single instance or 10, it looks the same. But from the resource point of view it could scale a long way. Sounds great, until I learned that multicast networking is not supported in AWS VPCs. Luckily for me, the software had a command line option to pass a file containing the IPs of all the other machines in the cluster at start-up time. You can see where this is going.

I wrote a Python script that did the following: when launched, it would use the EC2 metadata service to get its own instance id, query the EC2 API to determine which autoscaling group it was in, get the list of IPs of all the other instances in the group once they were all up and running, write it to a file and pass it to the H2O.ai executable on the command line.
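The steps above can be sketched roughly as follows. This is a reconstruction, not the original script. It assumes boto3 is installed with a region and credentials configured, that the instance metadata endpoint answers unauthenticated (IMDSv1-style) requests, and that the H2O jar accepts `-name` and `-flatfile` options with one `ip:port` entry per line. The jar path, port and function names are hypothetical.

```python
import subprocess
import time
import urllib.request
from typing import Dict, List

METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def extract_private_ips(describe_instances_response: Dict) -> List[str]:
    """Pull private IPs out of an EC2 describe_instances response."""
    ips = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ip = instance.get("PrivateIpAddress")
            if ip:  # instances that are still starting may have no IP yet
                ips.append(ip)
    return sorted(ips)


def render_flatfile(ips: List[str], port: int = 54321) -> str:
    """One 'ip:port' entry per line (assumed flatfile format)."""
    return "".join(f"{ip}:{port}\n" for ip in ips)


def discover_peer_ips(poll_seconds: int = 10) -> List[str]:
    """Wait until every instance in this autoscaling group has an IP."""
    import boto3  # third-party; imported here so the helpers work without it

    with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
        instance_id = resp.read().decode()

    autoscaling = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")
    me = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    group_name = me["AutoScalingInstances"][0]["AutoScalingGroupName"]

    while True:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]
        ids = [i["InstanceId"] for i in group["Instances"]]
        ips = extract_private_ips(ec2.describe_instances(InstanceIds=ids)) if ids else []
        if ips and len(ips) == group["DesiredCapacity"]:
            return ips
        time.sleep(poll_seconds)


def launch_h2o(cluster_name: str, flatfile_path: str = "flatfile.txt") -> None:
    """Write the flatfile and hand it to the (hypothetical) h2o.jar."""
    with open(flatfile_path, "w") as fh:
        fh.write(render_flatfile(discover_peer_ips()))
    subprocess.run(
        ["java", "-jar", "h2o.jar", "-name", cluster_name, "-flatfile", flatfile_path],
        check=True,
    )
```

Every instance in the group runs the same script, so they all end up with the same file and find each other without multicast.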

It. Worked. Like. A. Charm. I wrote an internal Python library that allowed a data scientist to launch these clusters ad-hoc with a single Python function. It was beautiful. Eventually we retired it in favour of our own tooling, but I was very proud, and again amused that this approach had worked.

Data Warehousing

More recently, I was building a data warehouse on Postgres. The data size is manageable; however, there are certain constraints about what can and cannot be moved to the warehouse. We depend heavily on Django, and avoid dealing with low-level SQL. Hence, I just did not feel like digging into psycopg2 internals to write my own “replication” tool. It also seemed like a solved problem. There are command line tools that can do the job sufficiently well for our scale and needs.

I prototyped data egress and ingress using psql to test the speed. Then I wrote a Python script that takes a Django queryset and generates the SELECT and CREATE TABLE SQL queries. The script then executes a psql -c "COPY (SELECT...) TO STDOUT" command and streams the output into a file. The CREATE TABLE is executed inside the warehouse, followed by psql -c "COPY ... FROM STDIN". The whole thing is smart enough to load into a temporary table first, and then swap it with the real table. It takes minutes to move tens of gigabytes of data. I’m sure it won’t scale infinitely, but at the same time - it works, and I am sure it would’ve taken me longer to write something comparable from scratch.
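A stripped-down sketch of that flow might look like this. It is not the actual script: the function names, connection strings and staging-table convention are made up for illustration, the SELECT and CREATE TABLE statements are passed in as plain strings rather than generated from a queryset, and table names are assumed to be trusted identifiers. It needs a psql binary on the PATH.

```python
import subprocess
from typing import List


def psql_command(dsn: str, sql: str) -> List[str]:
    """psql invocation that skips ~/.psqlrc and stops on the first error."""
    return ["psql", "-X", "-v", "ON_ERROR_STOP=1", "-d", dsn, "-c", sql]


def export_command(dsn: str, select_sql: str) -> List[str]:
    return psql_command(dsn, f"COPY ({select_sql}) TO STDOUT")


def import_command(dsn: str, table: str) -> List[str]:
    return psql_command(dsn, f"COPY {table} FROM STDIN")


def swap_sql(staging: str, real: str) -> str:
    """Replace the real table with the freshly loaded staging table."""
    return (
        f"BEGIN; DROP TABLE IF EXISTS {real}; "
        f"ALTER TABLE {staging} RENAME TO {real}; COMMIT;"
    )


def move_table(source_dsn: str, warehouse_dsn: str, select_sql: str,
               create_staging_sql: str, staging: str, real: str,
               dump_path: str = "dump.tsv") -> None:
    # 1. Stream the query result from the source database into a file.
    with open(dump_path, "wb") as out:
        subprocess.run(export_command(source_dsn, select_sql), stdout=out, check=True)
    # 2. Create the staging table in the warehouse.
    subprocess.run(psql_command(warehouse_dsn, create_staging_sql), check=True)
    # 3. Load the file into the staging table.
    with open(dump_path, "rb") as src:
        subprocess.run(import_command(warehouse_dsn, staging), stdin=src, check=True)
    # 4. Swap staging and real tables in one transaction.
    subprocess.run(psql_command(warehouse_dsn, swap_sql(staging, real)), check=True)
```

In a Django project the SELECT text could come from the queryset itself, for example via `queryset.query.sql_with_params()`, with the parameters interpolated safely before it is handed to psql.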

Ironically, as solved problems go, I found django-postgres-copy a week later. :shrug:

Final Words

Not sure what to say here, except don’t let anyone tell you that “it’s not elegant” or “not the right way to do it.” If it works, and is manageable - then it works, and that’s all that matters.

