12 Factor Apps in Data Engineering
12 Factor apps work wonderfully for transactional systems, but it takes some understanding to scale them up to big data environments.
Most of the best practices embodied in 12-factor applications align between Data Engineering and Software-as-a-Service workloads. Unfortunately, the two disciplines have a handful of key differences that make one-size-fits-all solutions difficult. Going through the 12 factors one by one, I’ll try to identify the points of difficulty and concern.
One Codebase tracked in revision control, many deploys. This is an obvious one and is simply best practice. The only caveat is how exploratory development is handled. It is exceedingly cumbersome for business users if you mandate that all queries and analytics scripts be checked into version control before execution. There needs to be either some type of sandbox for these efforts or an automated process that hides the use of version control from the end user. Either way, it is usually impractical to maintain a durable history of all executed code in version control.
Dependencies: Explicitly declaring all dependencies is a best practice; however, it may not always be possible in your distributed computation environment. The problem most often appears when you need to ship packages to all compute nodes for execution outside of the JVM in Hadoop. Solutions do exist, but they tend to be quite cumbersome and often violate the conventions of native package handling systems. For example, shipping Python code up to HDFS works, but things break if your code distribution normally includes non-Python resources such as YAML or SQL files.
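As a minimal sketch of that failure mode, assume a hypothetical my_etl package zipped and shipped to the cluster with spark-submit --py-files: reading a bundled SQL file by relative path breaks on the executors, while loading it as package data works from inside the zip.

```python
# Hypothetical layout: my_etl/queries/daily_load.sql bundled inside my_etl.zip,
# shipped to the executors with: spark-submit --py-files my_etl.zip job.py
import pkgutil


def load_query(name: str) -> str:
    # A plain open("my_etl/queries/" + name) fails on the executors because the
    # code runs from inside the zip, not from an unpacked directory on disk.
    # pkgutil.get_data resolves resources through the import system, so it works
    # whether the package sits in a directory or a zip on sys.path.
    raw = pkgutil.get_data("my_etl.queries", name)
    if raw is None:
        raise FileNotFoundError(f"{name} not found in package data")
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(load_query("daily_load.sql"))
```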
Config in the Environment: Config should be kept out of code when it refers to a specific deployment; config belongs with the code when it determines behavior. For example, the list of tables to ingest from an upstream database determines behavior and should live in code, while the credentials for that database belong in the environment of a given deployment. The rule of thumb still holds: any code should be open-sourceable at a moment’s notice without compromising any credentials.
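A minimal sketch of that split, with hypothetical variable and table names: behavior lives in code, credentials live in the environment.

```python
import os

# Behavior: which tables this pipeline ingests. Checked into version control.
TABLES_TO_INGEST = ["orders", "customers", "order_items"]

# Deployment-specific: connection details come from the environment, never code.
# These environment variable names are hypothetical.
DB_HOST = os.environ["UPSTREAM_DB_HOST"]
DB_USER = os.environ["UPSTREAM_DB_USER"]
DB_PASSWORD = os.environ["UPSTREAM_DB_PASSWORD"]


def connection_url() -> str:
    # The code stays open-sourceable; only the environment holds the secrets.
    return f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/warehouse"
```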
Backing Services as Attached Resources. A backing service is still an attached resource in the Data Engineering world; the one exception is the need to share configuration of various stripes from the backing service to client applications. Usually, some type of shim needs to be put in place to support this without unfortunate deep coupling. For example, the ability to pull configuration from a central server like Ambari allows for a slightly better abstraction.
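As a sketch of such a shim (the endpoint, keys, and response shape are assumptions, not a specific Ambari API): clients ask a thin config client for what they need and fall back to local defaults, so they never couple deeply to the backing service.

```python
import json
import os
import urllib.request

# Hypothetical central config endpoint; in practice this might be Ambari or an
# internal config service. The URL and response shape are assumptions.
CONFIG_SERVER = os.environ.get(
    "CONFIG_SERVER_URL", "http://config.internal/api/cluster-config"
)

_DEFAULTS = {
    "hdfs.namenode": "hdfs://localhost:8020",
    "hive.metastore": "thrift://localhost:9083",
}


def get_setting(key: str) -> str:
    """Fetch one setting from the central server, falling back to defaults."""
    try:
        with urllib.request.urlopen(CONFIG_SERVER, timeout=5) as resp:
            remote = json.load(resp)
        return remote.get(key, _DEFAULTS[key])
    except OSError:
        # The shim keeps clients decoupled: if the server is unreachable,
        # the application still starts with sane local defaults.
        return _DEFAULTS[key]
```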
Strict separation of the Build, Release, and Run stages of the application life-cycle is much more difficult, from a discipline perspective, with Data Engineering workflows. That doesn’t mean it can be ignored. The problem is that the rapid early iteration cycle of most data workloads doesn’t sit well with an involved release process. The other issue is that most data environments that support exploratory workloads don’t make a clear distinction between what should be allowed ad hoc and what should require a more rigorous deployment strategy. Without that clear separation, users can effectively touch prod without ever going through a deployment cycle.
Data Processes are never stateless. It is impossible to consider a data process stateless unless you can afford to lose the entire transaction on the backing system along with its management code. This is a fundamental distinction between stateless web apps and all other systems. You can ape the appearance of a stateless server by effectively declaring your own stateful backing service, but the stateful service must still exist, and it will be independent of the web app. Usually, by enforcing statelessness arbitrarily at the web application tier, you are increasing the complexity of the infrastructure without significant benefit. Fundamentally, the user-facing tier in this architecture is not where the scaling problems will emerge.
Port Binding. This is a best practice, but be aware that many of the services you might need in a Data Engineering environment may not handle port binding cleanly.
Concurrency. Scaling out via the process model should be the first choice in most cases, but don't reinvent the wheel here. A distributed system often benefits from an investment in more robust concurrency models. I'm not challenging the need for a simple concurrency pattern, but coordinating an Actor model often pays dividends. Don't rule out Akka, Erlang, Elixir, and the BEAM when you might actually need them. Even so, the fundamental rule still applies: you'll eventually need to scale beyond one process on one node. Plan for that first, before getting clever within the process.
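Before reaching for a heavier framework, the share-nothing process model is easy to sketch; here is a minimal example that fans work out across worker processes (the partition paths are made up).

```python
from multiprocessing import Pool


def process_partition(partition: str) -> int:
    # Each worker is an independent, share-nothing process; in a real job this
    # would read one partition of input and write its own output.
    return len(partition)  # placeholder for the real work


if __name__ == "__main__":
    # Hypothetical daily partitions to process.
    partitions = [f"events/dt=2020-01-{d:02d}" for d in range(1, 8)]
    with Pool(processes=4) as pool:
        results = pool.map(process_partition, partitions)
    print(sum(results))
```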
Disposability. Graceful shutdown is perhaps the biggest problem in Data Engineering. It’s critical that jobs fall back into a clean state when they fail. Unfortunately, many long-running processes carry a high cost of failure, so work needs to be componentized in a way that keeps that cost controlled. Losing a 12-hour job sucks; corrupting data is inexcusable.
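One common way to keep failure cheap is to stage output and commit it atomically only on success, so a killed or failed run leaves nothing half-written. A minimal local-filesystem sketch, with hypothetical paths (on HDFS the analogue is typically a staging directory plus a rename):

```python
import os
import shutil
import tempfile


def run_job(output_dir: str = "/data/output/daily") -> None:
    """Write job output to a staging dir and commit it atomically on success."""
    parent = os.path.dirname(output_dir)
    os.makedirs(parent, exist_ok=True)
    # Stage inside the same parent directory so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix=".staging_", dir=parent)
    try:
        with open(os.path.join(staging, "part-0000.csv"), "w") as out:
            out.write("placeholder for the real transformation output\n")
        # Atomic commit: the output path appears only if the whole job succeeded.
        # (Assumes output_dir does not already exist from a previous run.)
        os.replace(staging, output_dir)
    except BaseException:
        # A crash or kill mid-run leaves no partial output behind.
        shutil.rmtree(staging, ignore_errors=True)
        raise
```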
Dev / Prod Parity. I agree completely with the desire to achieve dev/prod parity in systems. Unfortunately, data environments don't map cleanly onto continuous delivery environments. The obvious first reason is the scheduling of batch processing: a job that runs only once a day or once a week can only be deployed on that same schedule to any effect, so deploying more often is effectively a no-op. The other critical issue is ensuring you can develop proper business logic for analytics workflows without access to prod data. That can be a difficult transition for some teams, but it is absolutely critical in a number of security scenarios.
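One practical way to develop business logic without prod access is to generate synthetic data that matches the production schema. A small sketch, with a made-up "orders" schema:

```python
import csv
import random
from datetime import date, timedelta

# Hypothetical schema mirroring a production "orders" table.
COLUMNS = ["order_id", "customer_id", "order_date", "amount"]


def synthetic_orders(n: int, path: str) -> None:
    """Write n fake rows that match the prod schema but contain no real data."""
    start = date(2020, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for i in range(n):
            writer.writerow([
                i,
                random.randint(1, 500),
                start + timedelta(days=random.randint(0, 364)),
                round(random.uniform(5, 500), 2),
            ])


if __name__ == "__main__":
    synthetic_orders(1000, "dev_orders.csv")
```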
Logs. Be prepared for verbosity. A lot of big data tools have special log handling to cope with volume. There's nothing wrong with routing logs to an ELK stack or an aggregation service, but be prepared for the volume (and cost). It's often necessary to implement supplemental services to filter or redirect logs within the deployment environment.
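As a small sketch of that kind of supplemental filtering, here is a stdlib logging filter that drops DEBUG chatter from noisy libraries before the records ever reach a handler or log shipper (the logger names are just examples):

```python
import logging
import sys


class DropNoisyDebug(logging.Filter):
    """Drop DEBUG records from known-chatty loggers before they are shipped."""

    NOISY = ("py4j", "urllib3", "kafka")  # example noisy library loggers

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno <= logging.DEBUG and record.name.startswith(self.NOISY):
            return False  # swallowed; never reaches the handler or aggregator
        return True


# Stand-in for whatever handler actually ships logs off the node.
handler = logging.StreamHandler(sys.stdout)
handler.addFilter(DropNoisyDebug())
logging.basicConfig(level=logging.DEBUG, handlers=[handler])
```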
Admin Processes. One-off, repeatable scripts are critical for sanity in deployments, and they are perhaps even more critical in a big data environment where your infrastructure may not be fully disposable. Disposable infrastructure enforces the need for them; long-running infrastructure encourages laziness, and that is the biggest problem. The pain points are somewhat masked until the smell is entrenched in your operations, and in my experience, purging yourself of the stink is invariably painful and invariably frustrating.
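A sketch of what repeatable means in practice: an admin script that is safe to run twice because every statement checks its own state first (sqlite3 stands in for the real warehouse connection; the table and values are hypothetical).

```python
import sqlite3  # stand-in for the real warehouse connection

DDL = """
CREATE TABLE IF NOT EXISTS ingest_audit (
    run_id     TEXT PRIMARY KEY,
    started_at TEXT NOT NULL
)
"""


def ensure_audit_table(conn: sqlite3.Connection) -> None:
    # Idempotent: IF NOT EXISTS means rerunning the script is harmless.
    conn.execute(DDL)
    conn.commit()


def record_run(conn: sqlite3.Connection, run_id: str, started_at: str) -> None:
    # INSERT OR IGNORE keeps reruns from duplicating audit rows.
    conn.execute(
        "INSERT OR IGNORE INTO ingest_audit (run_id, started_at) VALUES (?, ?)",
        (run_id, started_at),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect("admin.db")
    ensure_audit_table(conn)
    record_run(conn, "2020-01-01", "2020-01-01T02:00:00Z")
```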
The 12-factor app is undoubtedly a great model for making smart choices in the design of web systems. Architecturally, it can inform the design of systems far removed from standard web app development and deployment. As often as I wish my Hadoop cluster was a bit cleaner in some ways, these ideas need a bit of adjustment before we push them to our big data environments.
If you're interested in diving deeper into 12-factor apps, I recommend starting here: https://12factor.net/admin-processes