<h2>Querying Avro Files With Presto. It is possible!</h2>
<p>2018-08-07, Don Albrecht</p>
<p>Presto’s Hive connector proudly declares that it supports Avro. However, if you’ve ever tried to query a Hive-built Avro table from Presto, you may have gotten a pretty cryptic error message:</p>
<p>cannot find field id from [0:error_error_error_error_error_error_error, 1:cannot_determine_schema, 2:check, 3:schema, 4:url, 5:and, 6:literal]</p>
<p>If you try searching for solutions to this problem, or for Avro Presto issues in general, the results aren’t particularly promising. I found this: <a href="https://groups.google.com/forum/#!topic/presto-users/uPApa6ih2B0">https://groups.google.com/forum/#!topic/presto-users/uPApa6ih2B0</a></p>
<p>"If you can, I would simply avoid this format entirely."</p>
<p>In that thread, the engineering team had given up on the problem six months earlier, and plenty of other forum posts suggest that Presto’s Avro support is fundamentally broken. It turns out that there’s a pretty simple fix.</p>
<p>Presto can’t deal with schemas stored in referenced external files (the avro.schema.url SerDe property). Instead, you can place the schema directly in the CREATE TABLE statement with avro.schema.literal:</p>
<p>CREATE EXTERNAL TABLE people
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
"type" : "record",
"name" : "people",
"namespace" : "testing.hive.avro.serde",
"fields" : [ {
"name" : "first",
"type" : "string",
"doc" : "first name"
}, {
"name" : "last",
"type" : "string",
"doc" : "last name"
}, {
"name" : "record_id",
"type" : "int",
"doc" : "record id"
} ]
}
')</p>
<p>If you need to extract the schema from your existing Avro files, it’s pretty simple. Just use avro-tools.jar, as described here (a rough sketch follows below):
<a href="https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/">https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/</a></p>
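<p>For reference, here is a minimal sketch of pulling the embedded writer schema out of an Avro container from Python rather than the command line. The file name people.avro and the fastavro dependency are assumptions for illustration, not from the original post; the resulting JSON can be pasted straight into avro.schema.literal.</p>
<pre>
# Minimal sketch, assuming `pip install fastavro` and a local file named
# people.avro. The command-line equivalent is:
#   java -jar avro-tools.jar getschema people.avro
import json
import fastavro

with open("people.avro", "rb") as f:
    reader = fastavro.reader(f)
    # fastavro exposes the schema stored in the Avro container header.
    print(json.dumps(reader.writer_schema, indent=2))
</pre>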
<h2>Momentum, Resilience and Dave Ramsey</h2>
<p>2018-08-06, Don Albrecht</p>
<p>For those of you unaware, Dave Ramsey runs a popular provider of financial advice based out of Tennessee. While he operates in a radically different domain, he can provide a useful model for managing the resources within a startup or business environment as well. He’s known for incredible charisma, practical advice, and his baby steps methodology. Despite a demonstrable track record of helping people improve their financial condition, he is quite controversial among many financial advice circles. In fact, you can find entire sites dedicated to denouncing him. The reason for this is quite simple; he advocates a collection of practices that go against conventional financial wisdom. If you follow his rules, the mathematical models show you losing to more traditional financial methods. Strategies like eliminating borrowing, paying off your mortgage early and maintaining sizeable liquid cash reserves represent substantial opportunity costs compared to having as much money at work in the market as possible and leveraging debt and usage incentives efficiently.</p>
<p>But Ramsey’s reasoning is sound and applies to realms of behavior and strategy well beyond household finance. In particular, Ramsey’s plan prioritizes momentum and resilience. He focuses on building momentum by front-loading the quickest possible wins, much like startups that embrace a "ship weekly" mentality. It will take you years, maybe decades, to fully complete his program, but you get constant wins and reinforcement early in the process. Paying off small debts first is the equivalent of getting validation for a POC or getting your early adopters happily invested in the platform. There are many advantages to hitting the market with a fully fleshed-out product offering that appeals to the broadest possible audience, but it’s the longest time to market, and time imposes risk. Having small early wins increases the likelihood of maven users starting to advocate for your product and increases the rate of learning for the product team. Knowing something works gives you the confidence to walk further down the path. Not having confirmation that something is working will cause you to devalue your work or stray down the wrong path.</p>
<p>Similarly, Ramsey emphasizes resilience. Instead of focusing on the highest return first, Ramsey encourages people to invest in safety early and often within the plan. By maintaining and growing a safety net, you reduce the risk of unforeseen events. In Ramsey’s words, you turn a “catastrophe into an inconvenience.” He gives the example of a car failure being either a bad day, if you have the cash for the repair, or a major catastrophic event, if it leaves you unable to execute a fix and without transportation to work. Good capital management and sufficient runway appear frequently in every piece of startup advice for the same reason. The fundamental rule for any business is “Never run out of money.” You can recover from every other sin or crisis provided you have the cash reserves you need. In most cases, maximizing flexibility and resilience means minimizing leverage. If I’ve pulled money from the future into the present, I’m going to need to repay that money. I can’t deviate from that obligation, and it can constrain my options considerably. Perhaps the most significant recent case study of the consequences a substantial debt obligation places on a business is Toys R Us. Most analysts agree that the firm would have made it had it not carried such a massive debt load.</p>
<p>In general, I find the advice to push money into the future and the power of leveraging early wins to be compelling for both personal and business strategy. We have very different goals and time horizons as humans and businesses. As humans, we have significantly fewer options to distance ourselves from liability, and in most countries we have a very slim safety net. Businesses, however, have significant liability protections and, due to the agency problem, much lower perceived risk for those making decisions. Businesses also frequently start with a liability, in the form of investor equity, that never goes away. Managing this liability skews the equations a fair bit in decision making. Effectively, it is impossible for a business to lower its obligation below a certain threshold given investor demands for ROI. Even a sole proprietor operates under a significantly vested investor interest, even if she isn’t aware of it in those terms.</p>
<h2>An Introduction to Technical Capital</h2>
<p>2018-08-05, Don Albrecht</p>
<p>We spend a fair amount of time talking about technical debt. While there’s a fair bit of content to discuss around these issues, today I’d like to bring attention to the flip side of this issue. I call this concept Technical Capital, as it is effectively the inverse of technical debt. Think of Tech Cap as the accumulated value of your technology investments. An easy way to think about it is all the stuff you’d need to undo or redo if you were to implement a ground-up rebuild.</p>
<p>Superficially, this covers everything Joel Spolsky wrote about rewrites quite eloquently 18 years ago: <a href="https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/">https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/</a>. Spolsky focused on the layers of bug fixes and domain knowledge that accrue into an application’s code base over time. But the full scale of technical capital extends significantly beyond the accumulated logic in your production application. In most organizations, your technical capital extends into the 80% of your code that runs the business rather than the product, and that code drives a tremendous amount of value for the company. It runs your accounting, marketing, and logistics systems, ties your HR and operations systems together, and provides the full high-altitude view of the company for executives.</p>
<p>Unfortunately, many people inside the tech org of a company are blind to most of the technical capital at work in the firm. Recently, I presented the Data Lake architecture to the product development teams within my organization. One senior dev who’d been with the company for over 2 years asked quite bluntly, “Why do we bother building a data lake?” He had no clue that over 70% of the code at work within the company existed outside of the primary application repository on GitHub. At the time, I was utterly blindsided by the question. The value of the Data Lake seemed well communicated within the org, but looking around the room, I realized that product development didn’t have a clue about its value. Since then, I’ve done an analysis of lines of code committed in various languages across all of the corporate repositories, and I’ve started including a slide in all of my roadmap and introduction presentations (a toy version of that analysis is sketched below). Jaws drop when I demonstrate that the language of the production application is barely in second place for popularity behind Python and that R is almost equal in scale.</p>
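<p>For what it’s worth, a quick-and-dirty version of that lines-of-code survey only takes a short script. The sketch below is illustrative only; the repos directory and the extension-to-language map are invented placeholders, not the actual corporate setup or the analysis I ran.</p>
<pre>
# Illustrative sketch with invented paths: count lines of code per language
# across a folder of cloned repositories, grouping files by extension.
import os
from collections import Counter

EXT_TO_LANG = {".py": "Python", ".r": "R", ".sql": "SQL", ".js": "JavaScript", ".java": "Java"}

def count_lines_by_language(repos_root):
    totals = Counter()
    for dirpath, dirnames, filenames in os.walk(repos_root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip git internals
        for name in filenames:
            lang = EXT_TO_LANG.get(os.path.splitext(name)[1].lower())
            if lang is None:
                continue
            with open(os.path.join(dirpath, name), errors="ignore") as f:
                totals[lang] += sum(1 for _ in f)
    return totals

if __name__ == "__main__":
    for lang, lines in count_lines_by_language("./repos").most_common():
        print(f"{lang}: {lines}")
</pre>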
<p>Another major issue when thinking about Technical Capital is how it maps into the company’s human capital. In a mature organization, especially one that embraces a self-service operating model, many of the people with significant technical awareness don’t appear in a superficial glance at the org chart. Instead, they make up your organization's shadow dev org. This undocumented pool of expertise will need retraining if you ever engage in a significant rewrite.</p>
<p>I’m still working on my personal theory of technical capital, but I thought this would be an excellent introduction to the concept. It needs to be a factor in any significant technical decision. In my experience, most technologists are well aware of this to some degree. In particular, those with a technical background tend to be much less likely to cite “Sunk Costs” in arguments around pivoting from or otherwise abandoning previous technology investments. To be clear, I’m not saying never to discard technical capital in the name of innovation or technical debt refactoring; I’m just saying that you probably have significantly more investment to rebuild on the way back to a steady state than you would reasonably expect without a much more in-depth examination of your code and your org.</p>
<h2>Balancing Elegance and Operations in ETL</h2>
<p>2018-08-04, Don Albrecht</p>
<p>Imagine the perfect data warehouse for your business. It probably consists of a set of elegant Stars or even Neutron Stars for each of your critical business entities, organized into a relatively simple snowflake. It would be easy to manage, and all of your calculations would be in one place. There’d be a single point of access, which would make governance and change management easy. From a software engineering perspective, we never risk the duplication of code. Unfortunately, this architecture introduces a fair bit of risk that’s frequently unanticipated.</p>
<p>Let’s compare this design to an alternative strategy of purpose-oriented datamarts. You can see the differences between the two architectures below:</p>
<p><img src="../../blog/balancing-elegance-and-operations-in-etl/paper.postart.6.png" alt="2 ETL Strategies"></p>
<p>In Strategy A, we produce smaller, much more focused datasets. There’s a fair bit of redundancy between them, which is undesirable. Just how undesirable depends on your technology infrastructure and governance requirements, but let's take the extreme example of a pure GUI or SQL-driven ETL framework combined with strict governance requirements. In this case, we cannot share logic between individual pathways. At best, we’ll copy & paste between each data mart when a revision to the business logic is needed; at worst, we’ll lift and shift logic between pathways in some way. These strategies increase the risk of errors, and we’ll probably miss a pathway or make some other mistake in the process. In the best-case scenario, our code is procedural/functional instead of purely declarative, and we have the benefit of shared libraries, shared DAG nodes, and unit tests. All of these tools dramatically reduce the maintenance burden and risk. Complexity increases, but we’ve mitigated it a fair amount at the cost of somewhat higher setup and initial development costs. A rough sketch of the shared-library approach follows below.</p>
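<p>To make the shared-library idea concrete, here is a minimal sketch of a single business rule reused by two purpose-oriented data marts. The function, column, and mart names are invented for illustration; the point is only that the rule is defined once, so every pathway picks up a revision together and a single unit test covers them all.</p>
<pre>
# Minimal sketch with invented names: one shared business-logic function feeds
# several purpose-oriented data marts, so a rule change happens in one place.
import pandas as pd

def standardize_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Shared business rule: net revenue = gross - discounts - refunds."""
    out = orders.copy()
    out["net_revenue"] = out["gross"] - out["discounts"] - out["refunds"]
    return out

def build_finance_mart(orders: pd.DataFrame) -> pd.DataFrame:
    # Finance cares about monthly totals.
    return standardize_revenue(orders).groupby("month", as_index=False)["net_revenue"].sum()

def build_ops_mart(orders: pd.DataFrame) -> pd.DataFrame:
    # Operations cares about per-warehouse detail.
    return standardize_revenue(orders).groupby(["month", "warehouse"], as_index=False)["net_revenue"].sum()
</pre>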
<p>However, the bigger question is: why do we want to do this in the first place? What is the benefit of ditching the monolithic data warehouse? Perhaps the most substantial benefit of creating discrete data warehouses is the reduced size of each warehouse. Reduced size won’t make a significant impact on query times if you store the underlying data in a columnar format, but it reduces computation overhead and can hopefully improve delivery times. It also frees us to prioritize specific workflows in the event of scarcity. Legal compliance reporting, operations, and customer needs trump accounting, data science, and other analytics workloads. Increased parallelization of the data mart builds is another option, provided our infrastructure can support it. We also gain the ability to manage scope a fair bit. After all, it’s easier to deliver a small, well-defined set of features for a small number of users. These smaller data marts also load easily into high-performance reporting databases downstream.</p>
<p>So, which strategy is right for you? It depends on your organizational culture more than anything. Creating separate data marts can be ideal for firms that have the engineering maturity to manage the diverse code paths and enforce consistent data definitions. If you don’t have the maturity to manage the complexity, you’re looking at the high risk of swampiness using this strategy, and you need to tread carefully.</p>
<h2>Why I Built This Site With Lektor And Would I Recommend It?</h2>
<p>2018-08-03, Don Albrecht</p>
<p>When I built the new DonAlbrecht.com I had a few primary objectives:</p>
<ol>
<li>I wanted to keep costs as low as possible</li>
<li>I didn’t want to run a server</li>
<li>I didn’t want to deal with comment maintenance.</li>
<li>I still wanted full control over the theming and content of the site.</li>
<li>I needed robust, simple text backups for all content</li>
</ol>
<p>I think the first one is pretty self-explanatory; I don’t know anyone who wants to keep their costs as high as possible. The next two spring from lessons learned over a decade of running Wordpress and other PHP-based CMSs. Even with the low cost of Amazon, I didn’t want to take on the burden of constant security updates for an open Linux server hosting a website, and although Plesk-based shared hosting environments are wonderful, they still leave you open to a pretty large surface area for security and spam issues. The final point stems from a desire to preserve URLs over time. If I ever want to change the implementation of the site in the future, it was critical that my content existed in some external repository. Dumps of the underlying database or other complex export formats don’t work in my experience. In a perfect world, the content would consist solely of markdown or some other robust, easily parsed and transformed format.</p>
<p>I spent some time investigating alternative hosting solutions or even going with a shared publishing option like Medium. In all cases, the hosted solutions kept coming up short due to the tension between keeping costs down, long-term flexibility and the durability of my data.</p>
<p>Eventually, I came to the idea of a static content delivery system. Content would be stored in plain text in the source of the site and committed into git for durability. I also would leverage a simple file format, so changing to a different platform wouldn’t be a painful exercise in the future. Jekyll was an obvious potential choice. I also explored Hexo, Hugo, and Pelican. Eventually, I settled on Lektor. Python is my daily driver programming language, which makes me partial to a Python-based solution like Lektor or Pelican instead of a JavaScript or Ruby implementation. Lektor also had the benefit of a GUI editor and staging server that makes life a fair bit easier with content development. The deciding factor, however, was Lektor’s strong support for custom content types.</p>
<p>Mostly, the workflow for this site is as follows. I write the original content in Evernote, from whatever device I happen to have handy. I then cut and paste this content into Lektor running on my office desktop, set up all of the metadata and preview the content locally. I then check the resulting files into git and publish the site to S3. I host the entire site in S3 and distribute content via CloudFront. Total cost is about $0.30 per month.</p>
<p>Half a year later, I’m relatively happy with my decision. That’s not to say everything is roses. Lektor has worked well for creating the primary site. Building out the templates has been a learning curve, and I still don’t have a functional AMP implementation. Perhaps the biggest drawback is the complete lack of interactivity in the setup. Not supporting comments is nice, but having no feedback system is a bit of a downer. I’m hoping to implement some contact form soon.</p>
<p>Overall, Lektor has been an excellent platform for the site. It’s simple, flexible and surprisingly user-friendly. It’s not something the average user can leverage to get up and running without some help, but an excellent platform for anyone with basic cloud & HTML skills.</p>
<h2>My Cancer Journey</h2>
<p>2018-08-02, Don Albrecht</p>
<p>In the spring of 2013, my father died of cancer. It was fast, untreatable, and unexpected. In fact, he went from a trip to the emergency room to deceased in about 4 weeks. 2 years later, I found a lump in my neck. I spent the next 2 months working my way through a progressively more intense series of medical tests, culminating in a surgical biopsy of the lymph nodes in my neck. It turns out I suffered from Acute Lymphoblastic Leukaemia. Luckily, I responded to chemo almost immediately; unfortunately, I’m still not done with treatment as I write this.</p>
<p>The treatment for ALL is both intense and long, and I’ve dealt with the side effects of chemo at various levels of unpleasantness since 2015. Three years is a long time and there are many stories to tell. Today, I thought I’d give the simplest introduction to my treatment and my life over the past several years. The treatment protocol I’ve experienced breaks down into three distinct phases. Each phase has had marked and rather distinct impacts on my day-to-day life.</p>
<p>Phase 1 was the shortest and most intense phase. It’s usually referred to as induction, and the primary goal is to induce remission quickly and efficiently. Given the intensity of the treatment and risk of side effects, I spent these 30 days inpatient. This was definitely a good thing. By the time of diagnosis, my spleen had enlarged dramatically by collecting innumerable leukaemia cells. Shortly after the start of chemotherapy, the cells died so rapidly that my spleen collapsed and ruptured. It happened so quickly I barely had time to call the nurse before face-planting on the floor. I wound up needing emergency surgery to repair 4 bleeds.</p>
<p>Phase 2 is called Consolidation. The goal is to use a variety of chemo processes over 6-8 months to kill any lingering cancer that may exist. The treatment is still intense, but you can live at home and actually go back to being productive. In my case, infections became the bane of my existence. For most of this phase, my immune system effectively didn’t exist and I existed in a narrow bubble between my house and the cancer ward. The rest of my contact with the outside world was entirely electronic.</p>
<p>Phase 3 is my current phase of treatment and by far it’s been the easiest to cope with. This phase is referred to as Maintenance and the goal is pretty simple: in case any leukaemia cells managed to stay dormant through both of the previous phases, we’ll get them on their way out. It still sucks, and you still take chemo meds daily, weekly, and monthly, but your hair grows back and you usually have an immune system. There were a few times when my immune system collapsed and I had to retreat to my bubble again, but these were minor. I was also pretty limited in travel; even simple air trips could leave me incapacitated with some bug, and if I get sick, recovery takes weeks instead of days.</p>
<p>I have a few more months of phase 3 remaining and I’m trying to wrap my head around what’s next. For 3 years, ambition has been the least of my concerns. It’s hard to prioritise career when you aren’t sure how much time you have to pursue your dreams.</p>
<h2>Welcome to Blogust</h2>
<p>2018-08-01, Don Albrecht</p>
<p>I’ve not been as productive at populating this new site as I’d hoped. There are plenty of legitimate reasons, but I’m just not making the progress I’d hoped for and intended when I started the new site. I write for this site as a way of putting my thoughts into a more concrete form and potentially as a form of self-promotion. I don't write this blog for direct monetary gain, and it's usually pretty low on my daily priority list. Let’s face it, donalbrecht.com had all of 13 unique visitors last week. I’m not overheating any servers anywhere. I consider writing incredibly valuable over the long term; I just succumbed to the usual human failing of occasionally prioritizing short-term and pressing concerns over long-term, more indefinite returns.</p>
<p>Last week, I realized 2 things. First, I hadn’t done any reinvestment in this site since April. Second, one of the blogs I follow announced an attempt to deliver a post every day for August. If you’ve read this far, you realize I’m about to announce my intentions to do the same. Unfortunately, I need to be respectful of the reality of my life and the technical limitations of this site; therefore, this is not a commitment to publish a new post every day. Instead, I intend to publish a total of 31 articles between the start of the month and midnight on Saturday the 31st.</p>
<p>Towards that end, here are the rules I’m setting for myself.</p>
<ul>
<li>There must be a new post for every day of the month, even if they don’t appear daily.</li>
<li>The posts for a week must true up by the end of the day on each Sunday (5th, 12th, 19th, 26th & 2nd)</li>
<li>The average length of posts needs to exceed 500 words.</li>
</ul>
<p>Naturally, given the rather large volume of production I’m committing to, the content should skew much more towards narrative and opinion and away from technical experimentation. I also expect the quality of the posts will be a bit lower; I’ll likely be rewriting them at some point in the future with more polish and perspective. Luckily, given my traffic, I’ll honestly just be building a long tail of content for the search engines and not risking my reputation too broadly.</p>
<p>Lastly, I need to point out that this has nothing to do with the previous UN Vaccination awareness campaigns (<a href="http://blogust.org/about-blogust-2016/">http://blogust.org/about-blogust-2016/</a>). I wish I could claim some altruistic purpose for the exercise, but selfishly, it’s only intended to keep me moving forward and making progress.</p>
<h2>12 Factor Apps in Data Engineering</h2>
<p>2018-05-02, Don Albrecht</p>
<p>Most of the best practices embodied in 12-factor applications align between Data Engineering and Software-as-a-Service workloads. Unfortunately, the two disciplines have a handful of key differences that make one-size-fits-all solutions difficult. Going through the 12 factors one by one, I’ll try to identify the points of difficulty and concern.</p>
<ol>
<li><p><strong>One Codebase tracked in revision control, many deploys.</strong> This is an obvious one and is simply best practice. The only caveat might be how exploratory development is handled. It is exceedingly cumbersome to business users if you mandate that all queries and analytics scripts be checked into version control before execution. There needs to be either some type of sandbox for these efforts, or automated processes that hide the use of version control from the end user. Either way, it is usually impractical to maintain a durable history of all code executed in version control.</p>
</li>
<li><p><strong>Dependencies: Explicitly declaring all dependencies is a best practice; however, it may not be possible with your distributed computation environment.</strong> This most often appears with the need to ship packages to all compute nodes for execution outside of the JVM in Hadoop. Solutions to this problem do exist, but they tend to be quite cumbersome and often violate the conventions of native package handling systems. For example, shipping Python code up to HDFS works, but things fail if you normally include YAML or SQL files in your code distribution.</p>
</li>
<li><p><strong>Config in the Environment:</strong> Config should be kept out of code when it refers to a specific deployment. Config belongs with the code when it determines behavior. For example, a list of tables to ingest from an upstream database for a given deployment environment should be in code. The rule of thumb that any code should be open-sourceable at a moment’s notice without compromising any credentials should hold. (A minimal sketch of this split appears after this list.)</p>
</li>
<li><p><strong>Backing Services as Attached Resources</strong>. A backing service is still an attached resource in the Data Engineering world; the one exception is the need to share configurations of various stripes from the backing service to client applications. Usually, some type of shim needs to be put in place to support this without unfortunate deep coupling. For example, the ability to pull configurations from a central server like Ambari can allow slightly better abstraction.</p>
</li>
<li><p><strong>Strict separation of Build, Release and Run.</strong> These stages of the application life-cycle are much more difficult to keep separate, from a discipline perspective, with Data Engineering workflows. That doesn’t mean they can be ignored. The problem is that the early iteration cycle for most data workloads doesn’t work well with an involved release cycle. The other issue is that most data environments that support exploratory workloads don’t make a clear distinction between what should be allowed ad-hoc and what should require more rigorous deployment strategies. Without that clear separation, users can effectively touch prod without having to go through a deployment cycle.</p>
</li>
<li><p><strong>Data Processes are never stateless</strong>. It is impossible to consider a data process stateless unless you can afford to lose the entire transaction on the backing system along with its management code. This is a fundamental distinction between stateless web apps and all other systems. You can ape the appearance of a stateless server by effectively declaring your own stateful backing service, but the stateful service must still exist and it will be independent of the web app. Usually, by enforcing statelessness arbitrarily at the point of the web application, you are increasing the complexity of the infrastructure without significant benefit. Fundamentally, the user-facing tier in this architecture is not the point at which scaling problems will emerge.</p>
</li>
<li><p><strong>Port Binding</strong>. This is a best practice, but it should be noted that many of the services that might be needed in a Data Engineering environment may not cleanly handle port binding.</p>
</li>
<li><p><strong>Concurrency</strong> Scaling out via a process model should be the first choice in most cases. But don't reinvent the wheel here. A distributed system often benefits from an investment in more robust concurrency models. I'm not challenging the need for a simple concurrency pattern, but coordinating an Actor model often pays dividends. Don't rule out Akka, Erlang, Elixir, and Beam when you might actually need them. Even so, the fundamental rule still applies, you'll need to scale beyond one process, one node eventually. You should plan for that first, before getting clever within the process.</p>
</li>
<li><p><strong>Disposability</strong>. Graceful shutdown is perhaps the biggest problem in Data Engineering. It’s critical that jobs leave a clean state in the event of failure. Unfortunately, long-running processes carry a high cost of failure, so it's critical that work is componentized in such a way that the cost is controlled. Losing a 12 hr job sucks; corrupting data is inexcusable.</p>
</li>
<li><p><strong>Dev / Prod Parity</strong> I agree completely with a desire to achieve dev/prod parity in systems. Unfortunately, data environments don't map cleanly into continuous delivery environments. The obvious first reason is the scheduling of batch processing. A job that only runs once a day or once a week can only be deployed on the same schedule with any effect. Deploying more often is effectively a no-op. The other critical issue is ensuring you can develop proper business logic for analytics workflows without access to prod data. This can be a difficult transition for some teams but is absolutely critical in a number of security scenarios.</p>
</li>
<li><p><strong>Logs</strong>. Be prepared for verbosity. A lot of Big Data tools have special log handling to cope with the volume. There's nothing wrong with routing logs to an ELK stack or aggregation service, but be prepared for the volume (and cost). It's often necessary to implement supplemental services to filter or redirect logs within the deployment environment.</p>
</li>
<li><p><strong>Admin Processes</strong> One-off, repeatable scripts are critical for sanity in deployments. These are perhaps even more critical in a big data environment where your infrastructure may not be fully disposable. Disposable infrastructure enforces the need for this; long-running infrastructure encourages laziness, and that is the biggest problem. The pain points are somewhat masked until the smell is entrenched in your operations. In my experience, purging yourself of the stink is invariably painful and invariably frustrating.</p>
</li>
</ol>
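<p>To make factor 3 concrete, here is a minimal sketch of the split between deployment config and behavioral config. The variable names and table list are invented for illustration and aren’t taken from any real deployment.</p>
<pre>
# Minimal sketch of the factor-3 split, with invented names throughout:
# deployment-specific settings come from the environment, while
# behavior-determining settings live in code next to the pipeline.
import os

# Behavioral config: which tables the pipeline ingests. Checked into version
# control because it defines what the job does.
TABLES_TO_INGEST = ["orders", "customers", "shipments"]

# Deployment config: where this environment's warehouse lives and how to
# authenticate. Never checked in, so the code stays open-sourceable.
WAREHOUSE_URI = os.environ["WAREHOUSE_URI"]
WAREHOUSE_PASSWORD = os.environ["WAREHOUSE_PASSWORD"]

def ingest_all():
    for table in TABLES_TO_INGEST:
        # Placeholder for the real extract-and-load step.
        print(f"ingesting {table} into {WAREHOUSE_URI}")
</pre>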
<p>The 12-factor app is undoubtedly a great model for making smart choices in the design of web systems. Architecturally, it can inform the design of systems far removed from standard web app development and deployment. As often as I wish my Hadoop cluster was a bit cleaner in some ways, these ideas need a bit of adjustment before we push them to our big data environments.</p>
<p>If you're interested in diving deeper into 12-factor apps, I recommend starting here: <a href="https://12factor.net/admin-processes">https://12factor.net/admin-processes</a></p>
<h2>Stepping Away From Facebook</h2>
<p>2017-12-04, Don Albrecht</p>
<p>Over the summer I started to notice just how much time I was spending on Facebook. Partly, I think it was my reading of Ben Thompson's Stratechery that made me most acutely aware of it. Facebook capitalizes on attention; it succeeds on mobile by giving you something to do in stray moments. It was there on the train, in the elevator, sometimes even on the toilet. 2 years ago, it was a godsend. Facebook's feed provided me a lifeline when I was trapped in the house and too sick to do much of anything.</p>
<p>But, the drumbeat against social media from researchers has been pretty loud and I've heard it. In short, it seems Social Media is usually a net negative for you as an individual. It encourages a mutant form of keeping up with the Joneses and creates the empty illusion of connection with others. Yet it doesn't provide most of the benefits of true connection. It's shallow and often circumvents deeper conversations.</p>
<p>So mid-summer, I decided to wean myself off of Facebook and I took the simplest first step: I uninstalled the app from my phone. What shocked me is that this didn't work. Within a few days, I was spending more time on Facebook.com from my desk and I had started visiting the mobile site on the train each day. After a few weeks, I realized I needed to take more drastic action. So, I turned on parental controls on my phone and explicitly blocked Facebook. I continued to allow Messenger, but this second step seemed to work. I started spending more time chatting with people on my phone, reading, and writing. My incessant news consumption faded away in favor of my morning flash briefing (thank you Alexa) and a quick evening skim of Quartz.</p>
<p>I still consume my RSS feeds every few days, and work to stay on top of tech. After all, that's the Red Queen my career is tethered to. But the net impact on my mental health is tangible. I'm more present, more reflective. The regular writing is definitely good for me. More importantly, the deeper conversations have really helped me feel grounded and connected in a way the shallow stream of facebookland hasn't. Do I still read Facebook? Sure! I'm probably skimming it every day or two for a couple of minutes, but I'm looking mostly for the social news, weddings, and baby pictures; I haven't missed the listicles, memes, and advertisements.</p>
<h2>Treat Your Data Like Caviar</h2>
<p>2017-11-15, Don Albrecht</p>
<p>Recently, I was served Eleven Madison Park’s take on eggs Benedict: a truly decadent dish involving quail egg, ham, asparagus, and caviar, all served in an attractive Art Deco tin.</p>
<p>True to tradition, the utensil was an elegant Mother of Pearl spoon. The narrative behind the delicate spoon is that metal would damage the flavor of the premium caviar. But, when you think about it, this tradition doesn’t make the most sense. Surely, all the time the caviar spends in contact with the tin it was served in far outweighs the few seconds it would be in contact with my spoon.</p>
<p>The CAVIaR model for Data Governance is in many ways like the traditions around sturgeon roe. We want to protect and ensure the value of something precious and scarce: good data. I can’t take credit for developing the CAVIaR model; that credit goes to Anne Ahola Ward and her book "The SEO Battlefield." A quick google for the term doesn’t turn up much, which is a shame. Her model is quite clear and well-formed.</p>
<h2>What is the CAVIaR model?</h2>
<p>In short, the CAVIaR model ensures that data is Complete, Accurate, Valid and Restricted. That means you’ve done the due diligence and have the proper processes in place to ensure not just the integrity of your data, but its security and usefulness as well. We’ll dive into all four qualities in short order.</p>
<h3>Complete</h3>
<p>Is your data truly representative? Have you eliminated gaps? Did you work to remove selection bias from the collection process? Gaps are easy to find; selection bias is much more insidious.</p>
<h3>Accurate</h3>
<p>Are you measuring what you say you’re measuring? Are you over-scrubbing the data? What are the gaps in your measurement process that might miss or exaggerate data points? Are you properly scrubbing out noise, or are you accidentally eliminating significant outliers?</p>
<h3>Valid</h3>
<p>Did you really measure what you say you measured? Or are you actually tracking a roughly correlated measure? Do you have too much noise?</p>
<h3>Restricted</h3>
<p>In the current age, this is the most overlooked requirement. Are you properly controlling who can view and use the data? Do you have appropriate encryption on the data at rest? Are you tracking access? How will you know if there’s a data breach? What data do you have, and do you know where all the PII, PCI and HIPAA sensitive records are in your infrastructure?</p>
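<p>As a rough illustration of the first three checks, here is a minimal sketch of automated completeness and validity tests over a tabular extract. The column names and thresholds are invented; the Restricted quality is enforced by your platform (encryption, access controls, audit logs) rather than by a script like this.</p>
<pre>
# Minimal sketch with invented column names: basic Complete / Accurate / Valid
# checks to run before publishing a dataset downstream.
import pandas as pd

def check_caviar(df: pd.DataFrame) -> dict:
    checks = {}
    # Complete: no null keys and no missing days in the date range.
    checks["no_null_ids"] = df["record_id"].notna().all()
    days = pd.to_datetime(df["event_date"]).dt.normalize()
    expected = pd.date_range(days.min(), days.max(), freq="D")
    checks["no_date_gaps"] = days.nunique() == len(expected)
    # Accurate / Valid: values fall inside plausible ranges instead of being
    # silently scrubbed away or left as obvious garbage.
    checks["amounts_in_range"] = df["amount"].between(0, 1_000_000).all()
    return checks

# Usage sketch: report = check_caviar(pd.read_parquet("events.parquet"))
</pre>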
<p>All of these tests are a necessity. They should provide the foundation for any data operations you perform. In the modern enterprise, data is as precious as fine osetra caviar and should be given the same level of respect and tradition.</p>