Starting off a series where I hope to write more as I get back into the MLOps space after a brief stint cosplaying as a DevOps + Data Engineer at my previous role.

Now I’m back working on things that I’ve worked on pretty much all my career, MLOps. And I’m kinda excited about the prospects of working towards a platform-oriented way of serving models and services at Nomad Health.

And with that comes roadmapping the vision and outlining the goals for the ML platform. And, more to the point, deciding whether to build or to buy.

The book Designing Machine Learning Systems by Chip Huyen narrows the scope of considerations down to 3 broad categories:

  • The stage of the company
  • Competitive advantage for the company
  • Maturity of the available tools

I’m not sure if I’m very good at distilling things on my mind like Chip does. It is a remarkable skill to have for sure. One I’m very envious of, I might add. But as it stands, I’m more of a story person. So instead what I’m going to do is try to tell stories of my past experiences. And hope that you, the reader, come away with something useful.

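There are essentially two stories:

  • Story 1: The One Where We Build Everything
  • Story 2: The One Where We Buy Everything
  • Story 3: The One Where I Start Working at the company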

And yes, I do know how to count.

The third one is something I added for fun to justify my own existence in life. Not to be taken seriously. But it might give you some life lessons or help you in your quest for finding the build-vs-buy answer as well. I don’t know, read on and let me know :P

Issue 1: Build vs Buy decisions for MLOps

Now, a LOT of the build vs buy considerations for MLOps are similar to what you would find for generic software engineering. However, the state of MLOps tooling in general is going to make this a little bit interesting.

And the super simple, really short answer is: IT DEPENDS.

A very senior-engineer-like answer. It depends. But it’s true.

And also, the stories might look like they cover everything, but they don’t really. It’s more like 90% of things.

Story 1: The One Where We Build Everything

This company didn’t really set out to build everything on their own. But kinda had to. So we’ll call this company Stadia. No reference or callback to Google’s Stadia.

The number one reason why they had to build things on their own was data constraints from their customers. The very tight data contracts meant that data of any form could not be sent to a cloud/platform provider. This obviously meant that there were a lot of engineering cycles invested into building and re-building components of the MLOps system.

The first time you ever build out a feature store, you build one that’s very simple. And then your needs outgrow it and you need to re-architect and re-build to accommodate the new scenario. And similarly for other components of the system.

Team Size

Now Stadia, being a relatively bigger company, could afford to spend engineering cycles on these projects. It could even spawn enough engineers to staff a complete MLOps team, bigger than some of the PaaS startups in the space. And that makes an incredible difference to the way they do things and the way they approach build-vs-buy.

Now contrast that with a much smaller engineering team. If you have 2-3 engineers doing a lot of feature work for the MLOps platform and you have the same amount of lift, you’re creating a recipe for burnout. Doing more with less is actually not something that’s possible.

I really should have more memes at this point. Damn it, maybe the next one will have memes. I promise.

Well, we’ll see.

Team Skillsets

One other thing is the difference in team skills and coverage. Now most of the tooling that’s built doesn’t have very good abstractions to cleanly separate the platform from the end-user APIs. And the leaky abstractions mean that data scientists have to learn much more of the stack than intended. Now this is true for software engineering and data science alike. Nobody should expect SWEs or Data Scientists to learn Kubernetes.

Learning a new layer of the stack and wrestling with how it works, and more importantly how it fails, takes a monumental effort. Debugging simple things is now a huge pain point. Iteration cycles slow, releases take longer.

Leaky abstractions are not specific to in-house solutions. Most tools and services in the MLOps space do expose some of the underlying infrastructure to the end-user. The maturity aspect that Chip talks about applies here. But one can hope that as the space matures the abstractions become better. But this is definitely not a priority for in-house solutions where building a business takes precedence over nice-to-have, quality of life projects.
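To make the leaky abstraction point a bit more concrete, here’s a rough sketch of what a cleaner separation could look like. Everything in it is hypothetical (the `ModelSpec` and `InfraConfig` names, the model URI, and especially the sizing rule); it isn’t how any particular tool or company does it. The idea is just that the data scientist only fills in the top part, and the platform works out the infrastructure bits.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSpec:
    """User-facing side: what a data scientist should need to say."""
    name: str
    model_uri: str
    expected_requests_per_second: int = 10


@dataclass
class InfraConfig:
    """Platform side: derived settings that end users never write by hand."""
    replicas: int
    cpu: str
    memory: str
    labels: dict = field(default_factory=dict)


def plan_deployment(spec: ModelSpec) -> InfraConfig:
    """Translate a user-facing spec into infrastructure settings.

    The sizing rule (one replica per 50 requests/second) and the fixed
    cpu/memory values are made up purely for illustration.
    """
    # Ceiling division without importing math.
    replicas = max(1, -(-spec.expected_requests_per_second // 50))
    return InfraConfig(
        replicas=replicas,
        cpu="500m",
        memory="1Gi",
        labels={"model": spec.name},
    )


if __name__ == "__main__":
    # Hypothetical model name and URI.
    spec = ModelSpec(
        name="churn",
        model_uri="s3://example-bucket/churn/1",
        expected_requests_per_second=120,
    )
    print(plan_deployment(spec))
```

The abstraction leaks the moment a data scientist has to open up `InfraConfig` to figure out why their model isn’t serving, which is roughly the situation described above.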

The overall cognitive strain on the team is a deterrent to doing productive work. Adopting or building tools that fit the team’s skillset is a huge factor in build-vs-buy decisions.

Story 2: The One Where We Buy Everything

Now imagine a super small team, mostly with Data Scientists and Machine Learning Engineers. Not a lot of people with Infra experience. Now this is when you buy pretty much everything.

Let’s call this company Acadia. Kinda rhymes with Stadia, right? So the data science team within Acadia is pretty new. The pressure is on the team to produce value for the business. So first you establish ROI for the business. And then you look to expand the team and start prioritizing quality of life for them.

At some point, issues around iteration cycles and slow model development and releases are going to prevent the team from scaling out. The opportunity to expand and address more of the business problems is there, but the platform doesn’t support it.

The 70% Problem

Now you need to hire more engineers to begin to build out a platform. Now this is a curious situation because you’re not really buying everything at this point. You already have a slew of tools and services you’re using. But there’s no sense of a unifying platform to scale out the impact of data science across different aspects of the business. This is sadly the state of MLOps services out there. Even if you buy your way through every problem, there’s still that gap that you need to fill for the last mile.

I call this the 70% problem in MLOps. Most tools out there only solve 70% of a subset of a problem. This is primarily because MLOps is still very young. There’s so much opportunity for startups and that’s partly why you see good funding rounds for companies attempting to solve problems in this space.

The fact is that you can buy a solution for every piece of the MLOps puzzle and still not get close to where you want to be. You still need Platform Engineers to connect the last mile. It’s a very long mile.

Story 3: The One Where I Start Working at the company

Now that brings us to where Platform Engineers come into the picture. It’s kind of similar to the Software Engineers who love working in the space between product-market fit and scale-out.

Platform Engineers in MLOps make the most sense, and have a very visible impact, when you have a small Data Science team that has established business value but is trying to scale out to broaden its impact across the organization.

I realized I was kind of made for this.

[Meme: Hulk, “Made for This”]

Ok I did it, I included one meme for y’all.

The scale-out

So it’s going to be a healthy mix of trying to see which components of the MLOps system to buy and which ones to build. Would owning parts of the system give the company an advantage over the competition? Would buying a simple solution or using a cloud-provider service ease the overhead of maintenance?

Improving velocity and the number of models in production is the sole goal here. And it’s honestly kind of exciting to be just focussing on developer velocity and happiness.
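If you squint, the questions from the stories can be written down as a tiny per-component triage. This is a toy sketch and definitely not a real framework: the `Component` fields, the `suggest` function, and the default of 4 engineers are all my own assumptions for illustration. The real answer is still “it depends”.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    # Story 1: tight data contracts may forbid sending data to a provider.
    data_can_leave_company: bool
    # Would owning this piece give the company an edge over the competition?
    is_competitive_advantage: bool
    # People available to build AND maintain it.
    platform_engineers_available: int


def suggest(component: Component, min_engineers_to_build: int = 4) -> str:
    """Return "build" or "buy" for one component of the MLOps system.

    The threshold is arbitrary; it only encodes the idea that 2-3
    engineers carrying a full build is a recipe for burnout.
    """
    if not component.data_can_leave_company:
        # No vendor can touch the data, so there is nothing to buy.
        return "build"
    if (
        component.is_competitive_advantage
        and component.platform_engineers_available >= min_engineers_to_build
    ):
        return "build"
    # Otherwise buy, and budget for bridging the last mile yourself.
    return "buy"


if __name__ == "__main__":
    # Hypothetical components, just to exercise the function.
    for c in [
        Component("feature store", False, False, 1),
        Component("model serving", True, True, 2),
        Component("ranking models", True, True, 5),
    ]:
        print(c.name, "->", suggest(c))
```

Even when this says “buy”, the 70% problem still applies, so someone has to own the glue.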

This section is kinda open-ended now but I will come back to it every 5-6 months to update.

So until the next one, happy building/buying and bridging!

P.S. Didn’t even proof-read this, we’re just YOLOing here. So no apologies or anything, nobody’s paying me for this, I’m just having fun writing :D

P.P.S. And if you learn something in the process, I’m happy.