3 reasons why Software Engineers have on-call duties, but Data Scientists don't.
And it's not because Data Scientists have easier problems to solve
Ever wondered why software engineers groan about their on-call rotations while data scientists seem to sleep peacefully through the night?
This question came up during a chat with my counterpart Principal Engineer a few weeks ago. He’d never worked with data scientists before and naturally assumed we’d have on-call duties. My response? A polite but firm ‘nope.’
Here's my personal view on why I pushed back on adding our data scientists to the on-call rota. First, data scientists build systems to fail gracefully. Second, problems like model drift aren't 3 AM fixes. And third, ML issues often need team-wide coordination.
In this article, I'll dive into these three key differences and explore what makes on-call duties unnecessary - and sometimes outright counterproductive - for data scientists. I'd also appreciate your view as a software engineer. I also write about health, and health is probably the opposite of what on-call duties provide.

View #1. Data Scientists design their systems for graceful degradation.
As a data science lead, my teams handle production-grade traffic through our machine learning models in two ways: batch inferencing and real-time traffic. Let's focus on real-time scenarios, such as a system that ranks hotels when users search by city and dates.
When things go wrong:
For Data Scientists: Our systems are designed with built-in fallbacks. If our primary model (say, v54) starts showing high error rates due to feature store connectivity issues, the system automatically falls back to a simpler, more robust "failsafe" version. This failsafe might not be as sophisticated as our latest model, but it's battle-tested and reliable. No midnight phone calls needed - the system handles the degradation gracefully.
For Software Engineers: Issues often require active human decision-making and intervention. When a service fails, someone needs to assess the situation, determine the appropriate fix, and implement it safely. There's rarely a "flip the switch to failsafe" option - each problem might need a unique solution.
Our ML systems are built expecting things to go wrong and have predetermined fallback strategies, while software systems often need human judgment to resolve issues.
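The fallback pattern above can be sketched in a few lines. This is a minimal illustration under assumed names, not our production code: the model classes, the popularity field, and the error-handling policy are all hypothetical.

```python
# Minimal sketch of graceful degradation for a real-time ranking service.
# All names (PrimaryModel, FailsafeModel, the popularity field) are hypothetical.

class PrimaryModel:
    """Stands in for the latest model (e.g. v54) behind a feature store."""
    def score(self, hotels, query):
        # Simulate the feature-store connectivity issue described above.
        raise ConnectionError("feature store unreachable")

class FailsafeModel:
    """Simple, battle-tested ranker with no external dependencies."""
    def score(self, hotels, query):
        # e.g. rank by a precomputed popularity score shipped with the service
        return sorted(hotels, key=lambda h: h["popularity"], reverse=True)

class RankingService:
    def __init__(self, primary, failsafe):
        self.primary = primary
        self.failsafe = failsafe

    def rank(self, hotels, query):
        try:
            return self.primary.score(hotels, query)
        except Exception:
            # Degrade gracefully: no page, no human in the loop.
            return self.failsafe.score(hotels, query)

service = RankingService(PrimaryModel(), FailsafeModel())
hotels = [{"name": "A", "popularity": 0.2}, {"name": "B", "popularity": 0.9}]
print(service.rank(hotels, query={"city": "Lisbon"}))
```

The design choice is that the failsafe path is chosen ahead of time, so the decision at failure time is mechanical rather than human.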
View #2. The nature of debugging differs significantly.
Software issues often have clear symptoms – error logs, monitoring alerts, and user reports provide immediate feedback. Data scientists face murkier challenges: model drift, data quality issues, or statistical anomalies that require careful investigation and validation. These aren't problems best solved at 3 AM with bleary eyes.
For example, what happens when model performance degrades? Is it because user behaviour has changed? Has data quality dropped? Or is the model itself no longer capturing important patterns? Answering these questions requires analysing trends over days or weeks, running A/B tests, and collaborating with business stakeholders to understand market changes. Unlike a crashed service that needs immediate restoration, these investigations benefit from thorough analysis during business hours, when we have access to the full context and team expertise.
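To make the "days, not minutes" point concrete, here's a toy sketch of a drift check that compares a rolling average of a metric against a baseline. The metric (AUC), the window size, and the tolerance are assumptions for illustration, not production values.

```python
# Hypothetical sketch: distinguishing gradual model drift (investigate during
# business hours) from a healthy model. Thresholds and windows are made up.
from statistics import mean

def drift_report(daily_auc, baseline_auc, window=7, tolerance=0.02):
    """Compare the recent rolling average of a metric against a baseline.

    Returns a human-readable verdict rather than paging anyone: drift is
    investigated over days, not fixed at 3 AM.
    """
    recent = mean(daily_auc[-window:])
    drop = baseline_auc - recent
    if drop <= tolerance:
        return f"OK: recent AUC {recent:.3f} within tolerance of baseline {baseline_auc:.3f}"
    return (f"Investigate: AUC dropped {drop:.3f} below baseline - "
            "check user behaviour, data quality, and feature pipelines")

# Two weeks of made-up daily AUC values drifting slowly downwards.
daily_auc = [0.81, 0.81, 0.80, 0.80, 0.79, 0.79, 0.78,
             0.78, 0.77, 0.77, 0.76, 0.76, 0.75, 0.75]
print(drift_report(daily_auc, baseline_auc=0.81))
```

Note that the "Investigate" outcome deliberately lists hypotheses to check, not a fix to deploy: that is the whole difference from a crashed service.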
View #3. Fixes for ML issues have a longer lag.
Even if a Data Scientist were on-call and spotted a data issue or a stalled update pipeline, these problems tend to have broader dependencies across the organisation. Data owners probably won't be on-call either, so there is no point in knowing at 3 AM that an upstream ETL you depend on has failed.
This is why ML system issues are better handled during business hours with a coordinated response, rather than through middle-of-the-night firefighting. The "fix" usually isn't a quick code deployment - it's a series of collaborative decisions and actions across multiple teams.
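A hypothetical sketch of what that coordinated response looks like in code: a freshness check on an upstream table that files a business-hours ticket instead of paging anyone. The 24-hour SLA and every name here are assumptions for illustration.

```python
# Hypothetical sketch: route stale-upstream-data findings to a daytime queue
# instead of an on-call pager. The 24-hour freshness SLA is an assumption.
from datetime import datetime, timedelta, timezone

def check_upstream_freshness(last_updated, sla_hours=24, now=None):
    now = now or datetime.now(timezone.utc)
    age = now - last_updated
    if age <= timedelta(hours=sla_hours):
        return {"action": "none", "age_hours": age.total_seconds() / 3600}
    # The fix sits with the upstream data owners, so paging our own
    # on-call would not speed anything up - file a ticket instead.
    return {"action": "file_ticket", "age_hours": age.total_seconds() / 3600}

now = datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)      # 3 AM check
stale = datetime(2024, 4, 29, 12, 0, tzinfo=timezone.utc)  # 39 hours old
print(check_upstream_freshness(stale, now=now)["action"])
```

The point of the sketch is the routing decision, not the check itself: the outcome is a ticket for the owning team, which is exactly the kind of cross-team coordination that only works during business hours.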
The takeaways
Data Scientists build ML systems to handle failure gracefully, using predetermined fallback strategies that reduce the need for midnight interventions.
Debugging ML problems requires deep analysis and collaboration—tasks best handled during business hours when the full team is available.
Fixing ML issues often involves upstream dependencies and broader organisational coordination, making on-call duties less effective for Data Scientists.
The key difference lies in system design: Data Scientists prepare for uncertainty, while Software Engineers often need immediate, human-led responses.
Further reading
I hope you have found this content useful. Let me know in the comments if you have faced these situations before.
For more content about Machine Learning, Data visualisation and Data Science team management, subscribe to my newsletter.