Presented by Atlassian
We discussed maintenance previously, but mainly in terms of how we design and develop software.
Maintenance isn't just about building new features; it's also about keeping the existing ones running smoothly.
Software operations is a broad subfield of Software Engineering.
"The goal of software operations is to ensure that applications function reliably, are available when needed, and continue to meet user requirements over time" - IEEE Computer Society.
Aspects of software operations include:
At a basic level, the tools you use to check whether software is working in production are the same ones you would use when running it locally - logs and errors.
How you access the logs for your production application depends on the deployment provider.
Other factors, like how long logs are retained, also depend on the deployment provider.
Using logs, you can diagnose errors with your software in production as you would locally.
But what if there are lots, and lots of logs...?
These logs are structured using OpenTelemetry, a common standard that defines log structure to make logs easily searchable at large scales.
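As a rough sketch, a structured log record might look like this - the field names loosely follow the OpenTelemetry log data model, though real records carry more fields:
type LogRecord = {
  timestamp: string
  severityText: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR'
  body: string
  attributes: Record<string, string | number>
  traceId?: string // lets you correlate the log line with a request trace
}

const record: LogRecord = {
  timestamp: '2024-05-01T10:32:05.123Z',
  severityText: 'ERROR',
  body: 'Failed to create kahoot',
  attributes: { 'http.status_code': 500, 'user.id': 'u-123' },
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
}

// Every record has the same shape, so a log platform can index the attributes
// and answer queries like: all ERROR logs with http.status_code 500 in the last hour
console.log(JSON.stringify(record))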
Sometimes, we just want to know how we are performing in aggregate - without getting into specific details.
Metrics allow us to measure specific values that we care about and build real-time charts showing us these numbers.
Metrics help us answer questions like:
Some of these values are provided to you by your deployment provider - others, you need to instrument yourself.
This chart shows a series of requests made, broken down by HTTP status code.
The ability to visualise how our software is operating helps us understand it much better, particularly as traffic grows.
This is what the code to instrument the numbers behind this chart might look like:
type Metrics = {
  increment: (metric: string, params?: unknown) => void
}

function askLLM(question: string, metrics: Metrics) {
  // Make a call to the LLM API
  const answer = llmApiCall(question)
  // Increment the metric 'llm-api-call', tagging the HTTP status code
  metrics.increment('llm-api-call', { status: answer.response.status })
  return answer
}
Rather than manually checking the metrics every day, we can use tools that automatically notify us when a metric spikes outside what we consider normal.
alarm.when(error5xx.count >= 100).over('60m')
It's important to think properly about what constitutes "normal" for your software's behaviour.
E.g. is it a good idea to set an alarm every time there is a 403 error from your API?
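Expressed in the same alarm pseudocode as above, an alarm like this would page someone for every stray 403, even though nothing may be wrong with the service itself:
alarm.when(error403.count >= 1).over('5m')
A 403 usually means a client lacks permission - often user error or an expired token - so alerting on each one would mostly create noise.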
Observability is the formal term for what we've been discussing.
Observability is the ability to understand a system's internal state by examining its external outputs (logs, metrics and traces).
Some popular observability tools include:
When traffic is heavy and constantly changing, it can be hard to know how to configure alarms/detectors so they tell us when things are genuinely going wrong.
Software will always have errors, timeouts, and other anomalous conditions. Rather than trying to fix every possible error, we aim for everything to work as expected most of the time.
How you define "most" depends on the amount of risk you are willing to tolerate.
That's why, at large scales, we typically measure overall operations as a percentage (%).
We also measure based on user-centric experiences (user stories).
For example: What percent of users were successfully able to create a quiz question?
Traditionally, a EULA (End User License Agreement) will contain a series of SLAs - Service Level Agreements.
In these SLAs, the company/service makes promises to the customer such as "We guarantee that our LLM API will be available 99.99% of the time".
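To put that number in perspective: 99.99% availability leaves a budget of only about 52 minutes of downtime per year (0.0001 × 365 × 24 × 60 ≈ 52.6 minutes), while 99.9% allows roughly 8.8 hours.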
Each SLA maps to an SLO - a Service Level Objective - which the service must measure and monitor.
Service Level Objectives describe capabilities that the service provides to customers and that must function as expected for a given percentage of the time.
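For example, to measure an objective like "kahoot creation succeeds 99.9% of the time", we can count successes and failures separately: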
type Metrics = {
  increment: (metric: string, params?: unknown) => void
}

function createKahoot(metrics: Metrics) {
  try {
    // Create the kahoot
    // ...
    // Kahoot was created successfully
    metrics.increment('kahoot-create-success')
  } catch (error) {
    // Kahoot failed to create
    metrics.increment('kahoot-create-failed')
    // Re-throw so the failure still reaches the caller
    throw error
  }
}
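With those two counters in place, the success percentage we report against the objective is just a ratio. A minimal sketch - the queryMetric helper here is hypothetical, standing in for whatever query API your metrics platform provides:
// Hypothetical helper: returns a counter's total over a time window
declare function queryMetric(name: string, opts: { window: string }): Promise<number>

async function reportCreateSlo() {
  const succeeded = await queryMetric('kahoot-create-success', { window: '30d' })
  const failed = await queryMetric('kahoot-create-failed', { window: '30d' })
  // Success rate as a percentage of all attempts
  const successRate = (succeeded / (succeeded + failed)) * 100
  // Compare against the objective, e.g. 99.9%
  console.log(`kahoot-create: ${successRate.toFixed(2)}% (SLO met: ${successRate >= 99.9})`)
}
Other objectives are about latency rather than success: how long an operation takes. For those, we instrument a timer: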
type Metrics = {
  timer: (metric: string) => { start: () => void; finish: () => void }
}

function loadKahootWithMetrics(metrics: Metrics) {
  const timer = metrics.timer('load-kahoot')
  // Wrap the function call in a timer so we know how long it took
  timer.start()
  try {
    loadKahoot()
  } finally {
    // Stop the timer even if loadKahoot throws
    timer.finish()
  }
}
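In practice, timings recorded like this feed a distribution, so you can chart percentiles (e.g. p50 or p99 load time) and set objectives such as "99% of kahoots load in under 2 seconds".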
How you operate your software will depend entirely on how the software is expected to be used by your customers.
For example, a trading company does not need to operate its software 24/7 - it only needs to make sure it works during trading hours.
Technology that is expected to have 24/7 uptime (i.e. it is used globally, or by users at all times) will often have a corresponding operation schedule.
This brings with it a series of rituals:
If you are a company that cares about having its software work consistently, then you probably need to pay someone to be on call 24/7 in case things break.
When the stakes are lower, you can afford to leave issues to the next working day. Sometimes it is worth waking someone up at 3am to fix an issue though!
Atlassian provides a tool for on-call operations called Jira Service Management.
The on-call engineer is notified of conditions such as:
Different issues have different priorities (P-levels):
The on-call engineer's job is to:
They can do this by looking at logs and metrics and examining the source code.
The on-call engineer often doesn't fully understand the issues they are investigating.
To solve this, we provide runbooks, which give step-by-step instructions on what to do when part of the service has an issue.
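A (hypothetical) runbook entry for the kahoot service might look like:
Alert: spike in 5XX errors
1. Check the deployment history - was something released recently? If so, roll it back.
2. Check the status pages of upstream dependencies (database, LLM API).
3. Find the most common error message in the logs and match it against the known issues below.
4. If none of the above resolves it, escalate.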
POV: You are on call for your kahoot service, which is being used by real people - and you got alerted about a spike in 5XX errors.
You've investigated the issue - looked at the logs, checked the metrics, followed the runbook - and still can't figure out what is going wrong. The 5XX errors keep coming.
It's time to raise an incident. This process depends on your company - at Atlassian we call them HOTs.
An incident is an unplanned interruption or reduction to your quality of service. This incident can impact customers, revenue, or even safety in the worst case.
Incident management is a separate topic - but in a nutshell, it involves:
Some large-scale incidents span over days or even weeks, if customer agreements have been breached and the company needs to recover data or patch a major security issue.
TechOps is a ritual where the team looks at the overall performance of their software over the last week. It is often done as a handover between engineers on on-call shifts.
In the TechOps meeting, the team looks at things such as:
The team also comes up with action items to address or investigate issues identified during the ops meeting.
Your software likely has a series of costs that you also need to monitor and control. Costs include:
In university or small-scale projects, these are not much of a concern, since you are using free tiers or the university is paying. As soon as you start building real software, you will need to consider costs.
Costs should be monitored as part of the weekly TechOps ritual, and alerts should be in place for anything that might cause a spike in costs.
E.g. There is a spike in calls to your LLM API, which will cost you $$$ since you are charged per input token.
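One way to catch this early is to treat cost like any other metric: instrument the thing that costs money, then alarm on it. A sketch, reusing the Metrics type from earlier - the usage field is an assumption about the LLM API's response shape:
type Metrics = {
  increment: (metric: string, params?: unknown) => void
}

// Assumed response shape: the LLM API reports how many tokens we were billed for
type LLMAnswer = { text: string; usage: { inputTokens: number } }
declare function llmApiCall(question: string): LLMAnswer

function askLLM(question: string, metrics: Metrics): LLMAnswer {
  const answer = llmApiCall(question)
  // Record billed tokens so cost can be charted - and alarmed on, e.g.
  // alarm.when(llmInputTokens.sum >= 1_000_000).over('60m')
  metrics.increment('llm-input-tokens', { count: answer.usage.inputTokens })
  return answer
}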
Part of operating software is keeping it secure.
Your code will often rely on other code (npm packages), which may have security vulnerabilities. When these are discovered and fixed, it's important to update your package versions - tools such as npm audit can scan your dependency tree for known vulnerabilities.
The CVE (Common Vulnerabilities and Exposures) program is a US government-backed initiative that identifies and documents publicly disclosed security vulnerabilities: https://www.cve.org/
A well-known recent CVE was Log4Shell (CVE-2021-44228), a vulnerability in the Log4j Java logging library that allowed attackers to execute arbitrary code remotely. https://nvd.nist.gov/vuln/detail/cve-2021-44228