May 15, 2026 · 13 min read

Machine Learning Experiments Need Observability

KerasTuner handles search, but experiment review needs a lightweight observability layer around tuning.

Machine learning experiments need more than model code. They need observability.

Hyperparameter tuning is usually presented as a modelling activity: define the search space, run the tuner, select the best result.

In practice, that is only part of the work.

The more important question is often: can we understand how that result was produced?

For individual experimentation, that means being able to inspect learning curves, compare trial behaviour, and understand whether a model is genuinely improving or simply producing a lucky score. For teams, it means being able to review experiments, reproduce decisions, and explain why one configuration was chosen over another.

That visibility is easy to underestimate until it is missing.

While working with KerasTuner, I found that the tuning engine itself was useful, but the surrounding workflow was fragmented. Trial outputs were spread across folders, comparison data required manual aggregation, and live inspection usually meant watching terminal logs or repeatedly reopening CSV files.

So I built a small workflow stack around KerasTuner to make experiments easier to observe, compare, and explain.

kt-masterlog (Logging): captures trials, epochs, hyperparameters, metrics, and metadata into one master CSV
kt-masterviz (Visualisation): provides a live dashboard for inspecting tuning progress and learning curves
kt-masterdemo (Demonstration): shows the end-to-end workflow using a neutral toy dataset

The workflow is simple:

tune -> log -> inspect -> compare -> explain -> iterate

The problem: tuning results are often harder to inspect than they should be

KerasTuner is useful because it handles the mechanics of hyperparameter search. It lets you define a search space, run trials, and identify promising configurations.

However, once an experiment starts producing results, the surrounding workflow can become awkward.

The data you need is often spread across different places. Some of it lives in trial directories. Some appears in logs. Some can be recovered through custom scripts. If you want a clean comparison across trials, you often end up writing another pandas aggregation script. If you want to understand how each trial behaved across epochs, you need to extract, reshape, and plot the data yourself.

That is manageable once. It becomes wasteful when repeated.

More importantly, it weakens the quality of the experiment review process.

A tuning run should not only answer:

Which trial had the best score?

It should also help answer:

How did that trial behave?
Was the validation performance stable?
Did the model overfit?
Were some configurations consistently weak?
Did the search space itself make sense?
Can I explain why this configuration was selected?

Those questions matter because model selection is not just a scoreboard exercise. It is a decision-making process.

If the evidence behind the decision is hard to inspect, the decision becomes harder to trust.

Why this matters beyond individual development

For a solo developer or student, poor experiment visibility is annoying.

For a technical team, it becomes a delivery problem.

For an AI or ML function, it becomes a governance problem.

When experiments are difficult to inspect, teams lose time reconstructing what happened. Results become harder to compare. Decisions become harder to explain. Knowledge stays inside notebooks, terminal output, local folders, or the memory of whoever ran the experiment.

That does not scale well.

As AI work moves from isolated experimentation into team-based delivery, the surrounding workflow starts to matter much more. The model code is only one part of the system. The process around it also needs to support:

repeatability
comparison
reviewability
evidence capture
learning from failed trials
communication between technical and non-technical stakeholders

This is especially important when model decisions need to be explained to people who were not involved in the original experiment.

A Head of Engineering, AI Lead, technical reviewer, or governance stakeholder does not necessarily need to inspect every line of model code. But they do need confidence that the experiment was run in a way that can be understood, reviewed, and defended.

That is the broader principle behind this stack.

It is not just about logging metrics. It is about making the experiment easier to reason about.

The stack: logging, visualisation, demonstration

The stack has three parts.

kt-masterlog is the writer. It captures the tuning process into one structured master CSV.

kt-masterviz is the reader. It turns that CSV into a live dashboard for inspecting trials and learning curves.

kt-masterdemo is the demonstration layer. It shows the workflow end-to-end using a small neutral toy dataset.

Each part has a deliberately narrow responsibility.

kt-masterlog: write structured experiment logs
kt-masterviz: read and visualise compatible logs
kt-masterdemo: demonstrate the workflow end-to-end

The packages are connected through files, not tight internal dependencies.

That design choice keeps the system simple. The logger does not need the dashboard. The dashboard does not need to import the logger at runtime. The demo does not exist to prove model performance. It exists to show how the pieces work together.

The result is a small but coherent workflow:

run a sweep
capture the evidence
inspect the curves
compare the trials
explain the decision
iterate with better information

kt-masterlog: creating a single experiment record

kt-masterlog started from a simple requirement: I wanted one flat, readable, analysis-friendly record of a tuning run.

The core idea is to write one row per epoch, per trial.

That row can include:

trial ID
epoch number
hyperparameters
training metrics
validation metrics
objective score
optional metadata

Instead of having to reconstruct the experiment from scattered outputs, the entire run becomes available as a single CSV.

That sounds basic, but it changes the workflow.

A CSV can be opened directly, loaded into pandas, inspected in a notebook, stored as an artefact, or passed into another tool. It becomes a simple evidence layer for the experiment.
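To make the one-row-per-epoch idea concrete, here is a minimal sketch of a writer. It is not the kt-masterlog implementation: the class name EpochRowLogger, the column names (trial_id, epoch, the hp_ prefix), and the assumption that every trial shares the same hyperparameter names are all illustrative. It mirrors the shape of the Keras on_epoch_end(epoch, logs) hook but deliberately does not subclass the Keras callback, so it runs with the standard library alone.

```python
import csv
import os
import tempfile


class EpochRowLogger:
    """Hypothetical stand-in for a per-epoch logger; not the kt-masterlog API."""

    def __init__(self, csv_path, trial_id, hyperparameters, metric_names):
        self.csv_path = csv_path
        self.trial_id = trial_id
        self.hyperparameters = dict(hyperparameters)
        self.metric_names = list(metric_names)
        # Fixed column order so every row in the file has the same shape.
        self.fieldnames = (
            ["trial_id", "epoch"]
            + [f"hp_{name}" for name in sorted(self.hyperparameters)]
            + self.metric_names
        )

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        row = {"trial_id": self.trial_id, "epoch": epoch}
        row.update({f"hp_{k}": v for k, v in self.hyperparameters.items()})
        # Metrics missing from `logs` are left blank via `restval`.
        row.update({k: logs[k] for k in self.metric_names if k in logs})
        # Write the header only when the file is new or empty.
        is_new = not os.path.exists(self.csv_path) or os.path.getsize(self.csv_path) == 0
        with open(self.csv_path, "a", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=self.fieldnames, restval="")
            if is_new:
                writer.writeheader()
            writer.writerow(row)


# Simulate two trials of two epochs each, all appended to one master CSV.
csv_path = os.path.join(tempfile.mkdtemp(), "master.csv")
for trial_id, lr in [("t0", 0.01), ("t1", 0.001)]:
    logger = EpochRowLogger(csv_path, trial_id, {"learning_rate": lr}, ["loss", "val_loss"])
    for epoch in range(2):
        logger.on_epoch_end(epoch, {"loss": 1.0 / (epoch + 1), "val_loss": 1.2 / (epoch + 1)})

with open(csv_path, newline="") as handle:
    rows = list(csv.DictReader(handle))
```

Because each row repeats the trial's hyperparameters, any single row is self-describing, which is what makes the file easy to filter and group later.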

The package includes several core components:

MasterEpochLogger: Keras callback that writes one flat row per epoch per trial
make_logging_tuner(): dynamically wraps KerasTuner strategies to inject the logger
TunerConfig: serializable tuning configuration
optimize(): higher-level orchestration helper
TuningResult: structured result object containing model, hyperparameters, timing, paths, and summary data

One important design decision was to avoid creating separate subclasses for each KerasTuner strategy.

Instead, make_logging_tuner() dynamically subclasses the selected tuner strategy and injects logging behaviour. That means switching between Bayesian optimisation, Hyperband, Random Search, or another compatible strategy becomes more of a configuration choice than a code restructuring exercise.

This matters because experiment infrastructure should not make iteration harder.

If someone wants to change the search strategy, they should not have to rewrite the plumbing around the experiment. The logging layer should remain stable while the tuning strategy changes.

That is the role of kt-masterlog: keep the experiment record consistent, even as the experiment itself evolves.
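The dynamic subclassing described above can be sketched with plain Python. This is an illustration of the technique, not the code behind make_logging_tuner(): the two Fake classes are stand-ins for tuner strategies, make_logging_strategy is a hypothetical name, and the run_trial signature is simplified compared with KerasTuner's overridable method of the same name.

```python
class FakeRandomSearch:
    """Stand-in for a tuner strategy; real KerasTuner classes are not used here."""

    def run_trial(self, trial_id, callbacks=None):
        # Pretend to train for two epochs and notify each callback.
        for epoch in range(2):
            for callback in callbacks or []:
                callback(trial_id, epoch)
        return {"trial_id": trial_id, "strategy": type(self).__name__}


class FakeHyperband(FakeRandomSearch):
    """Second stand-in strategy, to show the factory is strategy-agnostic."""


def make_logging_strategy(base_cls, log_fn):
    """Return a subclass of base_cls whose run_trial always includes log_fn."""

    def run_trial(self, trial_id, callbacks=None):
        callbacks = list(callbacks or []) + [log_fn]
        # Call the base class explicitly; zero-argument super() is not
        # available in a function defined outside a class body.
        return base_cls.run_trial(self, trial_id, callbacks=callbacks)

    return type(f"Logging{base_cls.__name__}", (base_cls,), {"run_trial": run_trial})


events = []


def record(trial_id, epoch):
    events.append((trial_id, epoch))


# Choosing the strategy is a dictionary lookup, not a code change.
STRATEGIES = {"random": FakeRandomSearch, "hyperband": FakeHyperband}
tuner_cls = make_logging_strategy(STRATEGIES["hyperband"], record)
result = tuner_cls().run_trial("t0")
```

The factory never names a specific strategy, so adding another compatible base class needs no new logging subclass.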

kt-masterviz: making experiments observable while they run

Once the master CSV existed, the next problem became obvious.

A CSV is useful, but it is not always the best live interface.

During a tuning run, I wanted to see what was happening without repeatedly reopening files or watching terminal output. I wanted to see trial summaries, inspect curves, switch between metrics, and understand whether the experiment was behaving sensibly while it was still running.

That became kt-masterviz.

kt-masterviz is a small Streamlit dashboard that reads a master CSV produced by kt-masterlog, or any CSV with the same general shape.

It provides:

trial summaries sorted by objective metric
per-trial training and validation curves
switchable metrics such as loss, val_loss, accuracy, and val_accuracy
live reading that stays safe while a tuner is still writing to the file
auto-refresh, so completed trials appear as the sweep progresses
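Two of the items above, the sorted trial summary and reading a file that is still being written, can be sketched with the standard library. This is one possible approach under stated assumptions, not a description of how kt-masterviz does it: the column names trial_id and val_loss are illustrative, the objective is assumed to be minimised, and rows are assumed to contain no quoted fields with embedded newlines.

```python
import csv
import io
import os
import tempfile


def read_complete_rows(csv_path):
    """Read rows, ignoring a final line that has no trailing newline yet."""
    with open(csv_path, newline="") as handle:
        text = handle.read()
    # A writer that is mid-row leaves the last line unterminated; drop it.
    complete = text[: text.rfind("\n") + 1]
    return list(csv.DictReader(io.StringIO(complete, newline="")))


def summarise_trials(rows, objective="val_loss"):
    """Best (lowest) objective value per trial, sorted best first."""
    best = {}
    for row in rows:
        if row.get(objective) in (None, ""):
            continue
        value = float(row[objective])
        trial = row["trial_id"]
        if trial not in best or value < best[trial]:
            best[trial] = value
    return sorted(best.items(), key=lambda item: item[1])


# Build a small master CSV whose last row is only half written.
csv_path = os.path.join(tempfile.mkdtemp(), "master.csv")
with open(csv_path, "w", newline="") as handle:
    handle.write("trial_id,epoch,val_loss\n")
    handle.write("t0,0,0.90\nt0,1,0.70\n")
    handle.write("t1,0,0.80\nt1,1,0.55\n")
    handle.write("t2,0,0.")  # simulated partial write

rows = read_complete_rows(csv_path)
summary = summarise_trials(rows)
```

Dropping the unterminated line means a refresh can only ever lag by one row; it never shows a truncated value as if it were real.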

The dashboard is intentionally lightweight. It is not trying to become a full ML platform. It is there to make the tuning run easier to inspect.

That distinction is important.

Many workflow problems do not need a large platform answer. Sometimes the right answer is a small layer that removes a repeated point of friction.

In this case, the friction was visibility.

Learning curves are not decoration. They are evidence.

They show whether a model is learning, overfitting, plateauing, oscillating, or failing to improve. They help distinguish a genuinely promising configuration from one that simply produced a good final score. They also make it easier to discuss the experiment with someone else.

That makes the dashboard useful not only for the person running the experiment, but also for review and communication.
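As a rough illustration of reading a curve as evidence, the sketch below summarises one trial's loss history. The function name describe_curve and the two indicators it reports are assumptions chosen for this example, not part of kt-masterviz, and the numbers are made up to show a trial that overfits after its best epoch.

```python
def describe_curve(train_loss, val_loss):
    """Summarise a trial's learning curve with a few simple indicators."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_epoch,
        "best_val_loss": val_loss[best_epoch],
        "final_val_loss": val_loss[-1],
        # Positive when validation loss got worse after its best epoch.
        "val_rise_after_best": val_loss[-1] - val_loss[best_epoch],
        # Gap between validation and training loss at the last epoch.
        "final_gap": val_loss[-1] - train_loss[-1],
    }


# Illustrative values: training loss keeps falling while validation
# loss bottoms out at epoch 2 and then climbs.
train = [1.0, 0.6, 0.4, 0.25, 0.15]
val = [1.1, 0.8, 0.7, 0.75, 0.9]
report = describe_curve(train, val)
```

A trial whose best score sits several epochs before the end, with a widening gap, tells a different story from one whose validation loss is still falling, even if their best scores match.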

The manifest: connecting the stack without coupling it

To make the workflow smoother, kt-masterlog writes a small JSON manifest when a sweep starts.

The manifest is stored under:

~/.kt-masterlog/runs/

This lets kt-masterviz open the most recently started run with:

kt-masterviz --latest

That means the workflow can look like this:

# Terminal 1
python examples/synthetic_mlp/run_sweep.py

# Terminal 2
kt-masterviz --latest

This is a small feature, but it removes a manual step from every run.

Without the manifest, the user has to find the correct CSV path, copy it, and pass it to the dashboard. With the manifest, the logger and visualiser remain decoupled while still behaving like one workflow.

That was the design balance I wanted:

shared file contract
low coupling
simple runtime behaviour
minimal user friction

The dashboard does not need to import the logger. The logger does not need to know how the dashboard works. They coordinate through the on-disk format.

That keeps the stack easier to reason about and easier to extend.
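The coordination pattern can be sketched as a writer function and a reader function that share nothing but a directory of JSON files. The field names (run_id, csv_path, started_at), the one-file-per-run layout, and both function names are assumptions made for this example; the actual manifest format may differ. A temporary directory stands in for ~/.kt-masterlog/runs/.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(runs_dir, run_id, csv_path, started_at):
    """Writer side: record where this run's master CSV lives."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"run_id": run_id, "csv_path": str(csv_path), "started_at": started_at}
    path = runs_dir / f"{run_id}.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path


def latest_csv_path(runs_dir):
    """Reader side: pick the manifest with the most recent start time."""
    manifests = [json.loads(p.read_text()) for p in runs_dir.glob("*.json")]
    if not manifests:
        return None
    newest = max(manifests, key=lambda m: m["started_at"])
    return Path(newest["csv_path"])


# Two runs registered by the writer; the reader resolves the newest one.
runs_dir = Path(tempfile.mkdtemp()) / "runs"
write_manifest(runs_dir, "run-a", "/tmp/a/master.csv", started_at=100.0)
write_manifest(runs_dir, "run-b", "/tmp/b/master.csv", started_at=200.0)
latest = latest_csv_path(runs_dir)
```

Neither function imports the other's package; the only shared knowledge is the directory and the JSON keys.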

kt-masterdemo: proving the workflow end-to-end

The third piece is kt-masterdemo.

The purpose of the demo is not to show a high-performing model. It is not a benchmark repository. It is not meant to provide a worked solution to a specific machine learning exercise.

It exists to show the workflow.

The demo uses a small neutral toy dataset and a simple MLP classification example. That was deliberate. The point is to demonstrate the logging and visualisation flow without tying the repository to a specific coursework-style image classification task.

The demo shows how to:

run a small hyperparameter sweep
write trial and epoch metrics into a master CSV
create a run manifest
open the latest run in kt-masterviz
inspect trial summaries and learning curves live

A typical flow looks like this:

uv sync
uv run python examples/synthetic_mlp/run_sweep.py
uv run kt-masterviz --latest

The dataset is intentionally unimportant.

The pattern is the important part.

Design principles

Although this is a small stack, I tried to keep the design principles clear.

1. Separate responsibilities

Each package should do one thing well.

kt-masterlog writes structured experiment records.

kt-masterviz reads and visualises those records.

kt-masterdemo demonstrates the workflow.

That separation keeps the packages cleaner and makes them easier to use independently.

2. Prefer a simple file contract

The master CSV is the contract.

That means the visualisation layer does not need deep knowledge of the logging internals. It only needs a compatible shape.

This creates flexibility. Another logger could emit the same shape. Another dashboard could read the same CSV. A reporting tool could generate summaries from the same file.

The stable intermediate format is where much of the value sits.
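A file contract is easiest to rely on when it can be checked. The sketch below tests whether a CSV header offers the columns a reader would need. The required column names are a hypothetical minimum chosen for illustration; the columns kt-masterviz actually expects may differ.

```python
import csv
import io

# Hypothetical minimum contract; the real column names may differ.
REQUIRED_COLUMNS = ("trial_id", "epoch")


def missing_columns(csv_text, metrics=("val_loss",)):
    """Return the contract columns that the CSV header does not provide."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    wanted = list(REQUIRED_COLUMNS) + list(metrics)
    return [name for name in wanted if name not in header]


# A file from some other logger: different column order, extra columns.
from_other_logger = "epoch,trial_id,lr,loss,val_loss\n0,a,0.01,0.9,1.0\n"
# A file that does not follow the contract at all.
incomplete = "step,loss\n0,0.9\n"

compatible_gaps = missing_columns(from_other_logger)
incomplete_gaps = missing_columns(incomplete)
```

The check ignores column order and extra columns, which is what lets a different logger emit a compatible file without matching the original byte for byte.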

3. Avoid unnecessary runtime coupling

kt-masterviz does not import kt-masterlog at runtime.

That means users can install only the logger if they do not want Streamlit, or only the dashboard if they already have compatible CSVs.

This avoids turning a small utility into a heavier dependency chain.

4. Optimise for inspectability, not cleverness

The goal is not to hide complexity behind magic.

The goal is to make the experiment easier to inspect.

That is why the master CSV is deliberately simple. It is easy to open, easy to read, easy to archive, and easy to analyse with standard tools.

5. Keep the demo neutral

The demo is designed to show the workflow, not solve a specific modelling problem.

That matters because examples influence how tools are perceived. A clean demo should help people understand the system without implying that the model itself is the main contribution.

Why this pattern matters for AI engineering

The broader pattern is more important than the packages themselves.

AI engineering is not just about building models. It is about creating systems and workflows that help people make better decisions with confidence.

That includes the ability to observe what happened, compare alternatives, preserve evidence, and explain the reasoning behind a selected approach.

In a small experiment, this might feel like convenience.

In a team environment, it becomes engineering hygiene.

In a regulated or enterprise environment, it becomes part of the control fabric around model development.

The same principle appears in larger AI platform conversations:

model observability
experiment tracking
lineage
governance
reproducibility
auditability
decision traceability

This stack is intentionally much smaller than those enterprise platforms, but it reflects the same underlying idea: make the process visible enough that it can be trusted.

That is why I find the pattern interesting.

A relatively small amount of structure around an ML workflow can improve the quality of experimentation. It can reduce repeated manual work. It can make results easier to explain. It can also help teams move from “I ran a model and got a score” to “we can understand why this result is credible.”

That is a different level of maturity.

Installation

Install the logger:

pip install kt-masterlog

or:

uv add kt-masterlog

Install the dashboard:

pip install kt-masterviz

or:

uv add kt-masterviz

Open the dashboard against the most recently started run:

kt-masterviz --latest

Repositories

kt-masterlog: github.com/techspeque/kt-masterlog
kt-masterviz: github.com/techspeque/kt-masterviz
kt-masterdemo: github.com/techspeque/kt-masterdemo

PyPI

kt-masterlog: pypi.org/project/kt-masterlog
kt-masterviz: pypi.org/project/kt-masterviz

What I learned from building it

This started as a small frustration during model experimentation, but it reinforced a broader lesson: the workflow around machine learning often matters as much as the model code itself.

A model experiment is not just a script that produces a score. It is a process of observation, comparison, diagnosis, explanation, and iteration.

If that process is fragmented, the learning is weaker.

If that process is captured clearly, the experiment becomes easier to understand and easier to build on.

That is what this stack is trying to support.

Not a new modelling framework. Not a replacement for KerasTuner. Just a lightweight experiment observability layer that turns tuning runs into something more inspectable, comparable, and explainable.

In practice, that is often where the real productivity gain sits.