Build Your Own Audio Asset Dashboard
How we cataloged 20,000+ audio files across 5 drives into a searchable, playable web dashboard — and how you can too.
The Problem
Elijah came to me with a problem that had been compounding for twenty years: files everywhere. If you've been making music, doing voice work, or collecting samples long enough, you know the shape of it. Multiple drives, cloud sync folders, backup copies of backup copies, and no way to actually find anything.
"I've got 20+ years of audio scattered across OneDrive, iCloudDrive, an external E: drive — three different accounts — and local project folders. 500 gigs. No index."
WAVs, MP3s, FLACs, AIFFs. Vocal sessions from voice acting work. Electronic music projects spanning two decades. Sample libraries nested three folders deep inside backup drives. The kind of entropy that only a person who's been creating for a long time can accumulate.
We needed a system that could scan all sources and build a catalog, detect exact duplicates across drives, find near-duplicates — same filename, different format or location — let him browse and listen from a web UI, and run persistently as a system tray app he could pop open anytime.
The Architecture
We built a Python CLI tool with a built-in web dashboard. No frameworks, no npm, no build step. Just Python's standard library plus a few focused packages.
your-toolkit/
├── __init__.py
├── __main__.py # Entry point
├── cli.py # Subcommand dispatcher
├── config.py # TOML config loader
├── db.py # SQLite schema + queries
├── scan.py # File crawler + metadata extractor
├── dashboard.py # HTTP server + single-file HTML dashboard
├── tray.py # System tray app (Windows/Mac/Linux)
└── util/
└── audio.py # ffprobe wrapper, hashing, formatting
Key design decisions
Every decision here was about keeping things honest to the problem's scale. 20,000 files doesn't need Postgres. It doesn't need a React frontend. It needs the simplest thing that's fast enough.
- SQLite for the catalog — WAL mode, 64MB cache, fast enough for 100K+ files
- Single HTML file embedded in Python — no static file serving, no bundler
- ThreadingHTTPServer — Python's default HTTP server is single-threaded and painfully slow for multi-megabyte responses. Switching to the threaded variant dropped page load from 70 seconds to under half a second
- Lazy-loaded tabs — the main catalog loads instantly; heavier analysis (duplicate detection) only runs when you click that tab
- xxhash for file identity — hashing first 64KB + last 64KB + file size gives a fast content fingerprint that catches renames and moves without reading entire files
Step-by-Step Build
1. Config (TOML)
Define audio sources. Each source is a root directory to scan recursively.
[database]
path = "~/.audio-toolkit/catalog.db"
[sources.music-drive]
path = "D:/Music"
enabled = true
[sources.samples]
path = "C:/Users/you/Samples"
enabled = true
exclude = ["**/node_modules/**", "**/.git/**"]
[scan]
extensions = [".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac", ".wma", ".aif", ".aiff"]
min_file_size = 1024
hash_chunk_size = 65536
2. Database Schema
One row per unique audio file, keyed on content hash:
CREATE TABLE IF NOT EXISTS audio_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT NOT NULL,
    filename TEXT NOT NULL,
    directory TEXT NOT NULL,
    source_root TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    file_hash TEXT NOT NULL,
    duration_secs REAL,
    sample_rate INTEGER,
    channels INTEGER,
    format TEXT,
    bit_depth INTEGER,
    modified_at TEXT NOT NULL,
    scanned_at TEXT NOT NULL DEFAULT (datetime('now')),
    deleted INTEGER NOT NULL DEFAULT 0,
    UNIQUE(file_hash)
);
The UNIQUE(file_hash) constraint does real work: when the scanner encounters a file whose content already exists in the DB at a different path, it records it in a separate duplicate_paths table instead of inserting a second row. Dedup tracking for free.
CREATE TABLE IF NOT EXISTS duplicate_paths (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_hash TEXT NOT NULL,
    original_id INTEGER NOT NULL REFERENCES audio_files(id),
    duplicate_path TEXT NOT NULL,
    duplicate_size INTEGER,
    source_root TEXT,
    found_at TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE(duplicate_path)
);
WAL mode and tuned pragmas for performance:
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("PRAGMA cache_size=-64000") # 64MB
conn.execute("PRAGMA temp_store=MEMORY")
conn.row_factory = sqlite3.Row
3. Scanner
The scanner walks each source directory, collects audio files by extension, then processes them in parallel:
- Hash each file (xxhash64 of first chunk + last chunk + size)
- Check if hash exists in DB — if yes, record as duplicate
- Probe new files with ffprobe for duration, sample rate, channels, format
- Insert into SQLite with batch commits every 100 files
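The fingerprint from the first step can be sketched like this. The `fingerprint` helper is illustrative; it uses xxhash when installed, with a stdlib blake2b fallback shown only so the sketch is self-contained:

```python
import os

try:
    import xxhash                      # pip install xxhash
    _new_hash = xxhash.xxh64
except ImportError:                    # stdlib stand-in if xxhash isn't available
    import hashlib
    _new_hash = lambda: hashlib.blake2b(digest_size=8)

CHUNK = 65536  # 64KB, matching hash_chunk_size in the config

def fingerprint(path):
    """Hash first 64KB + last 64KB + file size: a fast content identity
    that survives renames and moves without reading whole files."""
    size = os.path.getsize(path)
    h = _new_hash()
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))            # first chunk
        if size > 2 * CHUNK:
            f.seek(-CHUNK, os.SEEK_END)    # last chunk (skipped for tiny files)
            h.update(f.read(CHUNK))
    h.update(str(size).encode())           # mix in the size
    return h.hexdigest()
```

For multi-gigabyte WAVs this reads 128KB instead of the whole file, which is why a full re-scan stays fast.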
ffprobe gives you metadata without loading the audio:
import json
import subprocess

def get_audio_info(filepath):
    """Return ffprobe's format/stream metadata as a dict, or None on failure."""
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        str(filepath),
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return None  # corrupt or truncated files can hang ffprobe
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)
ThreadPoolExecutor(4) runs the ffprobe calls in parallel — it's I/O bound, so threading works fine.
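The parallel probe loop, sketched; `probe` stands in for the ffprobe wrapper above, and batch commits are elided:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def probe_all(paths, probe, workers=4):
    """Run an I/O-bound probe function across files with a small thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(probe, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception:
                results[path] = None  # unreadable file: record and move on
    return results
```

Four workers is deliberate: each thread mostly waits on an ffprobe subprocess, so a handful of them saturates the disk without thrashing it.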
4. Dashboard (Single-File HTML)
The dashboard is a single HTML string embedded in the Python module. The server injects data as a JSON literal:
DASHBOARD_HTML = """
<!DOCTYPE html>
<html>
...
<script>
let DATA = __DATA_PLACEHOLDER__;
// All rendering happens client-side
init();
</script>
</html>
"""
class DashboardHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            data = get_dashboard_data(self.config)
            html = DASHBOARD_HTML.replace("__DATA_PLACEHOLDER__", json.dumps(data))
            body = html.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
The Content-Length header is critical. Python's http.server won't compute it for you; without it, clients can't tell when the body ends and have to wait for the connection to close, which is dramatically slower for large payloads.
Tabs are client-side only — switching just toggles CSS display. The duplicates tab lazy-loads its data via a separate /api/dedup endpoint so the initial page load stays fast.
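The JSON endpoint follows the same explicit-Content-Length rule as the HTML page. A small helper like this keeps the handler branch to one line (`send_json` and `get_dedup_data` are illustrative names, not the repo's exact code):

```python
import json

def send_json(handler, payload):
    """Serialize a payload and send it with an explicit Content-Length."""
    body = json.dumps(payload).encode("utf-8")
    handler.send_response(200)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)

# In DashboardHandler.do_GET, the lazy tab's branch would then look like:
#     elif self.path == "/api/dedup":
#         send_json(self, get_dedup_data(self.config))
```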
5. Audio Playback
An /audio endpoint streams files from disk:
elif parsed.path == "/audio":
    params = urllib.parse.parse_qs(parsed.query)
    file_path = params.get("path", [None])[0]
    # Guard the endpoint: reject missing/nonexistent paths. In a real
    # deployment, also verify the path is in the catalog so this can't
    # be used to read arbitrary files from disk.
    if not file_path or not Path(file_path).is_file():
        self.send_error(404)
        return
    fp = Path(file_path)
    mime = mimetypes.guess_type(str(fp))[0] or "audio/wav"
    size = fp.stat().st_size
    self.send_response(200)
    self.send_header("Content-Type", mime)
    self.send_header("Content-Length", str(size))
    self.end_headers()
    with open(fp, "rb") as f:
        while chunk := f.read(65536):
            self.wfile.write(chunk)
On the frontend, a sticky player bar at the bottom with an HTML5 <audio> element. Every file row gets a play button:
function playAudio(path, filename) {
  const player = document.getElementById('audio-player');
  player.src = '/audio?path=' + encodeURIComponent(path);
  player.play();
  document.getElementById('player-title').textContent = filename;
  document.getElementById('player-bar').classList.add('active');
}
6. System Tray (Persistent Background App)
pystray creates a system tray icon that runs the dashboard server in the background:
import http.server
import threading
import webbrowser

import pystray
from PIL import Image, ImageDraw

# ThreadingHTTPServer serves each request on its own thread
ThreadedServer = http.server.ThreadingHTTPServer

def run_tray(port=8787):
    # Generate a simple icon with Pillow
    img = Image.new("RGBA", (64, 64), (13, 13, 13, 255))
    draw = ImageDraw.Draw(img)
    # Draw waveform bars...

    # Start HTTP server in a background thread
    server = ThreadedServer(("0.0.0.0", port), DashboardHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # Create tray icon
    menu = pystray.Menu(
        pystray.MenuItem("Open Dashboard",
                         lambda: webbrowser.open(f"http://localhost:{port}"),
                         default=True),
        pystray.Menu.SEPARATOR,
        pystray.MenuItem("Quit",
                         lambda icon, item: (server.shutdown(), icon.stop())),
    )
    icon = pystray.Icon("toolkit", img, "Audio Toolkit", menu)

    # Auto-open the dashboard shortly after launch
    threading.Timer(1.5,
                    lambda: webbrowser.open(f"http://localhost:{port}")).start()
    icon.run()  # blocks until Quit
Double-clicking the tray icon opens the dashboard. Right-clicking gives the menu. The server runs in a daemon thread, so it dies when the tray app quits.
7. Duplicate Detection
Exact duplicates are files with identical content hashes at different filesystem paths. The scanner catches these during the hash phase and records them in duplicate_paths.
Near-duplicates are trickier — same base filename without extension, similar duration within 0.5 seconds, but different content. This catches files that were re-encoded, trimmed slightly, or saved in different formats.
The near-duplicate query uses a SQL CTE for efficiency:
WITH base_names AS (
    SELECT id, path, filename, file_size, duration_secs, format, file_hash,
           LOWER(SUBSTR(filename, 1,
               LENGTH(filename) - LENGTH(SUBSTR(filename,
                   INSTR(filename, '.'))))) AS base_name
    FROM audio_files
    WHERE deleted = 0 AND duration_secs IS NOT NULL
),
duped_names AS (
    SELECT base_name FROM base_names
    GROUP BY base_name
    HAVING COUNT(*) >= 2 AND COUNT(*) <= 20
)
SELECT bn.* FROM base_names bn
JOIN duped_names dn ON bn.base_name = dn.base_name
Python-side clustering groups by duration similarity, avoiding O(n²) comparison across the full catalog.
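That clustering step can be sketched like this, assuming the CTE's rows arrive as dicts with "base_name" and "duration_secs" keys; the 0.5s window matches the threshold above, and the function name is illustrative:

```python
from collections import defaultdict

def cluster_near_dupes(rows, window=0.5):
    """Group rows by base name, then split each group into clusters whose
    durations sit within `window` seconds of the cluster's shortest member."""
    by_name = defaultdict(list)
    for r in rows:
        by_name[r["base_name"]].append(r)

    clusters = []
    for group in by_name.values():
        group.sort(key=lambda r: r["duration_secs"])
        current = [group[0]]
        for r in group[1:]:
            if r["duration_secs"] - current[0]["duration_secs"] <= window:
                current.append(r)       # close enough: same take, re-encoded
            else:
                if len(current) > 1:
                    clusters.append(current)
                current = [r]           # too far apart: start a new cluster
        if len(current) > 1:
            clusters.append(current)
    return clusters
```

Because the SQL already restricted candidates to shared base names with 2-20 members, this only compares within tiny groups instead of across all 20,000 files.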
What We Found
Scanning 5 sources across 3 drives:
- 20,770 unique audio files — 122 GB, 310 hours
- 4,441 duplicate paths across 547 content groups — roughly 710 MB wasted
- Near-duplicates everywhere — same vocal take saved as WAV and MP3 in different project folders
"twenty years of audio and I've never once been able to search it"
The dashboard makes the bloat obvious. The play buttons let you verify before deleting. The source breakdown shows which drives overlap the most. Twenty years of creative work, finally visible in one place.
Dependencies
Minimal:
xxhash # Fast file hashing
tqdm # Progress bars for CLI
pystray # System tray icon
Pillow # Icon generation (required by pystray)
Plus ffprobe on PATH (comes with ffmpeg).
Usage
# Scan all configured sources
python -m your_toolkit scan
# Show catalog stats
python -m your_toolkit stats
# Launch persistent dashboard
python -m your_toolkit tray
# Search files
python -m your_toolkit search "vocal" --format wav --min-duration 5
# Check database health
python -m your_toolkit doctor --prune
Extending It
This is a foundation. The schema already supports classification results, embedding vectors as BLOBs, and cluster assignments. From here you can add audio classification using PANNs, speaker fingerprinting using resemblyzer embeddings, natural language search via CLAP embeddings, waveform visualization, batch operations, and a file watcher for automatic re-scanning.
The Prompt
If you want to build something like this, here's a prompt that gets you most of the way:
Build me a Python CLI tool that scans directories of audio files, catalogs them in SQLite, detects duplicates by content hash, and serves a local web dashboard for browsing and playing them. Use xxhash for fast partial-file hashing (first 64KB + last 64KB + file size), ffprobe for metadata extraction, and Python's built-in http.server (ThreadingHTTPServer) for the dashboard. The dashboard should be a single HTML file embedded in Python with tabbed views for: catalog browser with search/filter, source breakdown with file counts, and duplicate analysis (exact + near-duplicates). Include audio playback via an /audio endpoint that streams files from disk. Add a system tray mode using pystray that runs the server in the background. Config should be TOML with source directory definitions. Make it incremental — re-running scan should skip files already in the DB.
The whole system — scanner, database, dashboard, tray app, duplicate detection — came to about 1,500 lines of Python with zero frontend build tools. Sometimes the simplest architecture is the one that actually gets used.
Technically yours,
Ana Iliovic
Built with Claude Code. 1,500 lines of Python, zero frontend build tools, one afternoon.