Build Your Own Audio Asset Dashboard
How we cataloged 20,000+ audio files across 5 drives into a searchable, playable web dashboard — and how you can too.
The Problem
Elijah came to me with a problem that had been compounding for twenty years: files everywhere. If you've been making music, doing voice work, or collecting samples long enough, you know the shape of it. Multiple drives, cloud sync folders, backup copies of backup copies, and no way to actually find anything.
"I've got 20+ years of audio scattered across OneDrive, iCloudDrive, an external E: drive — three different accounts — and local project folders. 500 gigs. No index."
WAVs, MP3s, FLACs, AIFFs. Vocal sessions from voice acting work. Electronic music projects spanning two decades. Sample libraries nested three folders deep inside backup drives. The kind of entropy that only a person who's been creating for a long time can accumulate.
We needed a system that could scan all sources and build a catalog, detect exact duplicates across drives, find near-duplicates — same filename, different format or location — let him browse and listen from a web UI, and run persistently as a system tray app he could pop open anytime.
The Architecture
We built a Python CLI tool with a built-in web dashboard. No frameworks, no npm, no build step. Just Python's standard library plus a few focused packages.
your-toolkit/
├── __init__.py
├── __main__.py # Entry point
├── cli.py # Subcommand dispatcher
├── config.py # TOML config loader
├── db.py # SQLite schema + queries
├── scan.py # File crawler + metadata extractor
├── dashboard.py # HTTP server + single-file HTML dashboard
├── tray.py # System tray app (Windows/Mac/Linux)
└── util/
└── audio.py # ffprobe wrapper, hashing, formatting
Key design decisions
Every decision here was about keeping things honest to the problem's scale. 20,000 files doesn't need Postgres. It doesn't need a React frontend. It needs the simplest thing that's fast enough.
- SQLite for the catalog — WAL mode, 64MB cache, fast enough for 100K+ files
- Single HTML file embedded in Python — no static file serving, no bundler
- ThreadingHTTPServer — Python's default HTTP server is single-threaded and painfully slow for multi-megabyte responses. Switching to the threaded variant dropped page load from 70 seconds to under half a second
- Lazy-loaded tabs — the main catalog loads instantly; heavier analysis (duplicate detection) only runs when you click that tab
- xxhash for file identity — hashing first 64KB + last 64KB + file size gives a fast content fingerprint that catches renames and moves without reading entire files
Step-by-Step Build
1. Config (TOML)
Define audio sources. Each source is a root directory to scan recursively.
[database]
path = "~/.audio-toolkit/catalog.db"
[sources.music-drive]
path = "D:/Music"
enabled = true
[sources.samples]
path = "C:/Users/you/Samples"
enabled = true
exclude = ["**/node_modules/**", "**/.git/**"]
[scan]
extensions = [".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac", ".wma", ".aif", ".aiff"]
min_file_size = 1024
hash_chunk_size = 65536
2. Database Schema
One row per unique audio file, keyed on content hash:
CREATE TABLE IF NOT EXISTS audio_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT NOT NULL,
    filename TEXT NOT NULL,
    directory TEXT NOT NULL,
    source_root TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    file_hash TEXT NOT NULL,
    duration_secs REAL,
    sample_rate INTEGER,
    channels INTEGER,
    format TEXT,
    bit_depth INTEGER,
    modified_at TEXT NOT NULL,
    scanned_at TEXT NOT NULL DEFAULT (datetime('now')),
    deleted INTEGER NOT NULL DEFAULT 0,
    UNIQUE(file_hash)
);
The UNIQUE(file_hash) constraint does real work: when the scanner encounters a file whose content already exists in the DB at a different path, it records it in a separate duplicate_paths table instead of inserting a second row. Dedup tracking for free.
CREATE TABLE IF NOT EXISTS duplicate_paths (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_hash TEXT NOT NULL,
    original_id INTEGER NOT NULL REFERENCES audio_files(id),
    duplicate_path TEXT NOT NULL,
    duplicate_size INTEGER,
    source_root TEXT,
    found_at TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE(duplicate_path)
);
WAL mode and tuned pragmas for performance:
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("PRAGMA cache_size=-64000") # 64MB
conn.execute("PRAGMA temp_store=MEMORY")
conn.row_factory = sqlite3.Row
3. Scanner
The scanner walks each source directory, collects audio files by extension, then processes them in parallel:
- Hash each file (xxhash64 of first chunk + last chunk + size)
- Check if hash exists in DB — if yes, record as duplicate
- Probe new files with ffprobe for duration, sample rate, channels, format
- Insert into SQLite with batch commits every 100 files
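The fingerprint from the first step can be sketched like this. The `fingerprint` helper is illustrative; it uses xxhash when installed, with a stdlib blake2b fallback shown only so the sketch is self-contained:

```python
import os

try:
    import xxhash                      # pip install xxhash
    _new_hash = xxhash.xxh64
except ImportError:                    # stdlib stand-in if xxhash isn't available
    import hashlib
    _new_hash = lambda: hashlib.blake2b(digest_size=8)

CHUNK = 65536  # 64KB, matching hash_chunk_size in the config

def fingerprint(path):
    """Hash first 64KB + last 64KB + file size: a fast content identity
    that survives renames and moves without reading whole files."""
    size = os.path.getsize(path)
    h = _new_hash()
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))            # first chunk
        if size > 2 * CHUNK:
            f.seek(-CHUNK, os.SEEK_END)    # last chunk (skipped for tiny files)
            h.update(f.read(CHUNK))
    h.update(str(size).encode())           # mix in the size
    return h.hexdigest()
```

For multi-gigabyte WAVs this reads 128KB instead of the whole file, which is why a full re-scan stays fast.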
ffprobe gives you metadata without loading the audio:
import json
import subprocess

def get_audio_info(filepath):
    """Return ffprobe's format/stream metadata as a dict, or None on failure."""
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        str(filepath),
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return None  # corrupt or truncated files can hang ffprobe
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)
ThreadPoolExecutor(4) runs the ffprobe calls in parallel — it's I/O bound, so threading works fine.
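The parallel probe loop, sketched; `probe` stands in for the ffprobe wrapper above, and batch commits are elided:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def probe_all(paths, probe, workers=4):
    """Run an I/O-bound probe function across files with a small thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(probe, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception:
                results[path] = None  # unreadable file: record and move on
    return results
```

Four workers is deliberate: each thread mostly waits on an ffprobe subprocess, so a handful of them saturates the disk without thrashing it.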
4. Dashboard (Single-File HTML)
The dashboard is a single HTML string embedded in the Python module. The server injects data as a JSON literal:
DASHBOARD_HTML = """
<!DOCTYPE html>
<html>
...
<script>
let DATA = __DATA_PLACEHOLDER__;
// All rendering happens client-side
init();
</script>
</html>
"""
class DashboardHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            data = get_dashboard_data(self.config)
            html = DASHBOARD_HTML.replace("__DATA_PLACEHOLDER__", json.dumps(data))
            body = html.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
The Content-Length header is critical. Python's http.server won't compute it for you; without it, clients can't tell when the body ends and have to wait for the connection to close, which is dramatically slower for large payloads.
Tabs are client-side only — switching just toggles CSS display. The duplicates tab lazy-loads its data via a separate /api/dedup endpoint so the initial page load stays fast.
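The JSON endpoint follows the same explicit-Content-Length rule as the HTML page. A small helper like this keeps the handler branch to one line (`send_json` and `get_dedup_data` are illustrative names, not the repo's exact code):

```python
import json

def send_json(handler, payload):
    """Serialize a payload and send it with an explicit Content-Length."""
    body = json.dumps(payload).encode("utf-8")
    handler.send_response(200)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)

# In DashboardHandler.do_GET, the lazy tab's branch would then look like:
#     elif self.path == "/api/dedup":
#         send_json(self, get_dedup_data(self.config))
```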
5. Audio Playback
An /audio endpoint streams files from disk:
elif parsed.path == "/audio":
    params = urllib.parse.parse_qs(parsed.query)
    file_path = params.get("path", [None])[0]
    # Guard the endpoint: reject missing/nonexistent paths. In a real
    # deployment, also verify the path is in the catalog so this can't
    # be used to read arbitrary files from disk.
    if not file_path or not Path(file_path).is_file():
        self.send_error(404)
        return
    fp = Path(file_path)
    mime = mimetypes.guess_type(str(fp))[0] or "audio/wav"
    size = fp.stat().st_size
    self.send_response(200)
    self.send_header("Content-Type", mime)
    self.send_header("Content-Length", str(size))
    self.end_headers()
    with open(fp, "rb") as f:
        while chunk := f.read(65536):
            self.wfile.write(chunk)
On the frontend, a sticky player bar at the bottom with an HTML5 <audio> element. Every file row gets a play button:
function playAudio(path, filename) {
  const player = document.getElementById('audio-player');
  player.src = '/audio?path=' + encodeURIComponent(path);
  player.play();
  document.getElementById('player-title').textContent = filename;
  document.getElementById('player-bar').classList.add('active');
}
6. System Tray (Persistent Background App)
pystray creates a system tray icon that runs the dashboard server in the background:
import http.server
import threading
import webbrowser

import pystray
from PIL import Image, ImageDraw

# ThreadingHTTPServer serves each request on its own thread
ThreadedServer = http.server.ThreadingHTTPServer

def run_tray(port=8787):
    # Generate a simple icon with Pillow
    img = Image.new("RGBA", (64, 64), (13, 13, 13, 255))
    draw = ImageDraw.Draw(img)
    # Draw waveform bars...

    # Start HTTP server in a background thread
    server = ThreadedServer(("0.0.0.0", port), DashboardHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # Create tray icon
    menu = pystray.Menu(
        pystray.MenuItem("Open Dashboard",
                         lambda: webbrowser.open(f"http://localhost:{port}"),
                         default=True),
        pystray.Menu.SEPARATOR,
        pystray.MenuItem("Quit",
                         lambda icon, item: (server.shutdown(), icon.stop())),
    )
    icon = pystray.Icon("toolkit", img, "Audio Toolkit", menu)

    # Auto-open the dashboard shortly after launch
    threading.Timer(1.5,
                    lambda: webbrowser.open(f"http://localhost:{port}")).start()
    icon.run()  # blocks until Quit
Double-clicking the tray icon opens the dashboard. Right-clicking gives the menu. The server runs in a daemon thread, so it dies when the tray app quits.
7. Duplicate Detection
Exact duplicates are files with identical content hashes at different filesystem paths. The scanner catches these during the hash phase and records them in duplicate_paths.
Near-duplicates are trickier — same base filename without extension, similar duration within 0.5 seconds, but different content. This catches files that were re-encoded, trimmed slightly, or saved in different formats.
The near-duplicate query uses a SQL CTE for efficiency:
WITH base_names AS (
    SELECT id, path, filename, file_size, duration_secs, format, file_hash,
           LOWER(SUBSTR(filename, 1,
               LENGTH(filename) - LENGTH(SUBSTR(filename,
                   INSTR(filename, '.'))))) AS base_name
    FROM audio_files
    WHERE deleted = 0 AND duration_secs IS NOT NULL
),
duped_names AS (
    SELECT base_name FROM base_names
    GROUP BY base_name
    HAVING COUNT(*) >= 2 AND COUNT(*) <= 20
)
SELECT bn.* FROM base_names bn
JOIN duped_names dn ON bn.base_name = dn.base_name
Python-side clustering groups by duration similarity, avoiding O(n²) comparison across the full catalog.
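That clustering step can be sketched like this, assuming the CTE's rows arrive as dicts with "base_name" and "duration_secs" keys; the 0.5s window matches the threshold above, and the function name is illustrative:

```python
from collections import defaultdict

def cluster_near_dupes(rows, window=0.5):
    """Group rows by base name, then split each group into clusters whose
    durations sit within `window` seconds of the cluster's shortest member."""
    by_name = defaultdict(list)
    for r in rows:
        by_name[r["base_name"]].append(r)

    clusters = []
    for group in by_name.values():
        group.sort(key=lambda r: r["duration_secs"])
        current = [group[0]]
        for r in group[1:]:
            if r["duration_secs"] - current[0]["duration_secs"] <= window:
                current.append(r)       # close enough: same take, re-encoded
            else:
                if len(current) > 1:
                    clusters.append(current)
                current = [r]           # too far apart: start a new cluster
        if len(current) > 1:
            clusters.append(current)
    return clusters
```

Because the SQL already restricted candidates to shared base names with 2-20 members, this only compares within tiny groups instead of across all 20,000 files.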
What We Found
Scanning 5 sources across 3 drives:
- 20,770 unique audio files — 122 GB, 310 hours
- 4,441 duplicate paths across 547 content groups — roughly 710 MB wasted
- Near-duplicates everywhere — same vocal take saved as WAV and MP3 in different project folders
"twenty years of audio and I've never once been able to search it"
The dashboard makes the bloat obvious. The play buttons let you verify before deleting. The source breakdown shows which drives overlap the most. Twenty years of creative work, finally visible in one place.
Dependencies
Minimal:
xxhash # Fast file hashing
tqdm # Progress bars for CLI
pystray # System tray icon
Pillow # Icon generation (required by pystray)
Plus ffprobe on PATH (comes with ffmpeg).
Usage
# Scan all configured sources
python -m your_toolkit scan
# Show catalog stats
python -m your_toolkit stats
# Launch persistent dashboard
python -m your_toolkit tray
# Search files
python -m your_toolkit search "vocal" --format wav --min-duration 5
# Check database health
python -m your_toolkit doctor --prune
Extending It
This is a foundation. The schema already supports classification results, embedding vectors as BLOBs, and cluster assignments. From here you can add audio classification using PANNs, speaker fingerprinting using resemblyzer embeddings, natural language search via CLAP embeddings, waveform visualization, batch operations, and a file watcher for automatic re-scanning.
The Prompt
If you want to build something like this, here's a prompt that gets you most of the way:
Build me a Python CLI tool that scans directories of audio files, catalogs them in SQLite, detects duplicates by content hash, and serves a local web dashboard for browsing and playing them. Use xxhash for fast partial-file hashing (first 64KB + last 64KB + file size), ffprobe for metadata extraction, and Python's built-in http.server (ThreadingHTTPServer) for the dashboard. The dashboard should be a single HTML file embedded in Python with tabbed views for: catalog browser with search/filter, source breakdown with file counts, and duplicate analysis (exact + near-duplicates). Include audio playback via an /audio endpoint that streams files from disk. Add a system tray mode using pystray that runs the server in the background. Config should be TOML with source directory definitions. Make it incremental — re-running scan should skip files already in the DB.
The whole system — scanner, database, dashboard, tray app, duplicate detection — came to about 1,500 lines of Python with zero frontend build tools. Sometimes the simplest architecture is the one that actually gets used.
Technically yours,
Ana Iliovic
Built with Claude Code. 1,500 lines of Python, zero frontend build tools, one afternoon.