Lab Log 010 · Part 1 of 2

Building Ladon —
A Static Document Security Analyzer

Analyst: Yana Ivanov
Published: April 2026
Category: Lab Log · Tool Build
Stack: Python · Flask · Railway
Tool: Ladon · document_triage.py
Read Time: 15 minutes
STATIC ANALYSIS ONLY · NO FILES EXECUTED · EDUCATIONAL USE · SITEWAVE STUDIO LLC
Section 01

Why This Tool Exists

The problem I kept running into while reading threat reports was a practical one. Someone receives a suspicious PDF. Their antivirus doesn't flag it. Their email gateway delivered it. They have no way to know if it's safe to open — not really. The advice is always "be careful with attachments" but there's no tool a non-technical person can use to actually check what's inside a file before they open it.

I decided to build one. I had no prior experience writing static file analysis tools. What I had was Python, a pile of file format specifications, and several months of reading threat reports that kept showing the same pattern: the threat was never where the existing defenses were looking. It was inside the attachment — inside the PDF, inside the image, inside the calendar invite — in places that email gateways don't reach and that users have no visibility into at all.

This lab documents how I built document_triage.py — the analysis engine at the core of Ladon. It's not a retrospective of a clean build. It's a record of the decisions made, the things that didn't work the first time, and what I learned from reading specifications that were never written with attackers in mind. The companion report The Document You Trusted covers the context and real-world validation. This lab covers the build.

A note on how this was built: I'm not a developer. My background is UX design and enterprise technology, not software engineering. I built Ladon the way I approach most problems — through research, pattern recognition, and iteration. The detection methodology came from reading threat intelligence and file format specifications. The Python came from working with Claude as a coding collaborator to translate that methodology into working code. The UX decisions — plain-English explanations, drag-and-drop interface, results readable by a non-technical employee — came from 15 years of designing for real users. That combination is intentional. Most security tools are built by engineers for engineers. Ladon was designed for the person who just received a suspicious PDF and has no idea what to do with it.

Build Environment
Development Host: macOS · Apple Silicon
Language: Python 3.12
Backend Framework: Flask 3.x
Deployment: Railway · Auto-deploy on push
Key Libraries: Pillow · pyzbar · zlib · struct
Repo: Private · SiteWave Studio LLC
Section 02

Architecture — Three Layers,
One Design Principle

The first decision I made was the most important one: the tool must never execute anything it analyzes. Everything else followed from that. Reading bytes is safe. Interpreting them as a runnable program is not. If the tool executed a malicious file to figure out what it did, it would defeat the entire purpose of triage. So Ladon reads bytes — nothing more.

The build ended up with three distinct layers, each with a single responsibility. Separating them cleanly took a few iterations to get right.

1. Analysis engine (document_triage.py)
Pure Python. No web framework. Takes a file path, reads the raw bytes, runs all applicable detection modules, and returns a structured results dictionary. It can be run from the command line independently of any web interface. This separation matters: the engine is the tool, and the web layer is just delivery.
2. Flask backend (ladon_server.py)
A minimal Flask application with two routes: GET / serves the frontend HTML, and POST /analyze accepts a file upload, calls the engine, and returns JSON results. The server never keeps the uploaded file. It writes the upload to a temporary file for the duration of the analysis and deletes it as soon as the engine returns.
3. Frontend
Single-file HTML/CSS/JS. No framework dependencies. Drag-and-drop file upload, real-time analysis results rendered from the JSON response, severity badge, per-module finding cards with plain-English explanations. Designed to be readable by a non-technical employee, not just a security analyst.
What Ladon Does

✓  Reads raw bytes from the uploaded file

✓  Parses file structure against known patterns

✓  Scores findings by severity

✓  Returns structured results — no side effects

✓  Discards the file immediately after analysis

Zero Execution Guarantee

✗  Does not execute or interpret the file

✗  Does not open it in a renderer or viewer

✗  Does not follow embedded URLs

✗  Does not store or share the uploaded file

✗  Does not send it anywhere — ever

Why static analysis matters here: Dynamic analysis — sandboxing, executing the file and observing behavior — is powerful but requires an isolated environment and carries real risk. Static analysis reads the structure and content of a file without triggering anything. For a triage tool designed to be used by non-technical employees before they open a suspicious attachment, static analysis is the only appropriate approach. The file cannot detonate during analysis because nothing is interpreting it.

Section 03

The Detection Modules —
What Each One Looks For

Each module targets a specific attack surface independently. Their findings feed into a shared severity score. A file can trigger multiple modules at once — a malicious PDF with an embedded QR code pointing to a typosquatted domain would fire PDF structure, polyglot detection, barcode analysis, and URL analysis simultaneously. The modules were built one at a time with Claude — defining what each needed to detect, testing against synthetic samples, then adjusting until the output looked right.

Module 1 — PDF Structure Analysis

The PDF specification includes a rich set of actions that legitimate documents rarely use but that attackers routinely abuse. The module scans the raw PDF byte stream for dangerous keys and counts occurrences of each.

PDF Key | What It Does | Why It's Dangerous
/JavaScript, /JS | Embeds executable JavaScript | Runs code when document opens
/OpenAction | Triggers action on document open | Auto-executes without user interaction
/AA | Additional Actions | Executes code on page open/close/print
/Launch | Launches external application | Can spawn shell commands
/EmbeddedFile | Contains embedded file attachment | Dropper for secondary payload
/ObjStm | Object stream | Commonly used to hide malicious objects from scanners
/URI | External URI reference | Tracking pixel; phones home on render
/SubmitForm | Form submission action | Data exfiltration on open
Core PDF Analysis Loop
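# data: raw file content as bytes
# PDF_DANGEROUS_KEYS: maps byte-string keys such as b'/JS' to their descriptions
for key, description in PDF_DANGEROUS_KEYS.items():
    count = data.count(key)  # one scan per key; 0 means the key is absent
    if count:
        results['dangerous_keys'].append({
            'key':         key.decode('latin-1'),
            'description': description,
            'count':       count,
        })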
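
One limitation of plain substring counting is that a short key such as /JS also matches inside longer names like /JSON, and /AA matches inside /AAPL. The following is a minimal sketch of a boundary-aware variant, not the code in document_triage.py. The names DANGEROUS_KEYS, NAME_END, and scan_pdf_keys are hypothetical, and the key table is a trimmed-down subset of the one above.

```python
import re

# Hypothetical subset of the key table; descriptions follow the table above.
DANGEROUS_KEYS = {
    b'/JavaScript': 'Embeds executable JavaScript',
    b'/JS':         'Embeds executable JavaScript',
    b'/OpenAction': 'Triggers action on document open',
    b'/AA':         'Additional Actions',
    b'/Launch':     'Launches external application',
}

# A PDF name ends at whitespace or a delimiter character, so require one
# of those (or the end of the data) right after the key.
NAME_END = rb'(?=[\s\x00/\[\]<>(){}%]|\Z)'


def scan_pdf_keys(data: bytes) -> list:
    """Count whole-name occurrences of each key in the raw bytes."""
    findings = []
    for key, description in DANGEROUS_KEYS.items():
        pattern = re.compile(re.escape(key) + NAME_END)
        count = len(pattern.findall(data))
        if count:
            findings.append({
                'key':         key.decode('latin-1'),
                'description': description,
                'count':       count,
            })
    return findings


sample = b'<< /Type /Catalog /OpenAction 5 0 R /JS (app.alert(1)) /JSON 1 /AAPL 2 >>'
for finding in scan_pdf_keys(sample):
    print(finding['key'], finding['count'])
```

PDF names can also be written with #xx hex escapes (for example /J#53 for /JS). This sketch does not normalize those, so it would miss keys obfuscated that way.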

Module 2 — Polyglot Detection

A polyglot file is simultaneously valid in two different file formats. A PDF that also contains a Windows PE executable is the most common attack variant — the file passes type checks because it has a valid PDF header, but it carries a runnable executable in its body. Ladon scans the entire file for secondary magic byte signatures, not just the first few bytes.

Secondary Signature Scan
secondary_checks = [
    (b'\x50\x4b\x03\x04', 'ZIP/PKPASS/DOCX',  50),
    (b'\x25\x50\x44\x46', 'PDF',              50),
    (b'\x4d\x5a',         'Windows PE (EXE)', 50),  # MZ header
    (b'\x7f\x45\x4c\x46', 'Linux ELF',        50),
]

findings = []
for magic, description, min_offset in secondary_checks:
    # Start at min_offset so the file's own header (first 50 bytes) is skipped
    idx = data.find(magic, min_offset)
    if idx != -1:
        # Executable signatures (EXE or ELF) are CRITICAL, per the scoring table; others are HIGH
        is_executable = 'EXE' in description or 'ELF' in description
        findings.append({
            'type':     'SECONDARY_SIGNATURE',
            'detail':   f"{description} signature found at offset {idx}",
            'severity': 'CRITICAL' if is_executable else 'HIGH',
        })

Module 3 — Image Steganography

Steganography hides data inside something that doesn't look like it contains data. Three sub-checks run on every image: appended data after the EOF marker, LSB (Least Significant Bit) distribution analysis, and Shannon entropy scoring. Legitimate images have slightly uneven LSB distributions. LSB steganography produces near-perfect 50/50 distribution — statistically detectable.

LSB Distribution Analysis
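lsb_counts = [0, 0]  # occurrences of LSB 0 and LSB 1

for pixel in sample:  # sample: sequence of per-pixel channel tuples, e.g. (R, G, B)
    for channel in pixel:
        lsb = channel & 1  # extract least significant bit
        lsb_counts[lsb] += 1

total = sum(lsb_counts)
ratio_0 = lsb_counts[0] / total if total else 0.0  # share of zero bits
deviation = abs(ratio_0 - 0.5)
# LSB steganography: deviation < 0.02
# Natural images: deviation > 0.05
suspicious = deviation < 0.02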
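
The appended-data sub-check mentioned above is not shown in this log. A minimal sketch for PNG files only, assuming the file is well formed up to its IEND chunk: walk the chunk list, find where IEND ends, and count whatever follows. The function name png_trailing_bytes is hypothetical, and the sketch deliberately does not validate chunk CRCs.

```python
import struct
from typing import Optional

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_trailing_bytes(data: bytes) -> Optional[int]:
    """Return the number of bytes after the IEND chunk.

    Returns None when the data is not a PNG or no complete IEND chunk is found.
    """
    if not data.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack('>I4s', data[pos:pos + 8])
        pos += 8 + length + 4
        if pos > len(data):
            return None  # chunk claims more bytes than the file contains
        if chunk_type == b'IEND':
            return len(data) - pos
    return None


# Build a tiny synthetic PNG (placeholder CRCs) to exercise the check.
ihdr = struct.pack('>I4s', 13, b'IHDR') + b'\x00' * 13 + b'\x00' * 4
iend = struct.pack('>I4s', 0, b'IEND') + b'\xaeB`\x82'
clean_png = PNG_SIGNATURE + ihdr + iend

print(png_trailing_bytes(clean_png))
print(png_trailing_bytes(clean_png + b'PAYLOAD'))
```

Any positive return value means data exists past the image's declared end, which is the condition the scoring table treats as "Appended image data".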

Module 4 — Audio Steganography

WAV files have explicit chunk sizes declared in their RIFF header. Any bytes after the declared chunk boundary are appended data — not part of the audio. This is the exact delivery mechanism used in the TeamPCP/Telnyx supply chain attack, where malware was hidden inside a WAV file named ringtone.wav and extracted in memory.

WAV Chunk Boundary Check
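import struct

if data[:4] == b'RIFF' and len(data) > 8:
    # The RIFF size field excludes the 8-byte 'RIFF' + size header, so add 8
    declared_size = struct.unpack('<I', data[4:8])[0] + 8
    actual_size   = len(data)
    if actual_size > declared_size:
        appended_bytes = actual_size - declared_size
        # shannon_entropy is a helper whose definition is not shown in this log
        entropy        = shannon_entropy(data[declared_size:])
        # Flag as CRITICAL: data exists past audio boundary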
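
The check above calls shannon_entropy, but its definition does not appear in this log. The following is a minimal sketch using the standard byte-level definition of Shannon entropy. The actual helper in document_triage.py may differ in detail.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniformly distributed bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy(b'AAAAAAAA'))        # constant data
print(shannon_entropy(b'AABB'))            # two equally likely symbols
print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Values close to 8 suggest compressed or encrypted content, which is why entropy is a useful signal for bytes found past a declared boundary.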

Module 5 — URL Analysis

URLs extracted from any module feed into a shared analysis pipeline. Each URL is checked against a known-safe allowlist, URL shortener list, suspicious TLD list, and a Levenshtein distance typosquatting check against known legitimate domains. The typosquatting check computes edit distance — a domain two characters different from a known airline or travel service is flagged as a potential impersonation.
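
The typosquatting check described above can be sketched as follows. This is an illustration under stated assumptions, not the code in document_triage.py: KNOWN_DOMAINS, MAX_DISTANCE, MIN_LENGTH, and typosquat_target are hypothetical names, and the threshold values are placeholders rather than the tuned values the tool uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical configuration values for illustration only.
KNOWN_DOMAINS = ['booking.com', 'delta.com', 'zoom.us']
MAX_DISTANCE = 2   # flag domains within this many edits of a known domain
MIN_LENGTH = 6     # very short domains produce too many accidental matches


def typosquat_target(domain: str):
    """Return the known domain this one appears to imitate, or None."""
    domain = domain.lower()
    if domain in KNOWN_DOMAINS or len(domain) < MIN_LENGTH:
        return None
    for known in KNOWN_DOMAINS:
        if 0 < levenshtein(domain, known) <= MAX_DISTANCE:
            return known
    return None


print(typosquat_target('bookinq.com'))
print(typosquat_target('booking.com'))
```

The exact-match guard matters: a known domain has distance zero to itself and must never be reported as its own typosquat.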

Module 6 — QR Code Analysis

QR codes inside images are decoded using pyzbar, a Python wrapper around the ZBar library that reads QR codes and common one-dimensional barcodes such as EAN, UPC, and Code 128. The decoded URL is fed into the URL analysis pipeline. This module exists because of a documented attack pattern: boarding passes with QR codes replaced by codes pointing to credential harvesting pages. The substituted code is visually indistinguishable from the original. Only decoding and analyzing the destination reveals the substitution.

Module 7 — ICS Calendar Analysis

Calendar invites are a persistent delivery mechanism: depending on the mail and calendar client, an invite can be added to the recipient's calendar automatically, and the entry can remain there even if the email itself is quarantined. The ICS module extracts URLs from the LOCATION, DESCRIPTION, ATTACH, and COMMENT fields and checks for meeting platform spoofing: an invite that claims to be a Zoom or Teams meeting but embeds a URL pointing to a different domain.

Meeting Platform Spoof Detection
import re

MEETING_PLATFORMS = {
    'zoom':        'zoom.us',
    'teams':       'teams.microsoft.com',
    'meet':        'meet.google.com',
    'webex':       'webex.com',
    'gotomeeting': 'gotomeeting.com',
}

# domain_part: the domain portion of the URL; url_lower: the full URL, lowercased
for platform, legit_domain in MEETING_PLATFORMS.items():
    if re.search(rf'\b{re.escape(platform)}\b', domain_part) \
       and legit_domain not in url_lower:
        # Remaining finding fields are elided in the original listing
        findings.append({'type': 'MEETING_PLATFORM_SPOOF'})
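
The URL extraction step that feeds this check is not shown in this log. A minimal sketch, assuming the four fields named above and a simple URL regex: RFC 5545 folds long content lines (a line break followed by a space or tab continues the previous line), so the text has to be unfolded first or a URL split across two lines will be truncated. The names URL_FIELDS, URL_PATTERN, and extract_ics_urls are hypothetical.

```python
import re

URL_FIELDS = {'LOCATION', 'DESCRIPTION', 'ATTACH', 'COMMENT'}
# Stop at whitespace, angle brackets, quotes, or a backslash (ICS text escapes such as \n).
URL_PATTERN = re.compile(r'https?://[^\s<>"\\]+', re.IGNORECASE)


def extract_ics_urls(ics_text: str) -> list:
    """Return (property, url) pairs found in the URL-bearing ICS fields."""
    # Unfold: a line break followed by one space or tab continues the previous line.
    unfolded = re.sub(r'\r?\n[ \t]', '', ics_text)
    found = []
    for line in unfolded.splitlines():
        name_part, sep, value = line.partition(':')
        if not sep:
            continue
        # Drop parameters such as ;FMTTYPE=... to get the bare property name.
        prop = name_part.split(';', 1)[0].upper()
        if prop in URL_FIELDS:
            for url in URL_PATTERN.findall(value):
                found.append((prop, url))
    return found


invite = (
    "BEGIN:VEVENT\r\n"
    "SUMMARY:Weekly sync https://ignored.example/x\r\n"
    "LOCATION:https://zoom.us/j/123\r\n"
    "DESCRIPTION:Join here: https://zoom-meet\r\n"
    " ings.example.net/join\\nThanks\r\n"
    "END:VEVENT\r\n"
)
for prop, url in extract_ics_urls(invite):
    print(prop, url)
```

Splitting on the first colon is a simplification: a quoted parameter value that itself contains a colon would need a proper content-line parser.
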
Section 04

Severity Scoring —
How Findings Become a Verdict

Each finding contributes points to a composite severity score. The score determines the final verdict — CRITICAL, HIGH, MEDIUM, or LOW. The thresholds are deliberately conservative: a single CRITICAL finding from polyglot detection (score += 5) puts a file over the HIGH threshold immediately.

Finding Type | Points | Rationale
Polyglot, CRITICAL (EXE/ELF inside file) | +5 | Executable inside document is never legitimate
Polyglot, HIGH (ZIP inside image) | +3 | Unusual but not always malicious
PDF dangerous key (JavaScript, OpenAction, Launch) | +4 | Actively dangerous; no legitimate use in typical docs
PDF structural key (ObjStm, URI, etc.) | +2 | Suspicious context; common in malicious docs
Appended image data | +4 | Data after EOF is almost never legitimate
LSB steganography suspected | +3 | Statistical anomaly; possible hidden payload
Audio appended data | +5 | WAV with data past chunk boundary; confirmed technique
URL risk, CRITICAL | +5 | Typosquat, homoglyph, or confirmed malicious
URL risk, HIGH | +3 | Shortener, suspicious TLD
ICS meeting spoof | +5 | Deliberate platform impersonation
Severity Thresholds
if   score >= 8:  return 'CRITICAL'
elif score >= 4:  return 'HIGH'
elif score >= 1:  return 'MEDIUM'
else:            return 'LOW'

Design choice (conservative thresholds): The scoring is intentionally tuned to prefer false positives over false negatives. A file whose only finding is a suspicious TLD (+3 in the table above, verdict MEDIUM) should prompt review even if nothing else fires. A file with an embedded EXE (+5) is at least HIGH regardless of other findings. For the intended audience, a contractor's employee who receives an unexpected PDF, the cost of a false positive (reviewing a clean file) is far lower than the cost of a false negative (opening malware).
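
To make the arithmetic concrete, here is a minimal sketch that combines the point table and the thresholds above. The point values are copied from the table. The finding-type labels, the POINTS dictionary, and the function names are hypothetical, and the sketch assumes each finding adds its points once, which this log does not confirm for repeated findings.

```python
# Point values from the scoring table; the keys are illustrative labels.
POINTS = {
    'POLYGLOT_CRITICAL':   5,
    'POLYGLOT_HIGH':       3,
    'PDF_DANGEROUS_KEY':   4,
    'PDF_STRUCTURAL_KEY':  2,
    'IMAGE_APPENDED_DATA': 4,
    'LSB_SUSPECTED':       3,
    'AUDIO_APPENDED_DATA': 5,
    'URL_CRITICAL':        5,
    'URL_HIGH':            3,
    'ICS_MEETING_SPOOF':   5,
}


def verdict(score: int) -> str:
    """Map a composite score to a verdict using the thresholds above."""
    if score >= 8:
        return 'CRITICAL'
    if score >= 4:
        return 'HIGH'
    if score >= 1:
        return 'MEDIUM'
    return 'LOW'


def score_findings(finding_types) -> tuple:
    """Sum the points for each finding and return (score, verdict)."""
    score = sum(POINTS[t] for t in finding_types)
    return score, verdict(score)


print(score_findings([]))
print(score_findings(['URL_HIGH']))
print(score_findings(['POLYGLOT_CRITICAL']))
print(score_findings(['PDF_DANGEROUS_KEY', 'URL_CRITICAL']))
```

Under these assumptions a lone embedded executable scores 5 and lands on HIGH, while any second finding of 3 points or more pushes the file to CRITICAL.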

Section 05

The Flask Backend —
Serving Analysis as an API

The Flask layer is intentionally thin. Its job is to accept a file, call the engine, and return JSON. All the intelligence lives in document_triage.py. This separation means the analysis engine can be used directly from the command line, called from other scripts, or exposed through a different interface entirely — the web layer is not load-bearing for the detection logic.

ladon_server.py — The Analyze Endpoint
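import os
import tempfile
from pathlib import Path

from flask import Flask, jsonify, request

# Assumes analyze_file is exported by the engine module, document_triage.py
from document_triage import analyze_file

app = Flask(__name__)


@app.route('/analyze', methods=['POST'])
def analyze():
    file = request.files.get('file')
    if not file:
        return jsonify({'error': 'No file uploaded'}), 400

    # Save to temp file, analyze, discard
    with tempfile.NamedTemporaryFile(
        suffix=Path(file.filename or '').suffix,  # keep the extension; filename may be empty
        delete=False
    ) as tmp:
        try:
            file.save(tmp.name)
            results = analyze_file(tmp.name)
            return jsonify(results)
        finally:
            os.unlink(tmp.name)  # always delete, no persistence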

The finally block is not an afterthought. It guarantees the uploaded file is deleted regardless of whether analysis succeeds or throws an exception. A malicious file that crashes the parser is still deleted. The server never accumulates files.

On Railway deployment: Railway provides automatic deploys from GitHub — every push to main triggers a rebuild. The service runs on a paid plan ($5 deposited) to avoid cold starts on free tier. The URL is kept private during the testing and IP protection phase. Auto-deploy means the analysis engine and backend always stay in sync — no manual deployment steps.

Section 06

Decisions Made Along the Way

Some of the most useful things to document are the decisions that aren't visible in the final code — the things I considered and rejected, and why. These came up repeatedly during the build.

Why not send files to VirusTotal?

The obvious first thought was to integrate VirusTotal. Then I thought about who the tool is for. Files uploaded to VirusTotal are shared with its security partners and premium subscribers, so a defense contractor submitting a suspicious vendor PDF might inadvertently tip off the attacker that their document was analyzed. That disclosure risk is real, and it's the kind of thing that came from reading how analysts actually work rather than from writing code. Static local analysis means the file goes nowhere. Ladon reads it and discards it. Nothing leaves the system.

Why not use a sandbox for dynamic analysis?

Sandboxing is the professional answer for malware analysis. But sandboxing requires executing the file — which means you need an isolated environment, you need to know what you're doing, and you need to accept that something is going to run. The question Ladon is trying to answer is "is this safe to open?" Executing it to find out defeats the purpose. Static analysis answers the structural question without triggering anything.

Why Python and not a compiled language?

Python was Claude's recommendation for this use case and it turned out to be the right one. Speed wasn't the concern — documents are small files and analysis is fast regardless. Python's standard library handles binary parsing, Pillow and pyzbar handle images and barcodes, and the code is readable enough that I can follow what it's doing, spot when something looks wrong, and describe the fix I want clearly. That last part matters — I'm not writing the code independently, but I need to be able to evaluate whether what was written actually does what I intended.

Why a single HTML file for the frontend?

Simplicity. A single HTML file can be opened locally in a browser without a server, deployed anywhere, and inspected by anyone. No build pipeline, no bundler, no dependencies to audit. The frontend is not the interesting part of Ladon — the analysis engine is. Keeping the frontend simple keeps the focus where it belongs.

The export function: The tool generates a plain-text export of every analysis session — filename, timestamp, severity, findings per module. The export format is intentionally simple: readable by a non-technical manager, pasteable into a ticket, attachable to a CMMC audit record. The timestamp is UTC. The footer attributes the export to SiteWave Studio LLC. Every design choice in the export serves the compliance documentation use case.
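
A minimal sketch of what such an export could look like, assuming the fields listed above. The layout, the header and footer wording, and the name build_export are all hypothetical. Only the field list, the UTC timestamp, and the SiteWave Studio LLC attribution come from the description above.

```python
from datetime import datetime, timezone


def build_export(filename, severity, findings_by_module, when=None):
    """Render one analysis session as plain text suitable for a ticket or audit record."""
    # Normalize to UTC so exports from different machines are comparable.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = [
        'LADON ANALYSIS EXPORT',
        f'File:      {filename}',
        f'Timestamp: {when.strftime("%Y-%m-%d %H:%M:%S")} UTC',
        f'Severity:  {severity}',
        '',
    ]
    for module, findings in findings_by_module.items():
        lines.append(f'[{module}]')
        if findings:
            lines.extend(f'  - {finding}' for finding in findings)
        else:
            lines.append('  (no findings)')
    lines += ['', 'Exported by Ladon · SiteWave Studio LLC']
    return '\n'.join(lines)


report = build_export(
    'invoice.pdf',
    'HIGH',
    {'PDF Structure': ['/OpenAction found 1 time'], 'URL Analysis': []},
    when=datetime(2026, 4, 1, 12, 0, 0, tzinfo=timezone.utc),
)
print(report)
```

Listing modules with no findings, rather than omitting them, lets a reviewer see that the check ran and came back clean.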

Section 07

What Was Learned

Reading file format specifications to understand what was designed to be there — and then reading threat reports to understand what attackers put there instead — is where most of the build time went. The PDF spec is 756 pages. WAV chunk boundaries come from a 1991 IBM/Microsoft document. ICS is RFC 5545. None of them were written with attackers in mind, which is exactly why they're useful attack surfaces.

The typosquatting check took the most iterations to get right. The first version Claude wrote flagged too many legitimate domains — a travel booking site with a domain two characters different from "booking.com" is not necessarily a typosquat. I kept testing it against real URLs, flagging what looked wrong, and describing the adjustment needed. Tuning the edit distance threshold and minimum domain length down to something that caught real typosquats without flagging legitimate sites took several rounds of that back-and-forth.

The ICS meeting platform detection had the same problem. The first version checked if a platform keyword appeared anywhere in the URL string — which immediately broke on real-world cases I tested. A Proofpoint-wrapped Zoom URL has "zoom" buried in the destination but the actual domain is Proofpoint. A Calendly-scheduled Teams meeting says "teams" in the description but "calendly.com" in the URL. I found these by testing edge cases, described what was going wrong, and Claude rewrote the check to look at the domain portion only with word boundary matching. That's the pattern throughout the build — I find what's broken, describe what correct behavior looks like, and iterate until the output matches my expectation.
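
The difference between matching on the whole URL and matching on the domain can be sketched as follows. This is an illustration, not the code in document_triage.py: spoofed_platform is a hypothetical name, the platform table is a two-entry subset, and the suffix comparison used here is an assumption that is stricter than the substring test shown in Section 03.

```python
import re
from urllib.parse import urlsplit

# Two-entry subset of the platform table, for illustration.
MEETING_PLATFORMS = {
    'zoom':  'zoom.us',
    'teams': 'teams.microsoft.com',
}


def spoofed_platform(url: str):
    """Return the platform a URL's hostname imitates, or None if nothing looks spoofed."""
    host = (urlsplit(url).hostname or '').lower()
    for platform, legit_domain in MEETING_PLATFORMS.items():
        # Only the hostname is searched, so keywords in the path or query are ignored.
        mentions = re.search(rf'\b{re.escape(platform)}\b', host)
        # Accept the legitimate domain itself or any subdomain of it.
        is_legit = host == legit_domain or host.endswith('.' + legit_domain)
        if mentions and not is_legit:
            return platform
    return None


examples = [
    'https://us02web.zoom.us/j/123',
    'https://urldefense.proofpoint.com/v2/url?u=https-3A__zoom.us_j_123',
    'https://calendly.com/acme/teams-sync',
    'https://zoom.meeting-login.example/j/123',
]
for url in examples:
    print(url, '->', spoofed_platform(url))
```

The Proofpoint-wrapped and Calendly URLs contain a platform keyword only in the path or query, so a hostname-only check leaves them alone, while a hostname that merely starts with a platform name is still flagged.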

The most useful thing about building this way: the tool is only as good as the edge cases you think to test. Synthetic test files tell you the code runs. Real malware tells you the detection logic actually works. That validation is documented in Lab Log 011.


Ladon is a static analysis tool — it reads file structure and content without executing anything. All analysis runs server-side on Railway infrastructure. No files are stored after analysis. The tool is currently in private testing. SiteWave Studio LLC · Built by Yana Ivanov · April 2026