Software · Citation Labs · 3rd co-op (Speed School) · Jan – May 2026

YouTube Citation Extractor API

Detects whether a YouTube video cited by an LLM actually supports the claim it is cited for, or whether the citation is hallucinated.

Role: Backend / NLP (co-op)

Overview

A Python REST API plus batch CLI that analyzes whether YouTube videos cited by LLMs actually support the claims made. Built during my Citation Labs co-op, it runs a multi-stage NLP pipeline — transcript extraction, NER, entity-set comparison, semantic relevance scoring, and local-LLM summarization — to label each citation as grounded, ungrounded, or topic-mismatch. The live demo on Vercel visualizes real production results.

Highlights

REST API (FastAPI) extracts transcript, metadata, and thumbnail per YouTube video; transcription via faster-whisper when subtitles are missing.
Batch pipeline: spaCy NER + rapidfuzz dedup, nomic-embed-text relevance scoring, and Mistral-via-Ollama summarization to label each citation and score its hallucination risk.
Production runs across 812 citations (GPT, Gemini, AI Overview, AI Mode): on every platform, 77–91% of citations were labeled hallucinated, meaning the cited video did not contain the entities or claims the AI discussed.
Static Next.js demo site (Tailwind, Framer Motion, Radix) renders the batch results — no runtime API calls.
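
To illustrate the fuzzy-deduplication idea from the batch pipeline, here is a minimal sketch. It is not the production code: it swaps in the standard library's difflib.SequenceMatcher for rapidfuzz so it runs with no dependencies, and the function names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so casing and spacing do not matter."""
    return " ".join(name.lower().split())


def dedupe_entities(entities: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first spelling of each entity and drop later near-duplicates.

    `threshold` is a hypothetical similarity cutoff in [0, 1], not a value
    taken from the real pipeline.
    """
    kept: list[str] = []
    for ent in entities:
        norm = normalize(ent)
        # An entity is a duplicate if it is close enough to anything already kept.
        is_duplicate = any(
            SequenceMatcher(None, norm, normalize(k)).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(ent)
    return kept


if __name__ == "__main__":
    print(dedupe_entities(["OpenAI", "openai", "Open AI", "Google"]))
```

Keeping the first spelling seen is the simplest policy. A real pipeline might instead keep the most frequent or the longest variant.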

The finding

Across 812 citations spanning GPT, Gemini, AI Overview, and AI Mode, the vast majority of cited YouTube videos do not contain the entities or claims the AI discussed. Every platform falls in the 77–91% hallucination range, so no platform is substantially more reliable than the others.
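
A per-platform rate like the ones above can be computed from the labeled rows with a simple count. The sketch below assumes a citation counts as hallucinated whenever its label is anything other than "grounded" (so both "ungrounded" and "topic-mismatch" count). That rule, the function name, and the toy rows are illustrative assumptions, not the production data or logic.

```python
from collections import Counter


def hallucination_rates(rows: list[tuple[str, str]]) -> dict[str, float]:
    """Return the fraction of non-grounded citations per platform.

    Each row is (platform, label). Treating every label other than
    "grounded" as hallucinated is an assumption made for this sketch.
    """
    totals: Counter[str] = Counter()
    flagged: Counter[str] = Counter()
    for platform, label in rows:
        totals[platform] += 1
        if label != "grounded":
            flagged[platform] += 1
    return {platform: flagged[platform] / totals[platform] for platform in totals}


if __name__ == "__main__":
    # Toy rows for illustration only; these are not real results.
    toy_rows = [
        ("GPT", "grounded"),
        ("GPT", "ungrounded"),
        ("Gemini", "topic-mismatch"),
        ("Gemini", "ungrounded"),
    ]
    print(hallucination_rates(toy_rows))
```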

Pipeline

CSV of citations → per-video transcript extraction (yt-dlp / Whisper) and ASR normalization → spaCy NER with fuzzy deduplication → entity-set comparison (supported / missed / unsupported) → embedding-based relevance score → local-LLM summarization and gap analysis → categorical label + hallucination score → executive CSV, full CSV, per-row JSON, and SQLite output.
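
The middle stages of the pipeline (entity-set comparison, relevance score, categorical label) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production logic: I assume "supported" means claim entities found in the transcript, "unsupported" means claim entities absent from it, and "missed" means transcript entities the claim never mentioned. The 0.5 thresholds and all function names are hypothetical, and the embedding vectors are plain lists of floats standing in for nomic-embed-text output.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compare_entities(claim: set[str], video: set[str]) -> dict[str, set[str]]:
    """Split entities into supported / unsupported / missed (assumed definitions)."""
    claim_l = {e.lower() for e in claim}
    video_l = {e.lower() for e in video}
    return {
        "supported": claim_l & video_l,    # in the claim and in the transcript
        "unsupported": claim_l - video_l,  # in the claim, absent from the transcript
        "missed": video_l - claim_l,       # in the transcript, not in the claim
    }


def label_citation(
    comparison: dict[str, set[str]],
    relevance: float,
    relevance_floor: float = 0.5,  # hypothetical threshold
    support_floor: float = 0.5,    # hypothetical threshold
) -> str:
    """Map entity overlap and relevance to one of the three labels."""
    n_claim = len(comparison["supported"]) + len(comparison["unsupported"])
    support_ratio = len(comparison["supported"]) / n_claim if n_claim else 0.0
    if relevance < relevance_floor:
        return "topic-mismatch"  # the video is about something else entirely
    if support_ratio >= support_floor:
        return "grounded"
    return "ungrounded"          # on topic, but the claimed entities are not there


if __name__ == "__main__":
    cmp_result = compare_entities({"Tesla", "Elon Musk"}, {"tesla", "Ford"})
    print(cmp_result, label_citation(cmp_result, relevance=0.8))
```

Checking relevance before entity overlap keeps "topic-mismatch" distinct from "ungrounded": a video can be on topic and still lack the specific entities the claim depends on.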