CSPreprints.Dev

Preprint

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

by Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

CSPreprints.Dev preprint, version posted May 15, 2026.

May 15, 2026Computer Science

Abstract

Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's-eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost tradeoff. A ten-subset generalization study further shows that corpus navigation is not a universal replacement for retrieval: it consistently helps on single-domain corpora with a recoverable topical taxonomy, but flat retrieval remains preferable on open-domain factoid pools or homogeneous-tabular corpora that defeat top-level clustering. We characterize this scope distinction and discuss it as a design guideline for knowledge-grounded systems. Code is available at https://github.com/dukesun99/Corpus2Skill.

Keywords

RAGLLM AgentsAgent Skills