Unlocking web archives: LLMs, RAG, and the future of digital preservation

Date

2025-02-28

Authors

Davis, Corey

Abstract

Large Language Models (LLMs) are transforming how research libraries manage digital preservation and access to web archives. This paper examines the potential and challenges of integrating LLMs with Retrieval-Augmented Generation (RAG) to enhance the searchability and usability of Web ARChive (WARC) files. Traditional keyword-based retrieval often falls short in handling the complexity of web archives, necessitating new AI-driven approaches. The study explores WARC-GPT, an open-source tool developed by the Harvard Law Library Innovation Lab, which applies RAG techniques to enable conversational search across web archives. While WARC-GPT demonstrates promise, it also encounters significant hurdles, including noisy data, hallucinations, and computational inefficiencies. To address these issues, the author develops a bespoke RAG pipeline optimized for research library needs, implementing improvements in data preprocessing, chunking strategies, and hardware acceleration. The results highlight the potential for AI-enhanced discovery while underscoring the technical, ethical, and resource-related challenges that libraries must navigate. This paper argues that while AI-driven tools offer new avenues for digital preservation, their successful deployment requires careful design, iterative refinement, and human oversight. The future of AI in research libraries will not replace human expertise but will instead rely on a balanced interplay between automation and curation.

Keywords

Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), web archives, digital preservation, research libraries

Citation

Davis, C. (2025). Unlocking web archives: LLMs, RAG, and the future of digital preservation. University of Victoria Libraries.