Unlocking web archives: LLMs, RAG, and the future of digital preservation

Davis, Corey

Files

davis_corey_unlockingWebArchives_2025.pdf (2.62 MB)

Date

2025-02-28

Authors

Davis, Corey

Abstract

Large Language Models (LLMs) are transforming how research libraries manage digital preservation and access to web archives. This paper examines the potential and challenges of integrating LLMs with Retrieval-Augmented Generation (RAG) to enhance the searchability and usability of Web ARChive (WARC) files. Traditional keyword-based retrieval often falls short in handling the complexity of web archives, necessitating new AI-driven approaches. The study explores WARC-GPT, an open-source tool developed by the Harvard Law Library Innovation Lab, which applies RAG techniques to enable conversational search across web archives. While WARC-GPT demonstrates promise, it also encounters significant hurdles, including noisy data, hallucinations, and computational inefficiencies. To address these issues, the author develops a bespoke RAG pipeline optimized for research library needs, implementing improvements in data preprocessing, chunking strategies, and hardware acceleration. The results highlight the potential for AI-enhanced discovery while underscoring the technical, ethical, and resource-related challenges that libraries must navigate. This paper argues that while AI-driven tools offer new avenues for digital preservation, their successful deployment requires careful design, iterative refinement, and human oversight. The future of AI in research libraries will not replace human expertise but will instead rely on a balanced interplay between automation and curation.

Keywords

Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), web archives, digital preservation, research libraries

Citation

Davis, C. (2025). Unlocking web archives: LLMs, RAG, and the future of digital preservation. University of Victoria Libraries.

URI

https://hdl.handle.net/1828/21379

Collections

Faculty and Staff Publications
