Bayesian approaches to making data-driven digital repository purchasing decisions

(as published in the Canadian Journal of Library Management Science, June 1999)

Library science, in a rush to webify distributed preservation, has focused on the development of network ubiquitous portals. However, little credible work has been carried out concerning the likelihood of demand for artefact retrieval when making decisions about digital archival processes. In this paper we deploy probabilistic models and extend Bayesian thesauri in order to engineer robust sublanguages around common retrieval requests, and use these to make recommendations concerning investments in enterprise-level storage and search systems.

We conclude that the majority (76%, p=0.005) of search terms concern pictures of cats with amusing phrases, specifically including terms such as “cheezburger”, “haz”, “LOL” and “silly humins”, and recommend that library resources be diverted instead to the purchase of kittens, cardigans and gin.

(5 generated phrases:
network ubiquitous portals
deploy probabilistic models
webify distributed preservation
extend Bayesian thesauri
engineer robust sublanguages