Issues in Science and Technology Librarianship
Winter 2016

Short Communications

Preserving the Digital Record of Science and Engineering: the Challenge of New Forms of Grey Literature

Linda Musser
Earth and Mineral Sciences Library
Pennsylvania State University
University Park, Pennsylvania


Research communications today are largely conducted by digital means. At this time, however, only a small percentage of these digital communiques are archived and preserved for future use. This article provides an overview of the challenge of this digital grey literature, a brief overview of digital archiving, and the role librarians and researchers can play in facilitating the availability and accessibility of these resources for future generations.


Communication today is primarily in digital format. From the research literature published in electronic journals to e-mail, a greater share of communication is conducted in digital form than ever before. At this time, however, only a small percentage of these digital communiques are archived and preserved for future use. In some respects this is not a new phenomenon. Prior to the digital revolution, print communications such as correspondence and working papers were generally not preserved, nor was most audio, such as telephone and personal conversations. Only the most important communiques and decisions were stored in print -- final reports, patents, budget documents, and other summative publications.

In our current era, e-mail has largely replaced audio conversations and working 'papers' are now in digital form. With so much digital material, it is easy to lose track of the summative documents and most important communiques in all the clutter. In the past, it was possible to retrieve a file of correspondence from the filing cabinet or storage box, regardless of whether the persons involved remained with the organization. Today, one is less likely to be able to do so, given that most of our digital footprint is tied to individual accounts, which are typically erased upon that individual's departure. Even if an individual remains with the organization, software migrations, changes in file formats, and calls from network managers to clean up and delete old files from servers add to the challenge of retrieving old project documents and correspondence. Organizations with archivists or records managers generally do a good job of informing personnel and managing intellectual property content; however, the bulk of the responsibility for identifying and preserving appropriate content resides with the individual.

Even more problematic to preserve are resources based in the cloud and on the open Internet, including enterprise documents, web pages, social media sites, and communication media such as blogs and mailing lists.

Digital Archiving

Digital archiving activities can be categorized in a variety of ways. It can be helpful to treat the topic in terms of preservation of internal digital resources versus those external to the organization. The former is largely the realm of the records manager and archivist within the organization. The latter is exemplified by the WayBack Machine, a service that crawls the web and periodically takes 'snapshots' of web sites. This activity is sometimes called web archiving, a phrase that can also encompass the archiving of internally created web content.
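For readers who want to check whether a given page has already been captured, the snapshot idea can be illustrated with a short script. The following is a minimal sketch, not a description of how any library described here works; it assumes the Internet Archive's public availability endpoint and its JSON response layout (an `archived_snapshots` object containing a `closest` entry), and the function names are hypothetical, chosen for illustration only.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Internet Archive's availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(page_url, timestamp=None):
    """Build the query URL asking for the snapshot closest to a date.

    timestamp is an optional string such as "20160101" (YYYYMMDD).
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (snapshot_url, timestamp) from a JSON response, or None.

    Assumes the response layout described in the lead-in; returns None
    when no snapshot is reported as available.
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]
```

To query the live service, the URL returned by `build_availability_query` can be fetched with `urllib.request.urlopen` and the decoded body passed to `closest_snapshot`. Keeping the network call separate from the parsing makes the sketch easy to check without an Internet connection.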

The practice of web archiving began in 1996 with activities by some national libraries and the Internet Archive, host of the WayBack Machine (Niu 2012). (The Internet Archive is a non-profit organization with the goal of building a digital library of Internet sites.) As more organizations became interested and involved, the International Internet Preservation Consortium was formed in 2003 to facilitate collaboration and develop tools and standards related to archiving of web content. As was originally the case, many of the major players involved in web archiving are national libraries such as the Library of Congress and the National Library of New Zealand. Most of these libraries focus on archiving web-based materials related to their own country. New Zealand's policy is representative of many others. Per the National Library's web harvesting page:

"Web harvesting is a term used by the National Library to describe the selecting, copying and archiving of websites found on the internet. The collection of New Zealand websites is covered by Legal Deposit legislation (National Library of New Zealand Act 2003, Part 4)...If you know of a New Zealand or Pacific website you feel should be in the collection do let us know. If you're the copyright owner of a Pacific Island website or a New Zealander publishing a website overseas, you are most welcome to nominate your own site. This will assist us in the permissions process, as overseas websites are not covered by New Zealand Legal Deposit legislation" (National Library of New Zealand).

In addition to archiving various governmental sites, the Library of Congress builds subject-based collections related to events such as elections and the Iraq War. Beyond these organizations, there are relatively few players, although academic institutions are becoming more involved. Columbia University Library uses the Archive-It service not only to capture its own web domain but also to collect digital objects related to subjects of particular interest, such as New York City (Archive-It 2013). The University of Iowa has also used the Archive-It service to build a collection of Internet resources related to the 2008 flood of its campus (University of Iowa Libraries).

One of the keys to greater involvement in web archiving has been access to technologies to automate the process of crawling the Internet. Services such as Archive-It, Smarsh, and Reed Tech Web Archiving offer various options to facilitate the creation of digital collections. New tools, particularly those that focus on social media, continue to be developed (Hedges 2015). Some of these developments are in response to regulatory requirements but, as markets and researchers become more aware of the potential uses of this content, interest is increasing as is the pace of development of new tools.

Role of the Scientist and Engineer

Scientists and engineers have a role to play in preserving appropriate records of their work, in forms that are accessible to future users. As such, it is useful to be familiar with basic rules for documentation such as those currently being urged for use with data (DataOne). In terms of e-mail, the organization may routinely back up and archive e-mail using the organizational domain and retain or destroy it according to the records management schedules. Ownership of that mail generally resides with the employer; hence it is good practice to keep personal and business e-mail in separate accounts. For many, their employer, be it a government agency or private enterprise, will have guidelines for what items to preserve. The National Aeronautics and Space Administration and the Pennsylvania State University provide representative examples of such guidelines (National Aeronautics and Space Administration; Pennsylvania State University).

Role of the Science/Engineering Librarian

Librarians have similar responsibilities in terms of preserving appropriate records of their work, in formats accessible to future users. As information professionals, librarians can advise users regarding preservation methods and actively promote good practices, including recognition of the value of new forms of grey literature such as e-mail and electronic lab notebooks. In addition, librarians can leverage their expertise in resource selection to build specialized collections of digital works related to particular events or topics that will be of interest to future researchers. One illustration of this is the Japan Earthquake collection developed by Virginia Tech, which consists of thousands of pages (web sites, blogs, mass media, etc.) describing the events surrounding the 2011 earthquake and tsunami in Japan (Virginia Tech Crisis Tragedy and Recovery Network). Other subjects that could be deserving of attention include big science or engineering projects, such as the Keystone XL Pipeline project, or projects of local/regional interest, such as the Marcellus shale gas boom (Bohn 2015).


In his 2009 article, Banks argues that traditional grey literature is much more findable than in the past, and urges readers to concentrate on preserving the ephemeral grey literature characterized by blog posts and tweets (Banks 2009). Preserving these and other digital formats -- the Web 2.0 grey literature -- will require forethought on the part of scientists, engineers, and librarians. Scientists and engineers need to be proactive in making the digital records of their work identifiable and accessible to future users. Efforts to manage electronic lab notebooks are a good example of progress in this area (Bird et al. 2013). Librarians need to become more familiar with web archiving tools in order to assist archivists and others endeavoring to identify appropriate, non-traditional resources, i.e., grey literature, to preserve. As described by Fleming et al., social media is hugely influential, yet digital formats such as blogs, tweets, e-mail, and web pages do not easily conform to traditional library acquisitions methods (Fleming et al. 2009). From technical reports to specifications, science and engineering librarians have long appreciated the value of grey literature. It is time to extend that appreciation and attention to the newest forms of grey literature in digital form.


