Digital Preservation in Special Collections: Processing Digital Records

Documentation of digital preservation efforts in UB Libraries Special Collections.
Last Updated: Jan 16, 2020 10:47 AM

Introduction

Archival processing occurs over several phases. Steps include communicating with donors, accessioning, gathering information about the records, reviewing the records for preservation needs and sensitive information, establishing an arrangement, describing the records, and providing access. There are many degrees to which archivists process records, some of which are more intensive than others, but all of which should be done well. Digital records require a broad approach due to the large quantity of files in any given accession. UB Special Collections has chosen to adapt and reuse the Digital Processing Framework as a guideline for processing our born-digital collections. This framework, developed by ten archival practitioners after a meeting at the 2016 Born Digital Archiving eXchange unconference, is adaptable, reusable, and assists processors in the various stages of born-digital processing.

Broadly defined processing steps are outlined below. Processing checklists for minimal processing, moderate processing, and intensive processing are also available for download. These three processing tiers correspond to levels outlined in Guidelines for Efficient Archival Processing in the University of California Libraries and UB Special Collections' Processing and Description LibGuide.

Working with Donors

When dealing with digital records, donors may require extra guidance. The Donor Guide gives donors thorough step-by-step instructions to ensure successful transfer of their records. The guide covers the initial collection review, privacy issues, copyright and intellectual property, the deed of gift, preparing digital records for transfer, preparing email for transfer, transferring digital records, and completing the donation. Corresponding staff instructions can be found below.

  • Share the Donor Guide and walk through the checklist carefully
  • Clarify expectations regarding digital preservation and access
  • Ask the donor the following questions to gain a broad understanding of the digital records
    • Location of computer workstations (on campus or at home)
    • Removable storage media (hard drives, USB flash drives, CDs, etc.)
    • Approximate date range of files
    • File types
    • Approximate quantity
  • Conduct an initial survey of born-digital materials:
    • Determine the appropriate survey method by consulting with the donor
    • Conduct onsite surveys (in person) when possible
  • Before acquisition, ask if the collection includes legally protected files, such as educational, medical, financial, criminal, attorney-client, and personnel records
  • Inform the donor that the presence of sensitive information about individuals other than the donor or creator may limit access to their own files
  • Make clear who will screen email messages for sensitive information and what the process will be. *Refer to Section VIII. Working with Email for further instruction*
  • Ask the donor to complete the Privacy checklist in the Donor Guide
  • Ask the donor whether the donor’s files contain other people’s intellectual property
  • Discuss any rights held by individuals other than the donor or creator that would limit Special Collections’ ability to provide researchers with access to the files
  • Discuss the methods by which Special Collections will provide researchers with access to the digital files
  • Identify which born-digital materials are to be offered to Special Collections.
  • Ask the donor to document the ways in which digital media and files have been stored, accessed, and transported prior to their arrival or collection by Special Collections.
  • Complete the checklist below with the donor. Indicate which file types are included in the donation:
    • Audio
    • Raster images
    • Databases
    • Raw Camera images
    • Email* (see additional information below)
    • Spreadsheets
    • Open Office
    • Vector images
    • Plain Text
    • Video
    • Portable Document Format
    • Word Processing files
    • Presentation files (e.g., PowerPoint)
    • Other: __________________________

The deed of gift acknowledges transfer of ownership to Special Collections.

  • Review the Deed of Gift
    • Transfer of copyright
    • Permission to make preservation and access copies
    • Permission to display online
    • Disposition of duplicate or unneeded materials
    • Disposition of computer hardware, removable media and files not retained
  • The donor and Special Collections staff representative sign the Deed of Gift
  • Provide a copy of the signed deed to the donor

Digital Processing Checklists

Storage Media Inventory

Donors may choose to transfer digital records via removable storage media. If so, before Special Collections staff copy the digital records from the physical media, it is important to collect as much information as possible about the storage media itself.

  • Record the following information in a media inventory spreadsheet. Name this file [accession number]-media-inventory.xlsx. Save this spreadsheet in the accession file. 
    • Accession number - the accession number assigned to the collection 
    • Unique ID – a unique ID assigned to each piece of storage media (e.g., accession number followed by a sequential number: 2017-001-001; 2017-001-002)
    • Label information – a transcription of any pre-existing labels on the media
    • Date(s) -  The creation date or date range of the files stored on the media
    • Storage media type - The type of media
    • Storage capacity – maximum storage capacity of the media, expressed in gigabytes
    • Physical dimensions – dimensions of the media expressed height x width 
    • Manufacturer – manufacturer of the media 
    • Model/Series - if known, the model and series of the media 
    • External parts – list of any other parts that came with the media (cables, chargers, etc.)
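
The inventory fields above can be sketched as a small script. This is a minimal illustration rather than Special Collections' actual tooling: the helper names (`media_unique_id`, `inventory_filename`, `write_inventory`), the three-digit padding, and the sample values are all assumptions. It also writes CSV instead of .xlsx so that it needs only the Python standard library; the resulting file can be opened in Excel and saved as [accession number]-media-inventory.xlsx.

```python
import csv
import io

# Column headings taken from the field list above.
MEDIA_INVENTORY_FIELDS = [
    "Accession number",
    "Unique ID",
    "Label information",
    "Date(s)",
    "Storage media type",
    "Storage capacity (GB)",
    "Physical dimensions (H x W)",
    "Manufacturer",
    "Model/Series",
    "External parts",
]


def media_unique_id(accession_number, sequence, width=3):
    """Build an ID such as 2017-001-002 (accession number + sequential number).

    The zero-padding width is an assumption; follow local practice.
    """
    return f"{accession_number}-{sequence:0{width}d}"


def inventory_filename(accession_number, extension="csv"):
    """Follow the [accession number]-media-inventory naming pattern."""
    return f"{accession_number}-media-inventory.{extension}"


def write_inventory(rows, stream):
    """Write rows (dicts keyed by MEDIA_INVENTORY_FIELDS) as CSV; blanks are allowed."""
    writer = csv.DictWriter(stream, fieldnames=MEDIA_INVENTORY_FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Example: one hypothetical USB flash drive in accession 2017-001.
example_row = {
    "Accession number": "2017-001",
    "Unique ID": media_unique_id("2017-001", 1),
    "Label information": "Lecture notes",
    "Storage media type": "USB flash drive",
    "Storage capacity (GB)": "16",
}
buffer = io.StringIO()
write_inventory([example_row], buffer)
print(inventory_filename("2017-001"))
print(buffer.getvalue())
```

Fields that are unknown at the time of inventory are simply left blank and can be completed later.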

Related Tools

An important part of working with digital records is ensuring that they are free of viruses, especially because these records will be placed in long-term preservation storage. Special Collections runs virus checks on all donated digital records with Symantec Endpoint Protection.

  1. Open Symantec Endpoint Protection
  2. In the sidebar, click Scan for threats
  3. Click Create a New Scan > Custom Scan
  4. Specify the folder containing the working copy files
  5. In the Create New Scan – Scan Option dialog box, select All File Types, uncheck the remaining fields
  6. In the Create New Scan – When to Run dialog box, click On Demand
  7. Click Next
  8. In the Create New Scan – Scan Name dialog box, type the name of the accession
  9. Click Finish.
  10. Under Scan Name, right click the newly created scan and select Run
  11. Should any viruses or threats be detected, stop the workflow and reassess the donation 

Instructions for creating a file manifest with DROID

  1. Open the DROID application; a new profile will be created automatically
  2. Select Add and browse to the digital records in the accession folder
  3. Select Start; this will scan the directory and run the file identification process
  4. Expand the directory to view file details (folder names, file names, extension, size, last modified date, format, PUID, hash value)
  5. Select Export; a new window will open. Choose the profile and select Export profiles...
  6. Under Encoding select .csv, name this file [accession number]-file-manifest.csv, and save it in the accession folder

Sometimes files are missing extensions or the extension does not match the file's actual format. When this occurs, DROID will issue a file extension mismatch warning.

  1. Use the LocateOpener application to identify and append missing file extensions
    • Right click on files missing extensions or on files with unfamiliar extensions
    • Select LocateOpener from the menu
    • If the file has an unknown extension, select “Scan with TrID”
    • A new window will recommend an appropriate file extension
    • Select “Append extension”
    • Rescan the accession to create an updated file inventory with amended file extensions
  2. Using the inventory, identify archive formats (gzip, bzip, wbzip2, zip, jar, tar) and unzip them prior to ingest into Preservica
  3. If unzipping a package that contains one or more mbox files, copy the mbox file(s) to the appropriate location in the accession folder, then delete the unzipped folders
  4. Remove empty directories
  5. Remove .DS_Store and Thumbs.db files
  6. If file extensions required amendment and/or archive formats were unzipped, create an updated file inventory by following the inventory instructions above. Replace the file-inventory.csv file with the updated version
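
The clean-up steps above can be previewed with a short script before anything is changed. The sketch below only reports what it finds and deletes nothing. The function name `survey_accession`, the extension-to-archive mapping, and the assumption that the Windows thumbnail cache is named Thumbs.db are illustrative choices, not part of the documented workflow.

```python
import os
import tempfile
from pathlib import Path

# Operating-system files the clean-up step targets (Thumbs.db is an assumption).
SYSTEM_FILES = {".DS_Store", "Thumbs.db"}
# Assumed extensions for the archive formats named in the steps above.
ARCHIVE_EXTENSIONS = {".gz", ".bz2", ".zip", ".jar", ".tar"}


def survey_accession(root):
    """List archives, system files, and empty directories under root (no deletion)."""
    report = {"archives": [], "system_files": [], "empty_dirs": []}
    for dirpath, dirnames, filenames in os.walk(root):
        here = Path(dirpath)
        if not dirnames and not filenames:
            report["empty_dirs"].append(here)
        for name in filenames:
            path = here / name
            if name in SYSTEM_FILES:
                report["system_files"].append(path)
            elif path.suffix.lower() in ARCHIVE_EXTENSIONS:
                report["archives"].append(path)
    return report


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "empty").mkdir()
    (base / "docs").mkdir()
    (base / "docs" / ".DS_Store").write_text("")
    (base / "docs" / "mail.zip").write_text("")
    (base / "docs" / "notes.txt").write_text("hello")
    demo = survey_accession(base)
    for kind, paths in demo.items():
        print(kind, sorted(p.name for p in paths))
```

A folder that holds nothing but a .DS_Store file becomes empty only after that file is removed, so it is worth running the survey a second time after clean-up.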

As with paper records, donors may unknowingly transfer collections that contain duplicate files. Duplicate files are identified by comparing each file’s checksum. Files with the same content have the same hash value.

  1. In Excel, open the file inventory .csv file
  2. Highlight the HASH column
  3. Select the Home section > Conditional Formatting > Highlight Cells Rules > Duplicate Values
  4. Choose a highlight color
  5. Excel will highlight all duplicate values
  6. Return to the working copy files in the accession folder
  7. Locate the duplicate files and move the files to a new Duplicates folder in the accession file
  8. Retain this folder until Preservica ingest is complete
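
The same comparison can be sketched outside Excel. The example below groups manifest rows by hash value and reports every path that shares a hash with another file. The column names `HASH` and `FILE_PATH` are assumptions: depending on the DROID version and settings, the exported header may carry the algorithm name instead, so check the header row of the actual .csv file and pass the right names.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(rows, hash_column="HASH", path_column="FILE_PATH"):
    """Return {hash: [paths]} for every hash value shared by two or more files."""
    groups = defaultdict(list)
    for row in rows:
        digest = (row.get(hash_column) or "").strip().lower()
        if digest:  # folders have no hash value, so skip blank cells
            groups[digest].append(row[path_column])
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


# A tiny made-up manifest standing in for the exported .csv file.
sample_manifest = """FILE_PATH,HASH
accession/a.txt,aaa111
accession/copy of a.txt,AAA111
accession/b.txt,bbb222
accession/folder,
"""

duplicates = find_duplicates(csv.DictReader(io.StringIO(sample_manifest)))
for digest, paths in duplicates.items():
    print(digest, paths)
```

Like Excel's Duplicate Values rule, this lists every copy, including the one that will be kept; deciding which copy stays in the accession folder remains a judgment for the processor.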

As with traditional paper collections, digital records may contain sensitive or confidential information that is protected under federal or state “right to privacy” laws, including but not limited to certain education, medical, financial, criminal, attorney-client, and personnel records. Special Collections staff take care to identify and, in some cases, remove personally identifiable information (PII) found within all archival collections.

Special Collections uses a tool called Identity Finder to search for and identify PII in digital collections.

  • Run the Identity Finder Enterprise software on the working copy files
  • Conduct this step even if donors say there is no PII in their files
  • Open Identity Finder Enterprise software and log in
  • Under “Start” select “Start Search Wizard”
  • Under “Anyfind Searching” select the personally identifiable information (PII) you want to search for. The default types are: Social Security Numbers, Credit Card Numbers, Password Entries, Bank Account Numbers, Drivers Licenses, Passport Numbers, and Health Information
  • Select “Next”
  • Select the Locations (usually “Files and Compressed Files” and “E-Mails and Attachments”)
  • Under “File Locations” browse to the custom location. This is the accession folder or specific folder(s) in the accession folder you want to search
  • Review and confirm the selection, then select “Finish”
  • Select “Save As” to generate a report. This report contains the file path (location), date modified, size, PII category and match, number of hits, and classification information of the files that contain PII
  • Add two additional columns to the end of the report: Action and Ingest
  • Record any action taken on these files in these fields
  • Save this report as an Excel file in the accession folder
  • Return to Identity Finder and review the files that contain PII
  • In consultation with the processing archivist, determine whether or not these files need action
  • Under the Action menu, select “Scrub” or “Shred”*

Scrubbing (Redacting) – When a file contains sensitive information and you wish to keep the file but remove the sensitive information only, you should utilize the scrub feature.

Shred – When a file contains sensitive information and you wish to remove the file from the accession, you should utilize the shred feature. Using the shred feature will delete the entire file.

  • Update the Excel report file with all actions taken, including if files were skipped because the information was determined to not be PII. Also note if the file was re-ingested into Preservica**

*Identity Finder only allows actions on certain file types and versions. For a complete list, click here. It may be necessary to migrate files to newer versions or different formats before taking action. Always work on a working copy file, never on the original.

**If PII is identified after ingest into Preservica, download the file(s), follow steps 1-11 above, then re-ingest the files using the SIP Creator, and finally, delete the original un-redacted files in Preservica.
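
The report bookkeeping described in this section (adding Action and Ingest columns to the end of the saved report) can be sketched as follows. This assumes the report has been saved or exported as CSV; the sample column names `Location` and `Match` are hypothetical stand-ins for whatever headings the Identity Finder report actually uses, and `add_tracking_columns` is an illustrative helper name.

```python
import csv
import io

# The two columns the workflow adds to the end of the report.
EXTRA_COLUMNS = ["Action", "Ingest"]


def add_tracking_columns(report_text):
    """Return the CSV text with empty Action and Ingest columns appended."""
    reader = csv.DictReader(io.StringIO(report_text))
    fieldnames = list(reader.fieldnames or [])
    fieldnames += [name for name in EXTRA_COLUMNS if name not in fieldnames]
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # missing Action/Ingest cells are filled with ""
    return output.getvalue()


# Hypothetical two-column report with a single hit.
sample_report = "Location,Match\naccession/tax.pdf,SSN\n"
updated_report = add_tracking_columns(sample_report)
print(updated_report)
```

Running the function on a report that already has the two columns leaves it unchanged, so it is safe to reapply after a rescan.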


Preservation

Special Collections uses Preservica, a suite of OAIS (Open Archival Information System) compliant workflows, to manage our digital records. This system assists with the accessioning, processing, and preservation of our digital materials.

Special Collections’ primary strategy for preservation is to normalize files for preservation and presentation. The Digital Preservation Coalition defines digital preservation as “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary…and refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological and organisational change.”[1] The DPC defines migration as “a means of overcoming technological obsolescence by transferring digital resources from one hardware/software generation to the next. The purpose of migration is to preserve the intellectual content of digital objects and to retain the ability for clients to retrieve, display, and otherwise use them in the face of constantly changing technology. Migration differs from the refreshing of storage media in that it is not always possible to make an exact digital copy or replicate original features and appearance and still maintain the compatibility of the resource with the new generation of technology.”[2]

The workflow for ingesting and preserving digital records in Preservica varies depending on a number of factors: the types of files in the accession, the size and arrangement of the accession, and the means of physical transfer. For example, donors transferring their files remotely by web application use a different upload tool than Special Collections staff who have received a donation of digital records via removable storage (e.g., a thumb drive). Donations of large quantities of digital records are uploaded to Preservica in batches to mitigate the time it takes to run normalization workflows on the entirety of the donated materials. Depending on the type of file, run normalization workflows either during or after ingest into Preservica.
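
One way to plan batches is to group files in order until a size limit is reached. The sketch below is illustrative only: the source does not specify a batch size, so `max_bytes` is a hypothetical parameter, and `batch_by_size` is not a Preservica function.

```python
def batch_by_size(files, max_bytes):
    """Group (name, size_in_bytes) pairs, in order, into batches under max_bytes.

    A single file larger than max_bytes is placed in a batch of its own.
    """
    batches = []
    current = []
    current_size = 0
    for name, size in files:
        # Start a new batch when adding this file would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Made-up file sizes, with a hypothetical limit of 8 bytes per batch.
example_files = [("a.tif", 4), ("b.tif", 4), ("c.tif", 4), ("d.mov", 10)]
example_batches = batch_by_size(example_files, max_bytes=8)
print(example_batches)
```

Keeping files in their original order means each batch maps to a contiguous part of the accession's arrangement, which makes it easier to track what has already been uploaded.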

[1] http://www.dpconline.org/handbook

[2] http://www.dpconline.org/handbook/glossary#M

Special Collections staff have two tools to choose from when uploading accessions to Preservica. The first, the Preparation and Upload Tool (PUT), prepares accessions and generates the required XML needed for a Preservica ingest.

The second tool is the Upload Wizard. Record producers (donors) submit content to Preservica using this standalone Java desktop application. Special Collections staff use this tool when performing an in-person transfer of a donor’s records from the donor’s on-campus workstation. The Upload Wizard can also be used when ingesting unpacked packages into Preservica (i.e. zip, iso, mbox, cmp, or pst).

  • Choose the appropriate Preservica upload tool
  • If a remote transfer by donor is planned, provide the donor with Upload Wizard instructions and corresponding metadata

Note: Prior to starting these steps the processor should have created a working copy of the files/folders, inventories of the files, and file-level and folder-level metadata XML files.


Collect and record the following information about the folder(s)/file(s) you are preparing for upload:
  • Size (in GB)
  • Number of files
  • Folder title(s)
  • File date(s): both creation and modification
  • File type(s)

Note: much of this information is located in the file manifests and metadata files generated during earlier steps in processing.
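
Where the manifests are not at hand, most of the figures listed above can be gathered directly from the working copy. The sketch below is a minimal example under several assumptions: `summarize_folder` is an illustrative name, size is reported in decimal gigabytes, and only modification dates are collected because file creation dates are not exposed consistently across operating systems (take those from the file manifest instead).

```python
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def summarize_folder(root):
    """Collect size, file count, folder titles, modified-date range, and file types."""
    root = Path(root)
    total_bytes = 0
    file_count = 0
    folders = []
    modified = []
    file_types = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        here = Path(dirpath)
        if here != root:
            folders.append(here.relative_to(root).as_posix())
        for name in filenames:
            info = (here / name).stat()
            total_bytes += info.st_size
            file_count += 1
            modified.append(info.st_mtime)
            file_types[Path(name).suffix.lower() or "(no extension)"] += 1

    def to_date(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()

    return {
        "size_gb": total_bytes / 1_000_000_000,  # decimal gigabytes (an assumption)
        "size_bytes": total_bytes,
        "file_count": file_count,
        "folders": sorted(folders),
        "modified_range": (to_date(min(modified)), to_date(max(modified))) if modified else None,
        "file_types": dict(file_types),
    }


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "correspondence").mkdir()
    (base / "correspondence" / "letter.txt").write_text("hello")
    (base / "budget.csv").write_text("a,b")
    summary = summarize_folder(base)
    print(summary)
```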

Preservica offers multiple migration workflows that allow users to select at-risk formats and their target preservation and presentation formats. Using transformation pathways, Preservica migrates files from the original format to the resulting format (preservation or presentation). Preservica retains the original file to support future preservation needs.

The following instructions are intended to guide Special Collections staff through preservation workflows in Preservica. These steps will identify at-risk file formats, select appropriate target preservation formats, choose correct migration pathways, and perform file migration.

Identifying At-Risk File Types

The first step in preserving digital records is to identify at-risk file types. File formats and PUIDs were identified during the preservation assessment prior to ingest. Use the information from the file inventory report and the Sustainable Formats table to identify at-risk files and appropriate preservation formats. PUIDs can also be identified in Preservica by right clicking on the file and selecting Properties > Technical > PUID.

File Normalization

Preservica offers multiple migration workflows.

  • Right click on the target folder or asset
  • Select Actions > Create new Representation
  • Choose "Select Representation Type"
  • Choose the "Representation Type to Create" and select "Continue"
  • Choose "Select Business Rules"
  • Choose the appropriate migration pathway for target assets and select "Continue"

Review the file formats identified by DROID and make note of at-risk file types. Identify the target preservation formats, access formats, and pathways if running normalization workflows on ingest into Preservica. Use the Sustainable Formats table seen below and the Preservica Format Registry. Record this information in the file extensions report.

Sustainable Formats

Media Type | File Formats | Preservation Formats | Access Formats
--- | --- | --- | ---
Archive | gzip, bzip, wbzip2, zip, jar, tar | Content of the archive is converted according to appropriate preservation format | Content of the archive is converted according to appropriate preservation format
Audio | ac3, aiff, mp3, wav, wma, midi, xmf, ogg, flac | wav | mp3
Databases | mdb, dat | siard | original format
Email | pst, mbox, nsf | eml | PDF/A, eml
Open Office XML | docx, pptx, xlsx | original format | original format
Plain Text | txt, csv, files containing ASCII or MIME data | original format | original format
Portable Document Format | pdf | PDF/A* | original format
Presentation files | ppt, pptx, odp | original format | PDF/A
Raster images | jpg, png, tiff, jp2, bmp, gif, pct, psd, tga | tiff, JPEG 2000 | jpg
Raw Camera images | raw, cr2, arw, dcr, mrw, nef, orf, pef, 3fr, x3f, crw, dng, erf, kdc, raf | original format | jpg
Spreadsheets | xls | xls, ods | original format
Text (mark-up) | htm, xml | PDF/A**, xml, original format | PDF/A, xml
Vector images | ai, eps, svg | svg | svg, PDF/A
Video | avi, flv, mov, mpeg-1, mpeg-2, mpeg-4, swf, wmv, mj2, mxf, dv | mp2 | mp4
Websites | htm/html, asp | xhtml, PDF/A** | xhtml, PDF/A**
Website archive | warc, arc | warc | warc
Word Processing files | doc, wpd, odt, rtf | original format, odt | PDF/A

*Preservica does not currently support PDF to PDF/A migration
**Review PDF/A files transformed from htm to ensure readability and that there is no loss of content
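
Looking up target formats from the Sustainable Formats table can be sketched as a small lookup. The example below encodes only a handful of rows for brevity, and the names `FormatRule`, `RULES`, and `lookup` are illustrative. Because some extensions appear in more than one row (pptx is listed under both Open Office XML and Presentation files), the lookup returns every matching row and leaves the choice to the processor.

```python
from typing import NamedTuple


class FormatRule(NamedTuple):
    media_type: str
    extensions: tuple
    preservation: str
    access: str


# An abbreviated copy of the Sustainable Formats table.
RULES = [
    FormatRule("Audio", ("ac3", "aiff", "mp3", "wav", "wma", "midi", "xmf", "ogg", "flac"), "wav", "mp3"),
    FormatRule("Email", ("pst", "mbox", "nsf"), "eml", "PDF/A, eml"),
    FormatRule("Open Office XML", ("docx", "pptx", "xlsx"), "original format", "original format"),
    FormatRule("Presentation files", ("ppt", "pptx", "odp"), "original format", "PDF/A"),
    FormatRule("Vector images", ("ai", "eps", "svg"), "svg", "svg, PDF/A"),
    FormatRule("Word Processing files", ("doc", "wpd", "odt", "rtf"), "original format, odt", "PDF/A"),
]


def lookup(extension):
    """Return every table row whose File Formats column lists this extension."""
    ext = extension.lower().lstrip(".")
    return [rule for rule in RULES if ext in rule.extensions]


for rule in lookup(".PPTX"):
    print(rule.media_type, "| preservation:", rule.preservation, "| access:", rule.access)
```

An empty result means the extension is not in this abbreviated table, not that the format is safe; consult the full table and the Preservica Format Registry in that case.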

 


Access

The Digital Preservation Coalition defines access as "...continued, ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy and functionality deemed to be essential for the purposes the digital material was created and/or acquired for."[1] As such, the major purpose of preserving digital content is so that it will remain accessible to future users. Like the preservation workflows detailed above, Special Collections' primary strategy for presentation of digital records is to normalize files and provide access via Preservica's Universal Access WordPress site.

The following instructions are intended to guide Special Collections staff through presentation workflows in Preservica. These steps will identify at-risk file formats, select appropriate target presentation formats, choose correct transformation pathways, and perform file transformations.

At-Risk File Types

The first step in presenting digital records is to identify at-risk file types. File formats and PUIDs were identified during the preservation assessment prior to ingest. Use the information from the file inventory report and the Sustainable Formats table to identify at-risk files and appropriate presentation formats. PUIDs can also be identified in Preservica by right clicking on the file and selecting Properties > Technical > PUID.

Presentation Post-Ingest

Preservica offers multiple migration workflows. 

  1. Using the preservation assessment done during the accession process, select the target folder and review for at-risk file types
  2. Record the PUID of at-risk file types in the target folder. The PUID can be identified by looking in Properties > Technical > PUID
  3. To activate these workflows select the folder or asset, right-click and choose "Create new representation"
  4. Select "Select Representation Type"
  5. Choose the representation type and select "Continue"
  6. Choose "Select Business Rules"
  7. Choose the appropriate migration pathway from the menu
  8. Select "Continue"

Once the transformation is complete, review the files in Explorer.

Refer to Preservica's ArchivesSpace Integration - Getting Started Guide, v5.9, section 4 

Refer to the Processing and Description: Digital Objects LibGuide maintained by UB's Special Collections.


Storage

Special Collections stores digital records on two different storage adapters: Amazon S3 and Amazon Glacier. S3 is primarily used for storing presentation files while Glacier is used to store preservation files. This is governed by a storage policy document and is activated by the Update Storage workflow. Typically, this workflow is run after the files have been processed, meaning the appropriate presentation and preservation manifestations have been created and all arrangement and description of the files is complete. Until this workflow is started, all files (both presentation and preservation) are stored on the S3 adapter.