Digital Humanities: Transforming
Transforming Data with OpenRefine
OpenRefine is a free and open-source tool for performing mass transformations on tabular data.
OpenRefine's easy-to-use interface allows users to perform common data-cleaning actions with one click, such as removing leading/trailing whitespace, deleting duplicate rows, fixing typos everywhere they occur, or splitting multiple values that are stored in a single column into multiple columns. In addition to these simple data-wrangling tasks, OpenRefine has a few more powerful capabilities:
- Clustering: condense all variations and misspellings of a value into a single entity.
- Filtering/Sorting: view only a subset of the data that matches particular criteria, make changes to matching rows, or perform custom sorts by multiple columns.
- Faceting: view the distribution of values across a dataset by "faceting" on a column (this counts how many times unique values occur within a column, allowing you to detect errors in entry or misspellings).
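To make faceting and clustering concrete, here is a minimal Python sketch of the ideas behind them. It is not OpenRefine's own code: the `fingerprint`, `facet`, and `cluster` functions and the sample publisher names are illustrative assumptions, and the key function is a simplified approximation of a "fingerprint"-style key-collision method (trim, lowercase, strip punctuation, sort unique tokens).

```python
import string
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Reduce a value to a normalized key (simplified, hypothetical keying function)."""
    cleaned = value.strip().lower()
    # Remove ASCII punctuation so "Books, Penguin" and "Books Penguin" match.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so word order does not matter.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def facet(values):
    """Count how often each distinct value occurs, like a text facet on a column."""
    return Counter(values)


def cluster(values):
    """Group values that share a key; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Made-up sample column with inconsistent entries.
publishers = ["Penguin Books", "penguin books ", "Books, Penguin", "Oxford UP", "Oxford UP"]

counts = facet(publishers)      # the facet shows three separate "Penguin" spellings
clusters = cluster(publishers)  # clustering proposes merging them into one entity

for value, count in counts.most_common():
    print(repr(value), count)
print(clusters)
```

In this sample the facet lists each Penguin variant once and "Oxford UP" twice, which is how a facet surfaces inconsistent entry; the cluster step then groups the three Penguin variants so they can be merged into a single value.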
Digitizing & Manipulating Text
Do you have print primary sources that you want to digitize for teaching or research? Or do you have digital images, PDFs, or text that you want to manipulate or extract information from? The resources below will provide an introduction to best practices for extracting plain text from images and digitized documents.
Data from PDFs & Images
The following tools allow users to search, edit, and extract data (including images and tables) from PDFs. Most of them rely on a process called Optical Character Recognition (OCR), which converts scanned images into editable, extractable text. The quality and accuracy of the output vary depending on the original document, and handwriting is especially challenging.
ABBYY FineReader – proprietary subscription software that allows fine-tuned OCR of PDF documents, with a variety of outputs, including Microsoft Word, Excel, HTML, Rich Text, and plain text
Adobe Acrobat Pro DC – proprietary subscription software that allows a range of PDF editing options, including Optical Character Recognition (OCR). At UB there is a computer editing station with this software in Multimedia Services in Silverman Library
Tabula – a free open-source tool that allows you to extract data from PDF tables, although your PDFs must be text-based rather than image-based
Tesseract – a free open-source command-line OCR engine that runs on multiple operating systems and supports over 100 languages; its documentation also links to third-party applications that provide a graphical user interface (GUI)
Transcription processes will differ depending on the project goals and sources. Some transcriptions are text-based and include text encoding and format notations along with transcription. Other projects are more concerned with the transcription of text for enhanced access and searchability. Researchers may develop their own system of transcription for different types of sources (e.g. land deeds, manuscript census records, etc.) and may directly transcribe data into spreadsheets. Below are some examples of guidelines for a variety of projects.
Digital Humanities Workbench - This site provides an overview of transcription and suggests tools and projects.
Transcribe Bentham - This project includes transcription and encoding using TEI (Text Encoding Initiative).
Smithsonian Transcription Center - This site provides basic and advanced transcription guidelines for volunteers.
The transcription of speeches and interviews increases their accessibility and opens up possibilities for textual analysis, annotation, and multimodal projects. Automated transcription is improving, but its accuracy varies widely and often needs to be used in conjunction with manual transcription. The tools listed below can facilitate both processes.
Express Scribe - Free version for personal use. Facilitates manual transcription.
OTranscribe - Free browser-based manual transcription tool.
Temi - Automatic transcription for $0.10/minute.
Trint - Automatic transcription for $15/hour.
TunesToTube - Allows you to upload an MP3 audio file with a picture to YouTube for captioning.
Editing Audio & Video
Silverman Library Multimedia Services at UB has dedicated workstations for audio and video editing and staff who can assist with projects.
Audacity - Audacity is free and open-source audio editing software.
Hindenburg Journalist or Pro - Hindenburg is a proprietary multi-track audio editor that allows you to organize your clips and tracks with an easy-to-use clipboard. The audio editing is non-destructive. They offer discounted licenses for individuals and educators.
Adobe Audition CC - Audition is a proprietary professional audio workstation available as part of the Adobe Creative Cloud suite on editing station 1 in Silverman Library.
Reaper - A proprietary digital audio workstation and MIDI sequencer software. They offer discounted licenses for individuals and educators.
GarageBand (Mac only) - Popular proprietary music recording software. Requires a third-party audio interface for multi-track recording.
Video & Audio Recording
Video Recording Studios - UB Libraries has Video Recording Studios in Silverman Library that can be reserved online.
Video Recording Equipment - UB Libraries has several camcorders and a GoPro Hero4 Action Camera that can be reserved and borrowed.
Audio Equipment - UB Libraries Multimedia Services has a range of audio recording equipment that can be borrowed, including microphones, mic stands, and high-quality recorders.
Web scraping is a way to extract large amounts of data from websites. There are a variety of command-line and web-based tools and software available. The following tutorials and documentation show how to use OpenRefine, Wget, Python, WebRecorder.io, Import.io, Social Feed Manager, and more. They are geared towards history, but can be used for any humanities work.
Social Feed Manager (for building social media collections)
- Fetching and Parsing Data from the Web with OpenRefine
- Data Mining the Internet Archive Collection
- Applied Archival Downloading with Wget
- Intro to Beautiful Soup
- Downloading Multiple Records Using Query Strings
- Automated Downloading with Wget
DocNow Community and Tools (for social media content)
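As a rough illustration of what the scraping tutorials above cover, the sketch below uses only Python's standard library to build a query-string URL and to pull links out of an HTML page. The `example.org` address, the `q` and `page` parameters, and the sample HTML are made up for illustration; a real project would fetch the page over the network (for example with `urllib.request.urlopen`) and should follow the site's terms of use.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical base address; replace with the site you are studying.
BASE_URL = "https://example.org/archive/"


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


# Build a search URL from a query string (parameter names are assumptions).
search_url = BASE_URL + "search?" + urlencode({"q": "land deeds", "page": 2})

# Stand-in for a downloaded results page, so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <a href="records/1.html">Record 1</a>
  <a href="records/2.html">Record 2</a>
  <a href="/about">About</a>
  <a>No link here</a>
</body></html>
"""

parser = LinkExtractor(BASE_URL)
parser.feed(SAMPLE_HTML)

print(search_url)
for link in parser.links:
    print(link)
```

Changing the `page` value in a loop is the basic idea behind downloading multiple records with query strings; libraries such as Beautiful Soup offer a more convenient way to do the parsing step.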