In digital humanities, METS (Metadata Encoding and Transmission Standard) and ALTO (Analyzed Layout and Text Object) are common XML standards for describing the structure and content of digitized text documents such as newspapers, books, and manuscripts. Researchers working with digitized archives or libraries often encounter these files and need effective ways to process them. In this post, I walk through an R script that streamlines the extraction of text blocks or articles from METS and ALTO files, making the textual content easier to analyze.
Before diving into the code, here is some background on the two formats:
METS: An XML schema that describes the structure of a digital library object but does not encode the object’s actual textual content.
ALTO: This XML standard is often used alongside METS and describes the content of text pages, such as the layout of text blocks, lines, words, and the textual content itself.
Together, METS and ALTO describe both the structure and the content of digitized texts, and the combination is widely used as a standard in newspaper digitization.
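To make the ALTO side more concrete, here is a minimal sketch of how word-level content can be read into a data frame. It is not the script discussed in this post. It assumes the xml2 package is installed, uses a tiny hand-written ALTO fragment with invented IDs and words, and the column names (block_id, word, wc) are my own choice for illustration.

```r
library(xml2)

# Hand-written ALTO fragment for illustration only (IDs and words are invented)
alto_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="P1_TB00001">
          <TextLine ID="P1_TL00001">
            <String ID="P1_ST00001" CONTENT="Neue" WC="0.95"/>
            <SP/>
            <String ID="P1_ST00002" CONTENT="Zeitung" WC="0.42"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="P1_TB00002">
          <TextLine ID="P1_TL00002">
            <String ID="P1_ST00003" CONTENT="Bern" WC="0.88"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>'

doc <- read_xml(alto_xml)

# Remove the default namespace so the XPath expressions can use plain element names
xml_ns_strip(doc)

# Every String element is one recognized word
strings <- xml_find_all(doc, ".//String")

# One row per word: the enclosing text block, the word, and its confidence (WC)
words <- data.frame(
  block_id = xml_attr(xml_find_first(strings, "ancestor::TextBlock"), "ID"),
  word     = xml_attr(strings, "CONTENT"),
  wc       = as.numeric(xml_attr(strings, "WC")),
  stringsAsFactors = FALSE
)

# Collapse the words of each block into a single text string
blocks <- aggregate(word ~ block_id, data = words, FUN = paste, collapse = " ")
```

For a real file, the string literal would be replaced by a file path passed to read_xml(). Grouping blocks into articles additionally requires the METS file, which this sketch leaves out.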
The script produces a data frame that supports several use cases:
Text Analysis: Analyze the textual content, for example with topic modeling, sentiment analysis, or keyword extraction.
Data Cleaning: Given the word confidence scores, you can identify and address low-confidence words or OCR errors.
Structural Analysis: Understand the layout and structure of the document, which might be crucial for certain research questions.
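As an illustration of the data cleaning use case, the following sketch flags low-confidence words in a word-level data frame. The data frame is built by hand here, and its column names (block_id, word, wc) as well as the threshold of 0.5 are assumptions made for this example; the actual output of the script may be structured differently.

```r
# Hypothetical word-level data frame; column names and values are assumptions
words <- data.frame(
  block_id = c("TB1", "TB1", "TB1", "TB2"),
  word     = c("Neue", "Zeitnng", "von", "Bern"),
  wc       = c(0.95, 0.42, 0.91, 0.88),
  stringsAsFactors = FALSE
)

# Arbitrary cut-off; a sensible value depends on the OCR engine and the material
wc_threshold <- 0.5

# Words whose confidence falls below the threshold are candidates for correction
suspect <- words[words$wc < wc_threshold, ]

# Share of low-confidence words per block, to spot badly recognized passages
low_share <- tapply(words$wc < wc_threshold, words$block_id, mean)
```

Blocks with a high share of low-confidence words could then be excluded from the analysis or checked manually against the scanned page.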
If you want to test the code below, you’ll need some METS/ALTO files. For testing purposes, I provide files from an 1802 Swiss German newspaper on my website: one METS file and four ALTO files.
You can find them here: [ METS ] [ ALTO1 | ALTO2 | ALTO3 | ALTO4 ]
The code: