Extracting text blocks and processing METS/ALTO data with R

[ #R #NLP ]

In digital humanities, METS (Metadata Encoding and Transmission Standard) and ALTO (Analyzed Layout and Text Object) are common XML standards used to describe the structure and content of digitized text documents, such as newspapers, books, and manuscripts. When working with digitized archives or libraries, researchers often encounter these files and need effective ways to process them. In this post I walk through an R script that extracts text blocks or articles from METS and ALTO files, making the textual content easier to analyze.

Before diving into the code, here is some background on the two formats:

METS: An XML schema for describing the structure of digital library objects. It does not encode the actual textual content of the object.

ALTO: This XML standard is often used alongside METS and describes the content of text pages, such as the layout of text blocks, lines, words, and the textual content itself.

Together, METS and ALTO provide both the structural and content description of digital texts and are the current industry standard for newspaper digitization.
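To make this division of labour concrete, here is a minimal sketch using two invented, heavily simplified fragments. They are not taken from the test files, and real files contain many more elements and attributes. The METS `area` element points at a block ID via its `BEGIN` attribute, and the ALTO file holds the words of that block:

```r
library(xml2)

# Hypothetical METS fragment: one logical article pointing at block "TB1"
mets_xml <- '<mets xmlns="http://www.loc.gov/METS/">
  <structMap TYPE="LOGICAL">
    <div TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1" LABEL="Example article">
      <fptr><area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="TB1"/></fptr>
    </div>
  </structMap>
</mets>'

# Hypothetical ALTO fragment: one text block with two words
alto_xml <- '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String ID="S1" CONTENT="Zeitung" WC="0.95"/>
        <SP/>
        <String ID="S2" CONTENT="1802" WC="0.60"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>'

# Strip namespaces so plain XPath expressions work
mets <- xml_ns_strip(read_xml(mets_xml))
alto <- xml_ns_strip(read_xml(alto_xml))

# METS tells us which block belongs to the article ...
block_id <- xml_attr(xml_find_first(mets, ".//area"), "BEGIN")

# ... and ALTO holds the words of that block
strings <- xml_find_all(alto, paste0(".//*[@ID='", block_id, "']//String"))
block_text <- paste(xml_attr(strings, "CONTENT"), collapse = " ")
block_text
# "Zeitung 1802"
```

The script further down follows the same idea, but walks both trees recursively and joins the results as data frames.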

The data frame produced by the script supports several use cases:

Text Analysis: Analyze textual content, such as conducting topic modeling, sentiment analysis, or keyword extraction.

Data Cleaning: Given the word confidence scores, you can identify and address low-confidence words or OCR errors.

Structural Analysis: Understand the layout and structure of the document, which might be crucial for certain research questions.
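As a small illustration of the data cleaning use case, the sketch below flags low-confidence words in a toy data frame shaped like the word-level `alto` data frame the script builds. The values and the cut-off `wc_threshold` are invented for illustration; a sensible threshold depends on the OCR engine and the material.

```r
library(dplyr)

# Toy word-level data; the values are invented for illustration
words <- tibble(
  blockid = c("TB1", "TB1", "TB1", "TB2"),
  content = c("Die", "Zeitnng", "von", "Bern"),
  WC      = c(0.98, 0.41, 0.95, 0.88)
)

wc_threshold <- 0.5  # hypothetical cut-off, adjust to your data

# Words that probably need manual checking or correction
low_conf <- words %>%
  filter(WC < wc_threshold)

# Per-block quality: mean confidence and share of low-confidence words
block_quality <- words %>%
  group_by(blockid) %>%
  summarize(mean_WC   = mean(WC),
            share_low = mean(WC < wc_threshold))

low_conf
block_quality
```

Blocks with a high `share_low` could be excluded from a text analysis or queued for correction.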

If you want to test the code below, you’ll need some METS/ALTO files. For testing purposes, I provide files from an 1802 Swiss German newspaper on my website: one METS file and four ALTO files. You can find them here: [ METS ] [ ALTO1 | ALTO2 | ALTO3 | ALTO4 ]

The code:

# Preliminaries
#-------------------------------------------------------------------------------
# Load required libraries
library(stringr)  # For string manipulation
library(xml2)     # For XML processing
library(purrr)    # For map() and flatten()
library(furrr)    # Parallelized purrr operations
library(dplyr)    # For data manipulation

# For parallel computing
plan(multisession)

# Define a function to process a METS file and convert its XML content to a data frame
mets_to_df <- function(foldername) {
  # Find the METS file in the specified directory
  # (file name ends in "mets.xml", upper or lower case)
  metsfilename <- list.files(path = foldername,
                             pattern = "mets\\.xml$",
                             ignore.case = TRUE,
                             all.files = TRUE,
                             recursive = TRUE,
                             full.names = TRUE)
  
  # Read the XML METS file
  metsfile <- read_xml(metsfilename)
  
  # Extract the 'structMap' with 'LOGICAL' type
  doc <- metsfile %>% 
    xml_ns_strip() %>% 
    xml_find_all( ".//structMap[@TYPE='LOGICAL']")
  
  # Initialization
  dmdid <- NA
  title <- NA
  artType <- NA
  df <- tibble()
  
  # Define a recursive function to traverse METS XML nodes and extract information
  node_to_df <- function(node) {
    children <- xml_children(node)
    
    if (length(children) >= 1) {
      
      # Extract attributes if they exist
      if (!is.na(xml_attr(node, "DMDID"))) {
        dmdid <<- xml_attr(node, "DMDID")
        artType <<- xml_attr(node, "TYPE")
        title <<- xml_attr(node, "LABEL")
      }
      
      # Recursively process child nodes
      df <- 
        children %>%
        map(node_to_df) %>%
        map(list)  %>%
        flatten() %>%
        map(bind_rows)
      
    } else {
      # Create a data frame row for the current node
      df <-
        bind_rows(df, tibble(title=title,
                             artType = artType,
                             dmdid = dmdid,
                             betype = xml_attr(node, "BETYPE"),
                             fileid = xml_attr(node, "FILEID"),
                             begin = xml_attr(node, "BEGIN")))
    }
    bind_rows(df)
  }
  node_to_df(doc)
}

# Define a function to process an ALTO file and convert its XML content to a data frame
alto_to_df <- function(altofilename) {
  # Read the XML ALTO file
  altofile <- read_xml(altofilename)
  
  # Extract the 'PrintSpace' node
  doc <- altofile %>% 
    xml_ns_strip() %>% 
    xml_find_all( ".//PrintSpace")
  
  # Check whether the page contains any 'ComposedBlock' nodes
  has_composedblock <- length(xml_find_all(doc, ".//ComposedBlock")) > 0
  
  # Initialization
  blockid <- NA_character_
  df <- tibble()
  
  # Define a recursive function to traverse ALTO XML nodes and extract information
  node_to_df <- function(node) {
    
    children <- xml_children(node)
    
    if (length(children) >= 1) {
      # Determine the type of block based on the XML structure
      if (has_composedblock) {
        if (xml_name(node) == "ComposedBlock") {
          blockid <<- xml_attr(node,"ID")
        }
      }
      else { 
        if (xml_name(node) == "TextBlock") {
          blockid <<- xml_attr(node,"ID")
        }
      }
      
      # Recursively process child nodes
      df <- 
        children %>%
        map(node_to_df) %>%
        map(list)  %>%
        flatten() %>%
        map(bind_rows)
      
    } else {
      # Create a data frame row for the current node
      df <-
        bind_rows(df, tibble(
          blockid = blockid,
          content = if_else(xml_has_attr(node, "SUBS_CONTENT"), xml_attr(node, "SUBS_CONTENT"), xml_attr(node, "CONTENT")),
          WC = as.numeric(xml_attr(node, "WC")),
          id = xml_attr(node, "ID"))) 
    }
    bind_rows(df)
  }
  node_to_df(doc)
}


# Process the files
#-------------------------------------------------------------------------------
# Define the directory containing the METS and ALTO files
metsfolder <- "27_01"

# List all ALTO files in the specified directory. Files are identified by a
# name ending in four digits followed by .xml
altofiles <- list.files(path = metsfolder,
                        pattern = "[0-9]{4}\\.xml$",
                        all.files = TRUE,
                        recursive = TRUE,
                        full.names = TRUE)

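# Process each ALTO file in parallel and convert it to a data frame. Then drop
# rows without an ID or content, and drop rows that repeat the previous row's
# content (e.g. the two halves of a hyphenated word sharing one SUBS_CONTENT)
alto <- future_map_dfr(altofiles, alto_to_df) %>%
  filter(!is.na(id), !is.na(content)) %>%
  filter(content != dplyr::lag(content, default="1"))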

# Process the METS file and convert it to a data frame
mets <- mets_to_df(metsfolder) 

# Join METS and ALTO data on the block ID, then collapse each block into one
# row with its concatenated text, mean word confidence, and article metadata
paragraphs <- left_join(mets, alto, by=c("begin" = "blockid")) %>%
  group_by(begin) %>%
  summarize(text = paste0(content, collapse = " "),
            WC = mean(WC, na.rm=T),
            title = first(title),
            artType = first(artType),
            dmdid = first(dmdid)) %>%
  filter(!is.na(WC)) %>%
  mutate(text = stringi::stri_replace_all(text, "ss", fixed = "\u00DF")) # Replace eszett (ß) with 'ss'
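
One step in the script that is easy to overlook is the filter that compares each word with the previous one. As far as I understand the ALTO format, both halves of a word hyphenated across a line break carry the full word in `SUBS_CONTENT`, so the full word would otherwise appear twice in a row. The toy example below (invented data, same column names as in the script) shows the effect of that filter in isolation:

```r
library(dplyr)

# Toy word list: "Zeitung" appears twice in a row, as it would when both
# halves of a hyphenated word carry the same SUBS_CONTENT
words <- tibble(
  id      = c("S1", "S2", "S3", "S4"),
  content = c("Die", "Zeitung", "Zeitung", "erscheint")
)

# Keep a row only if its content differs from the previous row's content;
# the default "1" is a dummy value for the first row, which has no predecessor
deduped <- words %>%
  filter(content != lag(content, default = "1"))

paste(deduped$content, collapse = " ")
# "Die Zeitung erscheint"
```

Note that this filter also drops genuine repetitions of the same word in running text, so it is a pragmatic shortcut rather than a precise treatment of hyphenation.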