textshape

Tools for reshaping text data

42
3
R

title: “textshape”
date: “r format(Sys.time(), '%d %B, %Y')
output:
md_document:
toc: true

library(knitr)
knitr::opts_chunk$set(fig.path = "tools/figure/")
combine <- textshape::combine

library(tidyverse)
library(magrittr)
library(ggstance)
library(textshape)
library(gridExtra)
library(viridis)
library(quanteda)
library(gofastr)

## desc <- suppressWarnings(readLines("DESCRIPTION"))
## regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
## loc <- grep(regex, desc)
## ver <- gsub(regex, "\\2", desc[loc])
## verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img ## src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
## verbadge <- ''
knit_hooks$set(htmlcap = function(before, options, envir) {
  if(!before) {
    paste('<p class="caption"><b><em>',options$htmlcap,"</em></b></p>",sep="")
    }
    })
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)

Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

textshape is small suite of text reshaping and restructuring functions. Many of these functions are descended from tools in the qdapTools package. This brings reshaping tools under one roof with specific functionality of the package limited to text reshaping.

Other R packages provide some of the same functionality. textshape differs from these packages in that it is designed to help the user take unstructured data (or implicitly structured), extract it into a structured format, and then restructure into common text analysis formats for use in the next stage of the text analysis pipeline. The implicit structure of seemingly unstructured data is often detectable/expressible by the researcher. textshape provides tools (e.g., split_match) to enable the researcher to convert this tacit knowledge into a form that can be used to reformat data into more structured formats. This package is meant to be used jointly with the textclean package, which provides cleaning and text normalization functionality.

Functions

Most of the functions split, expand, grab, or tidy a vector, list, data.frame, or DocumentTermMatrix. The combine, duration, mtabulate, & flatten functions are notable exceptions. The table below describes the functions and their use:

Function Used On Description
combine vector, list, data.frame Combine and collapse elements
tidy_list list of vectors or data.frames Row bind a list and repeat list names as id column
tidy_vector vector Column bind a named atomic vector’s names and values
tidy_table table Column bind a table’s names and values
tidy_matrix matrix Stack values, repeat column row names accordingly
tidy_dtm/tidy_tdm DocumentTermMatrix Tidy format DocumentTermMatrix/TermDocumentMatrix
tidy_colo_dtm/tidy_colo_tdm DocumentTermMatrix Tidy format of collocating words from a DocumentTermMatrix/TermDocumentMatrix
duration vector, data.frame Get duration (start-end times) for turns of talk in n words
from_to vector, data.frame Prepare speaker data for a flow network
mtabulate vector, list, data.frame Dataframe/list version of tabulate to produce count matrix
flatten list Flatten nested, named list to single tier
unnest_text data.frame Unnest a nested text column
split_index vector, list, data.frame Split at specified indices
split_match vector Split vector at specified character/regex match
split_portion vector* Split data into portioned chunks
split_run vector, data.frame Split runs (e.g., “aaabbbbcdddd”)
split_sentence vector, data.frame Split sentences
split_speaker data.frame Split combined speakers (e.g., “Josh, Jake, Jim”)
split_token vector, data.frame Split words and punctuation
split_transcript vector Split speaker and dialogue (e.g., “greg: Who me”)
split_word vector, data.frame Split words
grab_index vector, data.frame, list Grab from an index up to a second index
grab_match vector, data.frame, list Grab from a regex match up to a second regex match
column_to_rownames data.frame Add a column as rownames
cluster_matrix matrix Reorder column/rows of a matrix via hierarchical clustering

*Note: Text vector accompanied by aggregating grouping.var argument, which can be in the form of a vector, list, or data.frame

Installation

To download the development version of textshape:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape")

Contact

You are welcome to:

Contributing

Contributions are welcome from anyone subject to the following rules:

  • Abide by the code of conduct.
  • Follow the style conventions of the package (indentation, function & argument naming, commenting, etc.)
  • All contributions must be consistent with the package license (GPL-2)
  • Submit contributions as a pull request. Clearly state what the changes are and try to keep the number of changes per pull request as low as possible.
  • If you make big changes, add your name to the ‘Author’ field.

Examples

The main shaping functions can be broken into the categories of (a) binding, (b) combining, (c) tabulating, (d) spanning, (e) splitting, (f) grabbing & (e) tidying. The majority of functions in textshape fall into the category of splitting and expanding (the semantic opposite of combining). These sections will provide example uses of the functions from textshape within the three categories.

Loading Dependencies

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, magrittr, ggstance, viridis, gridExtra, quanteda)
pacman::p_load_current_gh('trinker/gofastr', 'trinker/textshape')

Tidying

The tidy_xxx functions convert untidy structures into tidy format. Tidy formatted text data structures are particularly useful for interfacing with ggplot2, which expects this form.

The tidy_list function is used in the style of do.call(rbind, list(x1, x2)) as a convenient way to bind together multiple named data.frames or vectorss into a single data.frame with the list names acting as an id column. The data.frame bind is particularly useful for binding transcripts from different observations. Additionally, tidy_vector and tidy_table are provided for cbinding a table’s or named atomic vector’s values and names as separate columns in a data.frame. Lastly, tidy_dtm/tidy_tdm provide convenient ways to tidy a DocumentTermMatrix or TermDocumentMatrix.

A Vector

x <- list(p=1:500, r=letters)
tidy_list(x)

A Dataframe

x <- list(p=mtcars, r=mtcars, z=mtcars, d=mtcars)
tidy_list(x) 

A Named Vector

x <- setNames(
    sample(LETTERS[1:6], 1000, TRUE), 
    sample(state.name[1:5], 1000, TRUE)
)
tidy_vector(x)

A Table

x <- table(sample(LETTERS[1:6], 1000, TRUE))
tidy_table(x)

A Matrix

mat <- matrix(1:16, nrow = 4,
    dimnames = list(LETTERS[1:4], LETTERS[23:26])
)

mat

tidy_matrix(mat)

With clustering (column and row reordering) via the cluster_matrix function.

## plot heatmap w/o clustering
wo <- mtcars %>%
    cor() %>%
    tidy_matrix('car', 'var') %>%
    ggplot(aes(var, car, fill = value)) +
         geom_tile() +
         scale_fill_viridis(name = expression(r[xy])) +
         theme(
             axis.text.y = element_text(size = 8)   ,
             axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45),   
             legend.position = 'bottom',
             legend.key.height = grid::unit(.1, 'cm'),
             legend.key.width = grid::unit(.5, 'cm')
         ) +
         labs(subtitle = "With Out Clustering")

## plot heatmap w clustering
w <- mtcars %>%
    cor() %>%
    cluster_matrix() %>%
    tidy_matrix('car', 'var') %>%
    mutate(
        var = factor(var, levels = unique(var)),
        car = factor(car, levels = unique(car))        
    ) %>%
    group_by(var) %>%
    ggplot(aes(var, car, fill = value)) +
         geom_tile() +
         scale_fill_viridis(name = expression(r[xy])) +
         theme(
             axis.text.y = element_text(size = 8)   ,
             axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45),   
             legend.position = 'bottom',
             legend.key.height = grid::unit(.1, 'cm'),
             legend.key.width = grid::unit(.5, 'cm')               
         ) +
         labs(subtitle = "With Clustering")

grid.arrange(wo, w, ncol = 2)

A DocumentTermMatrix

The tidy_dtm and tidy_tdm functions convert a DocumentTermMatrix or TermDocumentMatrix into a tidied data set.

my_dtm <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))

tidy_dtm(my_dtm) %>%
    tidyr::extract(doc, c("time", "turn", "sentence"), "(\\d)_(\\d+)\\.(\\d+)") %>%
    mutate(
        time = as.numeric(time),
        turn = as.numeric(turn),
        sentence = as.numeric(sentence)
    ) %>%
    tbl_df() %T>%
    print() %>%
    group_by(time, term) %>%
    summarize(n = sum(n)) %>%
    group_by(time) %>%
    arrange(desc(n)) %>%
    slice(1:10) %>%
    mutate(term = factor(paste(term, time, sep = "__"), levels = rev(paste(term, time, sep = "__")))) %>%
    ggplot(aes(x = n, y = term)) +
        geom_barh(stat='identity') +
        facet_wrap(~time, ncol=2, scales = 'free_y') +
        scale_y_discrete(labels = function(x) gsub("__.+$", "", x))

A DocumentTermMatrix of Collocations

The tidy_colo_dtm and tidy_colo_tdm functions convert a DocumentTermMatrix or TermDocumentMatrix into a collocation matrix and then a tidied data set.

my_dtm <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))
sw <- unique(c(
    lexicon::sw_jockers, 
    lexicon::sw_loughran_mcdonald_long, 
    lexicon::sw_fry_1000
))

tidy_colo_dtm(my_dtm) %>%
    tbl_df() %>%
    filter(!term_1 %in% c('i', sw) & !term_2 %in% sw) %>%
    filter(term_1 != term_2) %>%
    unique_pairs() %>%
    filter(n > 15) %>%
    complete(term_1, term_2, fill = list(n = 0)) %>%
    ggplot(aes(x = term_1, y = term_2, fill = n)) +
        geom_tile() +
        scale_fill_gradient(low= 'white', high = 'red') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Combining

The combine function acts like paste(x, collapse=" ") on vectors and lists of vectors. On dataframes multiple text cells are pasted together within grouping variables.

A Vector

x <- c("Computer", "is", "fun", ".", "Not", "too", "fun", ".")
combine(x)

A Dataframe

(dat <- split_sentence(DATA))
combine(dat[, 1:5, with=FALSE])

Tabulating

mtabulate allows the user to transform data types into a dataframe of counts.

A Vector

(x <- list(w=letters[1:10], x=letters[1:5], z=letters))
mtabulate(x)

## Dummy coding
mtabulate(mtcars$cyl[1:10])

A Dataframe

(dat <- data.frame(matrix(sample(c("A", "B"), 30, TRUE), ncol=3)))
mtabulate(dat)
t(mtabulate(dat))

Spanning

Often it is useful to know the duration (start-end) of turns of talk. The duration function calculates start-end durations as n words.

A Vector

(x <- c(
    "Mr. Brown comes! He says hello. i give him coffee.",
    "I'll go at 5 p. m. eastern time.  Or somewhere in between!",
    "go there"
))
duration(x)
# With grouping variables
groups <- list(group1 = c("A", "B", "A"), group2 = c("red", "red", "green"))
duration(x, groups)

A Dataframe

duration(DATA)

Gantt Plot

library(ggplot2)
ggplot(duration(DATA), aes(x = start, xend = end, y = person, yend = person, color = sex)) +
    geom_segment(size=4) +
    xlab("Duration (Words)") +
    ylab("Person")

Splitting

The following section provides examples of available splitting functions.

Indices

split_index allows the user to supply the integer indices of where to split a data type.

A Vector

split_index(
    LETTERS, 
    indices = c(4, 10, 16), 
    names = c("dog", "cat", "chicken", "rabbit")
)

A Dataframe

Here I calculate the indices of every time the vs variable in the mtcars data set changes and then split the dataframe on those indices. The change_index function is handy for extracting the indices of changes in runs within an atomic vector.

(vs_change <- change_index(mtcars[["vs"]]))
split_index(mtcars, indices = vs_change)

Matches

split_match splits on elements that match exactly or via a regular expression match.

Exact Match

set.seed(15)
(x <- sample(c("", LETTERS[1:10]), 25, TRUE, prob=c(.2, rep(.08, 10))))

split_match(x)
split_match(x, split = "C")
split_match(x, split = c("", "C"))
## Don't include
split_match(x, include = 0)
## Include at beginning
split_match(x, include = 1)
## Include at end
split_match(x, include = 2)

Regex Match

Here I use the regex "^I" to break on any vectors containing the capital letter I as the first character.

split_match(DATA[["state"]], split = "^I", regex=TRUE, include = 1)

Portions

At times it is useful to split texts into portioned chunks, operate on the chunks and aggregate the results. split_portion allows the user to do this sort of text shaping. We can split into n chunks per grouping variable (via n.chunks) or into chunks of n length (via n.words).

A Vector

with(DATA, split_portion(state, n.chunks = 10))
with(DATA, split_portion(state, n.words = 10))

A Dataframe

with(DATA, split_portion(state, list(sex, adult), n.words = 10))

Runs

split_run allows the user to split up runs of identical characters.

x1 <- c(
     "122333444455555666666",
     NA,
     "abbcccddddeeeeeffffff",
     "sddfg",
     "11112222333"
)

x <- c(rep(x1, 2), ">>???,,,,....::::;[[")

split_run(x)

Dataframe

DATA[["run.col"]] <- x
split_run(DATA)
## Reset the DATA dataset
DATA <- textshape::DATA

Sentences

split_sentece provides a mapping + regex approach to splitting sentences. It is less accurate than the Stanford parser but more accurate than a simple regular expression approach alone.

A Vector

(x <- paste0(
    "Mr. Brown comes! He says hello. i give him coffee.  i will ",
    "go at 5 p. m. eastern time.  Or somewhere in between!go there"
))
split_sentence(x)

A Dataframe

split_sentence(DATA)

Speakers

Often speakers may talk in unison. This is often displayed in a single cell as a comma separated string of speakers. Some analysis may require this information to be parsed out and replicated as one turn per speaker. The split_speaker function accomplishes this.

DATA$person <- as.character(DATA$person)
DATA$person[c(1, 4, 6)] <- c("greg, sally, & sam",
    "greg, sally", "sam and sally")
DATA

split_speaker(DATA)
## Reset the DATA dataset
DATA <- textshape::DATA

Tokens

The split_token function split data into words and punctuation.

A Vector

(x <- c(
    "Mr. Brown comes! He says hello. i give him coffee.",
    "I'll go at 5 p. m. eastern time.  Or somewhere in between!",
    "go there"
))
split_token(x)

A Dataframe

 split_token(DATA)

Transcript

The split_transcript function splits vectors with speaker prefixes (e.g., c("greg: Who me", "sarah: yes you!")) into a two column data.frame.

A Vector

(x <- c(
    "greg: Who me", 
    "sarah: yes you!",
    "greg: well why didn't you say so?",
    "sarah: I did but you weren't listening.",
    "greg: oh :-/ I see...",
    "dan: Ok let's meet at 4:30 pm for drinks"
))

split_transcript(x)

Words

The split_word function splits data into words.

A Vector

(x <- c(
    "Mr. Brown comes! He says hello. i give him coffee.",
    "I'll go at 5 p. m. eastern time.  Or somewhere in between!",
    "go there"
))
split_word(x)

A Dataframe

 split_word(DATA)

Grabbing

The following section provides examples of available grabbing (from a starting point up to an ending point) functions.

Indices

grab_index allows the user to supply the integer indices of where to grab (from - up to) a data type.

A Vector

grab_index(DATA$state, from = 2, to = 4)
grab_index(DATA$state, from = 9)
grab_index(DATA$state, to = 3)

A Dataframe

grab_index(DATA, from = 2, to = 4)

A List

grab_index(as.list(DATA$state), from = 2, to = 4)

Matches

grab_match grabs (from - up to) elements that match a regular expression.

A Vector

grab_match(DATA$state, from = 'dumb', to = 'liar')
grab_match(DATA$state, from = '^What are')
grab_match(DATA$state, to = 'we do[?]')
grab_match(DATA$state, from = 'no', to = 'the', ignore.case = TRUE, 
    from.n = 'last', to.n = 'first')

A Dataframe

grab_match(DATA, from = 'dumb', to = 'liar')

A List

grab_match(as.list(DATA$state), from = 'dumb', to = 'liar')

Putting It Together

Eduardo Flores blogged about What the candidates say, analyzing republican debates using R where he demonstrated some scraping and analysis techniques. Here I highlight a combination usage of textshape tools to scrape and structure the text from 4 of the 2015 Republican debates within a magrittr pipeline. The result is a single data.table containing the dialogue from all 4 debates. The code highlights the conciseness and readability of textshape by restructuring Flores scraping with textshape replacements.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, magrittr, xml2)

debates <- c(
    wisconsin = "110908",
    boulder = "110906",
    california = "110756",
    ohio = "110489"
)

lapply(debates, function(x){
    xml2::read_html(paste0("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)) %>%
        rvest::html_nodes("p") %>%
        rvest::html_text() %>%
        textshape::split_index(., grep("^[A-Z]+:", .)) %>%
        #textshape::split_match("^[A-Z]+:", TRUE, TRUE) %>% #equal to line above
        textshape::combine() %>%
        textshape::split_transcript() %>%
        textshape::split_sentence()
}) %>%
    textshape::tidy_list("location")