Stephen's Blog

Working with Text in R

 

Stephen Cheng

Intro

Working with text is a core skill in data analysis: it lets you clean, process, and extract information from unstructured data. In R, regular expressions, base string functions, and packages such as stringr and tidytext make text handling efficient. This guide covers the basics: formatting strings, combining, extracting, and splitting text, and importing text from files.

Packages

We will use the following packages.

  • stringr: for basic text manipulation (part of the tidyverse)
  • readtext: for reading different text files (incl. websites)
  • tesseract: for extracting text from images and scanned PDF files via optical character recognition (OCR)
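
Before running the examples, it can help to check which of these packages are already installed. The following is a minimal sketch; the variable names `pkgs`, `installed`, and `missing_pkgs` are made up for illustration, and the install call is left commented out so nothing is installed by accident.

```r
# Packages used in this post
pkgs <- c("stringr", "readtext", "tesseract")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (possibly none)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# install.packages(missing_pkgs)  # uncomment to install anything missing
```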

Formatting

  • Example text (a character vector of three strings)

    text

    [1] “CDSI - Computing & Data Systems Initiative at McGill”
    [2] “ Phew! It’s getting cold! “
    [3] “The phone number at the CDSI is not 555-111-2222!”

  • Length of text

    nchar(text)

    [1] 52 27 49

  • Trimming leading and trailing whitespace

    library(stringr)
    str_trim(text)

    [1] “CDSI - Computing & Data Systems Initiative at McGill”
    [2] “Phew! It’s getting cold!”
    [3] “The phone number at the CDSI is not 555-111-2222!”

  • Padding with whitespace

    str_pad(text, width = 52, side = "right")

    [1] “CDSI - Computing & Data Systems Initiative at McGill”
    [2] “ Phew! It’s getting cold! “
    [3] “The phone number at the CDSI is not 555-111-2222! “

  • Getting the sort order (the positions, not the sorted values)

    str_order(text)

    [1] 2 1 3

  • Sorting the text itself

    str_sort(text)

    [1] “ Phew! It’s getting cold! “
    [2] “CDSI - Computing & Data Systems Initiative at McGill”
    [3] “The phone number at the CDSI is not 555-111-2222!”

  • Truncating text to a maximum width

    str_trunc(text, 40)

    [1] “CDSI - Computing & Data Systems Initi...”
    [2] “ Phew! It’s getting cold! “
    [3] “The phone number at the CDSI is not 5...”

  • Adding line breaks

    str_wrap(text, 40)

    [1] “CDSI - Computing & Data Systems\nInitiative at McGill”
    [2] “Phew! It’s getting cold!”
    [3] “The phone number at the CDSI is not\n555-111-2222!”

  • Printing with the line breaks rendered (cat() interprets the \n escapes)

    cat(str_wrap(text, 40))

    CDSI - Computing & Data Systems
    Initiative at McGill Phew! It’s getting cold! The phone number at the CDSI is not
    555-111-2222!

  • Converting to lower case

    str_to_lower(text)

    [1] “cdsi - computing & data systems initiative at mcgill”
    [2] “ phew! it’s getting cold! “
    [3] “the phone number at the cdsi is not 555-111-2222!”

  • Converting to title case

    str_to_title(text)

    [1] “Cdsi - Computing & Data Systems Initiative At Mcgill”
    [2] “ Phew! It’s Getting Cold! “
    [3] “The Phone Number At The Cdsi Is Not 555-111-2222!”
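
The post never shows how `text` is created, so here is a minimal sketch of one way to define it and try a few of the formatting functions. The exact whitespace around the second string is an assumption, chosen so that it matches the lengths reported by nchar() above. str_squish() and str_to_upper() are stringr functions not used elsewhere in this post; they are included only as related examples.

```r
library(stringr)

# Assumed reconstruction of the example vector. The two leading spaces and
# one trailing space in the second string are a guess that reproduces the
# lengths 52, 27, and 49.
text <- c(
  "CDSI - Computing & Data Systems Initiative at McGill",
  "  Phew! It's getting cold! ",
  "The phone number at the CDSI is not 555-111-2222!"
)

nchar(text)                       # 52 27 49

# str_trim() removes whitespace from both ends by default
str_trim(text[2])                 # "Phew! It's getting cold!"

# side = "left" removes only the leading whitespace
str_trim(text[2], side = "left")

# str_squish() also collapses repeated whitespace inside the string
str_squish("a   b  c ")           # "a b c"

# Upper case, the counterpart of str_to_lower()
str_to_upper(text[2])
```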

Text Manipulation

  • Combining text

    text1 <- "A"
    text2 <- "number"
    paste(text1, text2)

    [1] “A number”

  • Setting the separator

    paste(text1, text2, sep = "")

    [1] “Anumber”

    # paste0() is shorthand for paste(..., sep = "")
    paste0(text1, text2)

    [1] “Anumber”

  • Extracting text

    str_sub(text, 1, 9)

    [1] “CDSI - Co” “ Phew! I” “The phone”

  • Extracting text from the right-hand side (negative positions count from the end)

    str_sub(text, -4, -1)

    [1] “Gill” “ld! “ “222!”

  • Splitting text

    str_split(text, boundary("word"))

    [[1]]
    [1] “CDSI” “Computing” “Data” “Systems” “Initiative”
    [6] “at” “McGill”
    [[2]]
    [1] “Phew” “It’s” “getting” “cold”
    [[3]]
    [1] “The” “phone” “number” “at” “the” “CDSI” “is” “not”
    [9] “555” “111” “2222”

  • Splitting text on ‘at’ (the pattern also matches inside words)

    str_split(text[1], "at")

    [[1]]
    [1] “CDSI - Computing & D” “a Systems Initi” “ive “
    [4] “ McGill”

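    # str_split_1() takes a single string and returns a character vector
    # instead of a list (requires stringr 1.5.0 or later)
    str_split_1(text[1], "at")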

    [1] “CDSI - Computing & D” “a Systems Initi” “ive “
    [4] “ McGill”
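
Two details from this section are easy to mix up: `sep` versus `collapse` in paste(), and literal versus regular-expression patterns when splitting. The sketch below illustrates both. The `words` and `parts` vectors are made up for illustration, and str_c() and fixed() are stringr functions that do not appear in the examples above.

```r
library(stringr)

words <- c("A", "number", "of", "words")

# collapse joins the elements of ONE vector into a single string,
# whereas sep joins the corresponding elements of SEVERAL vectors.
paste(words, collapse = " ")       # "A number of words"
paste("A", "number", sep = "-")    # "A-number"

# str_c() is the stringr counterpart of paste0(): no separator by default.
str_c("A", "number")               # "Anumber"

# fixed() matches the pattern literally instead of as a regular expression,
# which matters when the separator contains characters such as "." or "|".
parts <- str_split_1("Computing & Data", fixed(" & "))
parts                              # "Computing" "Data"
```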

Importing Text

  • Importing text from a PDF or image (OCR)

    library(tesseract)
    text3 <- ocr("data/Canadian_Geographer.pdf")

  • Importing with readtext

    library(readtext)
    text4 <- readtext("data/CAG_Newsletter.pdf")
    text5 <- readtext("https://laws-lois.justice.gc.ca/eng/acts/o-3.01/fulltext.htm")
    text6 <- readtext("data/PhD_Guide.doc")
    text7 <- readtext("data/New_Students_Guide.docx")

  • Reading a full directory of text files

    text <- readtext("data/")
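
To see what readtext() returns without needing the files referenced above, the following sketch writes two small text files to a temporary folder and reads them back. The folder name, file names, and file contents are made up for illustration. The result is a data frame with one row per document and the columns `doc_id` and `text`.

```r
library(readtext)

# Create a throwaway folder with two small text files (hypothetical content)
tmp <- file.path(tempdir(), "readtext_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("First demo document.", file.path(tmp, "a.txt"))
writeLines("Second demo document.", file.path(tmp, "b.txt"))

# A wildcard pattern reads every matching file in one call
docs <- readtext(file.path(tmp, "*.txt"))

names(docs)       # includes "doc_id" and "text"
nrow(docs)        # one row per file: 2
docs$doc_id       # identifiers derived from the file names
nchar(docs$text)  # number of characters in each document
```

Because the text sits in an ordinary character column, the stringr functions from the earlier sections can be applied to `docs$text` directly.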

Oct 19, 2023
