Working with text data is a critical skill in data analysis, enabling you to process, clean, and extract valuable insights from unstructured information. In R, powerful tools like regular expressions, string manipulation functions, and specialized packages such as stringr and tidytext make text handling efficient and intuitive. This guide introduces essential techniques and functions for working with text in R, helping you tackle tasks ranging from basic string operations to advanced text mining and natural language processing.
We will use the following packages: stringr, tesseract, and readtext.
Text
text
[1] “CDSI - Computing & Data Systems Initiative at McGill”
[2] “ Phew! It’s getting cold! “
[3] “The phone number at the CDSI is not 555-111-2222!”
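The definition of text is not shown above. A minimal sketch, assuming the vector was built with c() and that the second element has two leading spaces and one trailing space (inferred from its reported length of 27 and from the substring outputs later in this guide):

```r
# Assumed reconstruction of the example vector; the exact whitespace in the
# second element is inferred from the printed outputs, not taken from the source.
text <- c(
  "CDSI - Computing & Data Systems Initiative at McGill",
  "  Phew! It's getting cold! ",
  "The phone number at the CDSI is not 555-111-2222!"
)
text
```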
Length of text
nchar(text)
[1] 52 27 49
Deleting whitespace
library(stringr)
str_trim(text)
[1] “CDSI - Computing & Data Systems Initiative at McGill”
[2] “Phew! It’s getting cold!”
[3] “The phone number at the CDSI is not 555-111-2222!”
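Trimming only touches the ends of a string. As a small sketch on a hypothetical string with extra spaces inside it, str_trim() can be limited to one side, and str_squish() also collapses repeated inner whitespace:

```r
library(stringr)

# Hypothetical string with padding at both ends and a run of inner spaces.
x <- "  Phew!   It's getting cold! "

trimmed      <- str_trim(x)                 # strips both ends, keeps inner spaces
trimmed_left <- str_trim(x, side = "left")  # strips the start only
squished     <- str_squish(x)               # strips ends and collapses inner runs

trimmed
trimmed_left
squished
```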
Adding whitespace
str_pad(text, 52, "right")
[1] “CDSI - Computing & Data Systems Initiative at McGill”
[2] “ Phew! It’s getting cold! “
[3] “The phone number at the CDSI is not 555-111-2222! “
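Padding is not limited to spaces on the right. A short sketch with hypothetical ID values, using the side and pad arguments of str_pad():

```r
library(stringr)

ids <- c("7", "42", "1234")  # hypothetical values

# Left-pad with zeros to a fixed width of 5.
padded_left <- str_pad(ids, width = 5, side = "left", pad = "0")

# Pad on both sides with a custom character to a width of 6.
padded_both <- str_pad(ids, width = 6, side = "both", pad = "*")

padded_left   # "00007" "00042" "01234"
padded_both
```

Strings that are already at least as long as the requested width are returned unchanged, which is why the first element in the output above is not altered.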
Getting the sort order (positions)
str_order(text)
[1] 2 1 3
Sorting the text itself
str_sort(text)
[1] “ Phew! It’s getting cold! “
[2] “CDSI - Computing & Data Systems Initiative at McGill”
[3] “The phone number at the CDSI is not 555-111-2222!”
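The two functions are related: indexing a vector by its str_order() result gives the same thing as str_sort(). A minimal sketch with hypothetical file names, also showing the decreasing and numeric arguments:

```r
library(stringr)

files <- c("file10", "file2", "file1")  # hypothetical values

# str_order() returns positions; str_sort() returns the sorted strings.
ord <- str_order(files)
same <- identical(files[ord], str_sort(files))   # TRUE

# Reverse order, and "natural" ordering of embedded numbers.
desc    <- str_sort(files, decreasing = TRUE)
natural <- str_sort(files, numeric = TRUE)       # "file1" "file2" "file10"

same
desc
natural
```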
Truncating text
str_trunc(text, 40)
[1] “CDSI - Computing & Data Systems Initi…”
[2] “ Phew! It’s getting cold! “
[3] “The phone number at the CDSI is not 5…”
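The width passed to str_trunc() includes the ellipsis, so a width of 40 keeps 37 characters of text. A small sketch on a hypothetical string, showing the side and ellipsis arguments:

```r
library(stringr)

x <- "The quick brown fox"  # hypothetical string, 19 characters

right  <- str_trunc(x, 10)                   # "The qui..."
left   <- str_trunc(x, 10, side = "left")    # "...own fox"
center <- str_trunc(x, 10, side = "center")  # keeps both ends, cuts the middle
custom <- str_trunc(x, 10, ellipsis = "~")   # "The quick~"

right
left
center
custom
```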
Adding line breaks
str_wrap(text, 40)
[1] “CDSI - Computing & Data Systems\nInitiative at McGill”
[2] “Phew! It’s getting cold!”
[3] “The phone number at the CDSI is not\n555-111-2222!”
Rendering escape characters with cat()
cat(str_wrap(text, 40))
CDSI - Computing & Data Systems
Initiative at McGill Phew! It’s getting cold! The phone number at the CDSI is not
555-111-2222!
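By default cat() joins the elements of a vector with a space, which is why the three strings run together above. A minimal sketch with hypothetical strings, comparing print(), cat(), and writeLines():

```r
# Hypothetical strings; the first contains an embedded line break.
wrapped <- c("first line\nsecond line", "another element")

print(wrapped)             # shows the escape sequence \n literally
cat(wrapped)               # renders \n, joins elements with a space
cat("\n")
cat(wrapped, sep = "\n")   # renders \n, one element per line
cat("\n")
writeLines(wrapped)        # one element per line, with a final newline
```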
Converting to lowercase
str_to_lower(text)
[1] “cdsi - computing & data systems initiative at mcgill”
[2] “ phew! it’s getting cold! “
[3] “the phone number at the cdsi is not 555-111-2222!”
Converting to title case
str_to_title(text)
[1] “Cdsi - Computing & Data Systems Initiative At Mcgill”
[2] “ Phew! It’s Getting Cold! “
[3] “The Phone Number At The Cdsi Is Not 555-111-2222!”
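Title case lowercases everything after the first letter of each word, so acronyms such as CDSI become Cdsi. A short sketch on a hypothetical string, alongside the other case helpers in stringr:

```r
library(stringr)

x <- "the CDSI at McGill"  # hypothetical string

upper    <- str_to_upper(x)      # "THE CDSI AT MCGILL"
sentence <- str_to_sentence(x)   # "The cdsi at mcgill"
title    <- str_to_title(x)      # "The Cdsi At Mcgill"

upper
sentence
title
```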
Combining text
text1 <- "A"
text2 <- "number"
paste(text1, text2)
[1] “A number”
Defining the separator
paste(text1, text2, sep = "")
[1] “Anumber”
paste0(text1, text2)
[1] “Anumber”
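To join the elements of a single vector into one string, use collapse rather than sep. A minimal sketch with hypothetical values, also showing how stringr's str_c() differs from paste() on missing values:

```r
library(stringr)

words <- c("A", "phone", "number")  # hypothetical values

joined_base    <- paste(words, collapse = " ")   # "A phone number"
joined_stringr <- str_c(words, collapse = " ")   # "A phone number"

# paste() turns NA into the text "NA"; str_c() propagates the missing value.
na_base    <- paste("A", NA)   # "A NA"
na_stringr <- str_c("A", NA)   # NA

joined_base
na_base
na_stringr
```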
Extracting text
str_sub(text, 1, 9)
[1] “CDSI - Co” “ Phew! I” “The phone”
Extracting text from the right-hand side
str_sub(text, -4, -1)
[1] “Gill” “ld! “ “222!”
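Positions are not always known in advance. As a sketch, assuming the phone number follows the 3-3-4 digit pattern seen in the example, a regular expression with str_extract() can find it, and the assignment form of str_sub() can then overwrite part of it:

```r
library(stringr)

x <- "The phone number at the CDSI is not 555-111-2222!"

# First match of a phone-number-like pattern (the pattern is an assumption).
phone <- str_extract(x, "\\d{3}-\\d{3}-\\d{4}")   # "555-111-2222"

# Replace the first seven characters in place.
masked <- phone
str_sub(masked, 1, 7) <- "XXX-XXX"                 # "XXX-XXX-2222"

phone
masked
```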
Splitting text
str_split(text, boundary("word"))
[[1]]
[1] “CDSI” “Computing” “Data” “Systems” “Initiative”
[6] “at” “McGill”
[[2]]
[1] “Phew” “It’s” “getting” “cold”
[[3]]
[1] “The” “phone” “number” “at” “the” “CDSI” “is” “not”
[9] “555” “111” “2222”
Splitting text by ‘at’
str_split(text[1], "at")
[[1]]
[1] “CDSI - Computing & D” “a Systems Initi” “ive “
[4] “ McGill”
str_split_1(text[1], "at")
[1] “CDSI - Computing & D” “a Systems Initi” “ive “
[4] “ McGill”
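str_split() returns a list with one character vector per input string, while str_split_1() takes a single string and returns a plain character vector. A short sketch with hypothetical values, showing the simplify and n arguments and how fixed() turns off regular-expression matching:

```r
library(stringr)

x <- c("a-b", "c-d-e")  # hypothetical values

as_list   <- str_split(x, "-")                    # list of character vectors
as_matrix <- str_split(x, "-", simplify = TRUE)   # character matrix, padded with ""
two_parts <- str_split("c-d-e", "-", n = 2)       # at most two pieces: "c" "d-e"
one_vec   <- str_split_1("c-d-e", "-")            # input must be a single string

# The pattern is a regular expression by default, so a literal "." needs fixed().
on_dot <- str_split("a.b", fixed("."))

as_list
as_matrix
two_parts
one_vec
on_dot
```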
Importing text from a PDF/image
library(tesseract)
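The rest of this example is not shown in the source. A minimal sketch of OCR with tesseract, assuming an English-language scan stored at the hypothetical path data/scan.png. PDF pages are typically rendered to images first, for example with pdftools::pdf_convert().

```r
library(tesseract)

# Hypothetical path; replace with your own scanned image.
image_path <- "data/scan.png"

# Only run OCR when the file is actually there.
if (file.exists(image_path)) {
  eng <- tesseract("eng")                     # English OCR engine
  ocr_text <- ocr(image_path, engine = eng)   # one string per image
  cat(ocr_text)
}
```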
Importing with readtext
library(readtext)
Reading a full directory of text files
text <- readtext("data/")
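readtext() returns a data frame with one row per document, holding the file name in doc_id and the contents in text. A self-contained sketch that builds a throwaway directory with two hypothetical files and reads them with a wildcard pattern:

```r
library(readtext)

# Build a temporary directory with two small text files (hypothetical content).
demo_dir <- file.path(tempdir(), "readtext_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("First document.", file.path(demo_dir, "a.txt"))
writeLines("Second document.", file.path(demo_dir, "b.txt"))

# Read every .txt file in the directory.
docs <- readtext(file.path(demo_dir, "*.txt"))

names(docs)   # includes "doc_id" and "text"
nrow(docs)    # one row per file
```

The text column is an ordinary character vector, so the stringr functions above can be applied to docs$text directly.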
Data Science, R, Text — Oct 19, 2023