Text Analysis

Text analysis involves collecting and organizing unstructured text in a way that will allow the discovery of patterns and other insights. 

 

Basic forms of text analysis might include finding word frequencies, collocation (words commonly appearing near each other), concordance (the contexts of a word or set of words), N-grams (common two-, three-, or more-word phrases), entity recognition (identifying names, places, time periods, etc.), and dictionary tagging (locating a specific set of words in the texts).
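
Several of these basic techniques can be sketched with nothing but the Python standard library. The example below is a minimal illustration, not a recommended workflow: the sample sentence, the simple regular-expression tokenizer, and the helper names (tokenize, ngrams, concordance) are all assumptions made for this sketch. Packages such as NLTK or spaCy offer more robust versions of each step.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok.upper()] + right))
    return lines


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The cat saw the dog, and the dog saw the cat."

tokens = tokenize(sample)
freqs = Counter(tokens)                     # word frequencies
bigram_counts = Counter(ngrams(tokens, 2))  # two-word phrases (bigrams)
kwic = concordance(tokens, "dog", width=2)  # keyword in context

print(freqs.most_common(3))
print(bigram_counts.most_common(2))
for line in kwic:
    print(line)
```

Here the word counts and bigram counts come from the same token list, and the concordance simply prints each match with its neighboring words. On real texts the tokenizer is usually the first piece to replace, since punctuation, hyphenation, and non-English characters need more careful handling.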

 

Higher-level uses for text analysis can include document categorization, corpora comparison, language use over time, detecting clusters of document features (e.g., topic modeling), entity recognition, and visualization.
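
As one small illustration of corpora comparison, the sketch below ranks words by how much more often they appear (as a share of all words) in one text than in another. It is a simplified approach under stated assumptions: the two tiny "corpora" and the function names are hypothetical, and real comparisons typically use larger collections and statistical measures such as log-likelihood or TF-IDF.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all word tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}


def frequency_differences(text_a, text_b):
    """Rank words by (share in text_a) minus (share in text_b), largest first."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    vocabulary = set(freq_a) | set(freq_b)
    diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocabulary}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical miniature corpora, used only for illustration.
corpus_a = "rain rain wind sun"
corpus_b = "sun sun sun rain"

ranked = frequency_differences(corpus_a, corpus_b)
for word, diff in ranked:
    print(f"{word}: {diff:+.2f}")
```

Words at the top of the ranking are relatively more prominent in the first text, and words at the bottom are more prominent in the second. Using shares rather than raw counts keeps texts of different lengths comparable.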

 

Sources for Further Information

Tools and Software

  • Python - there are multiple packages available in Python designed for text analysis, including NLTK, spaCy, scikit-learn, and more
  • R - there are also multiple packages in R designed for text analysis, including OpenNLP and tidytext
  • Voyant Tools - a web-based text reading and analysis environment. It is designed to be an easy-to-use way to work with a single text or a collection of texts in a variety of formats, including plain text, HTML, XML, PDF, and MS Word (Getting Started with Voyant Tools)
  • TAPoR 3 - discover research tools for studying texts
  • NVivo - can cluster based on text and can also produce phrase nets and tag clouds (access for free through Yale)
  • VOSviewer - construct and visualize bibliometric networks with this free downloadable software

Tutorials