Package: corpustools 0.5.1

corpustools: Managing, Querying and Analyzing Tokenized Text

Provides text analysis in R, focusing on the use of a tokenized text format. In this format, the positions of tokens are maintained, and each token can be annotated (e.g., part-of-speech tags, dependency relations). Prominent features include advanced Lucene-like querying for specific tokens or contexts (e.g., documents, sentences), similarity statistics for words and documents, exporting to DTM for compatibility with many text analysis packages, and the possibility to reconstruct original text from tokens to facilitate interpretation.

Authors: Kasper Welbers and Wouter van Atteveldt

# Install 'corpustools' in R:

install.packages('corpustools', repos = c('https://kasperwelbers.r-universe.dev', 'https://cloud.r-project.org'))
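
Once the package is installed, the features named in the description (tokenized corpus, Lucene-like querying, DTM export, text reconstruction) can be tried in a short session. The following is a minimal sketch, not taken from the package documentation: the two example sentences and the document ids `d1` and `d2` are made up for illustration, and the argument names (`doc_id`, `query`, `feature`) are assumed to match the exported functions listed on this page for version 0.5.1.

```r
library(corpustools)

# Two toy documents (hypothetical example data, not shipped with the package)
texts <- c("The cat sat on the mat.", "A dog chased the cat.")

# Tokenize the texts into a tCorpus; the doc_id values are illustrative
tc <- create_tcorpus(texts, doc_id = c("d1", "d2"))

# Lucene-like query: find tokens that start with "cat"
hits <- search_features(tc, query = "cat*")
print(hits$hits)

# Export to a document-term matrix for use with other text analysis packages
dtm <- get_dtm(tc, feature = "token")
print(dim(dtm))

# Reconstruct the texts from the tokens to help interpret results
print(untokenize(tc))
```

For larger experiments, the bundled `sotu_texts` dataset can be used in place of the toy sentences.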

Bug tracker: https://github.com/kasperwelbers/corpustools/issues

Uses libs:

c++ – GNU Standard C++ Library v3

Datasets:

corenlp_tokens - CoreNLP example sentences
sotu_texts - State of the Union addresses
stopwords_list - Basic stopword lists
tc_sotu_udpipe - A tCorpus with a small sample of sotu paragraphs parsed with udpipe

On CRAN:

cpp

7.50 score, 31 stars, 1 package, 174 scripts, 2.0k downloads, 50 exports, 43 dependencies

Last updated 6 months ago from: 7610e29d57. Checks: 1 OK, 11 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 04 2025
R-4.5-win-x86_64	NOTE	Mar 04 2025
R-4.5-mac-x86_64	NOTE	Mar 04 2025
R-4.5-mac-aarch64	NOTE	Mar 04 2025
R-4.5-linux-x86_64	NOTE	Mar 04 2025
R-4.4-win-x86_64	NOTE	Mar 04 2025
R-4.4-mac-x86_64	NOTE	Mar 04 2025
R-4.4-mac-aarch64	NOTE	Mar 04 2025
R-4.4-linux-x86_64	NOTE	Mar 04 2025
R-4.3-win-x86_64	NOTE	Mar 04 2025
R-4.3-mac-x86_64	NOTE	Mar 04 2025
R-4.3-mac-aarch64	NOTE	Mar 04 2025

Exports: agg_label agg_tcorpus aggregate_rsyntax as.tcorpus backbone_filter browse_hits browse_texts compare_corpus compare_documents compare_subset count_tcorpus create_tcorpus docfreq_filter dtm_wordcloud ego_semnet export_span_annotations feature_associations feature_stats fold_rsyntax freq_filter get_dfm get_dtm get_kwic get_stopwords laplace melt_quanteda_dict merge_tcorpora plot_semnet plot_words preprocess_tokens refresh_tcorpus search_contexts search_dictionary search_features semnet semnet_window set_network_attributes show_udpipe_models subset_query tc_plot_tree tCorpus tokens_to_tcorpus top_features transform_rsyntax udpipe_clause_tqueries udpipe_quote_tqueries udpipe_simplify udpipe_spanquote_tqueries udpipe_tcorpus untokenize

Dependencies: base64enc cli colorspace cpp11 data.table digest farver fastmatch glue igraph ISOcodes jsonlite labeling lattice lifecycle magrittr Matrix munsell pbapply pkgconfig png quanteda R6 RColorBrewer Rcpp RcppEigen RcppProgress rlang RNewsflow rsyntax scales SnowballC stopwords stringi tidyselect tokenbrowser udpipe vctrs viridisLite withr wordcloud xml2 yaml

corpustools: Managing, Querying and Analyzing Tokenized Text

by Kasper Welbers and Wouter van Atteveldt

Rendered from corpustools.Rmd using knitr::rmarkdown on Mar 04 2025.

Last update: 2021-05-25
Started: 2019-08-15

Help page	Topics
Choose and add multitoken strings based on multitoken categories	add_multitoken_label
Helper function for aggregate_rsyntax	agg_label
Aggregate the tokens data	agg_tcorpus
Aggregate rsyntax annotations	aggregate_rsyntax
Force an object to be a tCorpus class	as.tcorpus
Force an object to be a tCorpus class	as.tcorpus.default
Force an object to be a tCorpus class	as.tcorpus.tCorpus
Extract the backbone of a network.	backbone_filter
View hits in a browser	browse_hits
Create and view a full text browser	browse_texts
Vectorized computation of chi^2 statistic for a 2x2 crosstab containing the values [a, b] [c, d]	calc_chi2
Compare tCorpus vocabulary to that of another (reference) tCorpus	compare_corpus
Calculate the similarity of documents	compare_documents
Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus	compare_subset
CoreNLP example sentences	corenlp_tokens
Count results of search hits, or of a given feature in tokens	count_tcorpus
Create a tCorpus	create_tcorpus create_tcorpus.character create_tcorpus.corpus create_tcorpus.data.frame create_tcorpus.factor
Support function for subset method	docfreq_filter
Compare two document term matrices	dtm_compare
Plot a word cloud from a dtm	dtm_wordcloud
Create an ego network	ego_semnet
Export span annotations	export_span_annotations
Get common nearby features given a query or query hits	feature_associations
Feature statistics	feature_stats
Fold rsyntax annotations	fold_rsyntax
Support function for subset method	freq_filter
Create a document term matrix.	get_dfm get_dtm
Compute global feature positions	get_global_i
Get keyword-in-context (KWIC) strings	get_kwic
Get a character vector of stopwords	get_stopwords
Laplace (i.e. add constant) smoothing	laplace
Convert a quanteda dictionary to a long data.table format	melt_quanteda_dict
Merge tCorpus objects	merge_tcorpora
Visualize a semnet network	plot_semnet
Plot a wordcloud with words ordered and coloured according to a dimension (x)	plot_words
S3 plot for contextHits class	plot.contextHits
Visualize feature associations	plot.featureAssociations
S3 plot for featureHits class	plot.featureHits
Visualize vocabularyComparison	plot.vocabularyComparison
Preprocess tokens in a character vector	preprocess_tokens
S3 print for contextHits class	print.contextHits
S3 print for featureHits class	print.featureHits
S3 print for tCorpus class	print.tCorpus
Refresh a tCorpus object using the current version of corpustools	refresh_tcorpus
Check if package with given version exists	require_package
Search for documents or sentences using Boolean queries	search_contexts
Dictionary lookup	search_dictionary
Find tokens using a Lucene-like search query	search_features
Create a semantic network based on the co-occurrence of tokens in documents	semnet
Create a semantic network based on the co-occurrence of tokens in token windows	semnet_window
Set some default network attributes for pretty plotting	set_network_attributes
Simple Good Turing smoothing	sgt
Show the names of udpipe models	show_udpipe_models
State of the Union addresses	sotu_texts
Basic stopword lists	stopwords_list
Subset tCorpus token data using a query	subset_query
S3 subset for tCorpus class	subset.tCorpus
S3 summary for contextHits class	summary.contextHits
S3 summary for featureHits class	summary.featureHits
Summary of a tCorpus object	summary.tCorpus
Visualize a dependency tree	tc_plot_tree
A tCorpus with a small sample of sotu paragraphs parsed with udpipe	tc_sotu_udpipe
tCorpus: a corpus class for tokenized texts	tCorpus tcorpus
Corpus comparison	tCorpus_compare
Creating a tCorpus	tCorpus_create
Methods and functions for viewing, modifying and subsetting tCorpus data	tCorpus_data
Document similarity	tCorpus_docsim
Preprocessing, subsetting and analyzing features	tCorpus_features
Modify tCorpus by reference	tCorpus_modify_by_reference
Use Boolean queries to analyze the tCorpus	tCorpus_querying
Feature co-occurrence based semantic network analysis	tCorpus_semnet
Topic modeling	tCorpus_topmod
Annotate tokens based on rsyntax queries	annotate_rsyntax tCorpus$annotate_rsyntax
Dictionary lookup	code_dictionary tCorpus$code_dictionary
Code features in a tCorpus based on a search string	code_features tCorpus$code_features
Get a context vector	context tCorpus$context
Deduplicate documents	deduplicate tCorpus$deduplicate
Delete column from the data and meta data	delete_columns delete_meta_columns tCorpus$delete_columns tCorpus$delete_meta_columns
Cast the "feats" column in UDpipe tokens to columns	feats_to_columms tCorpus$feats_to_columns
Filter features	feature_subset tCorpus$feature_subset
Fold rsyntax annotations	tCorpus$fold_rsyntax
Access the data from a tCorpus	get get_meta tCorpus$get tCorpus$get_meta
Estimate an LDA topic model	lda_fit tCorpus$lda_fit
Merge the token and meta data.tables of a tCorpus with another data.frame	merge merge_meta tCorpus$merge
Preprocess feature	preprocess tCorpus$preprocess
Replace tokens with dictionary match	replace_dictionary tCorpus$replace_dictionary
Recode features in a tCorpus based on a search string	search_recode tCorpus$search_recode
Modify the token and meta data.tables of a tCorpus	set set_meta tCorpus$set tCorpus$set_meta
Change levels of factor columns	set_levels set_meta_levels tCorpus$set_levels tCorpus$set_meta_levels
Change column names of data and meta data	set_meta_name set_name tCorpus$set_meta_name tCorpus$set_name
Subset a tCorpus	subset subset_meta tCorpus$subset tCorpus$subset_meta
Subset tCorpus token data using a query	tCorpus$subset_query
Add columns indicating who did what	tCorpus$udpipe_clauses udpipe_clauses
Add columns indicating who said what	tCorpus$udpipe_quotes udpipe_quotes
Create a tcorpus based on tokens (i.e. preprocessed texts)	tokens_to_tcorpus
Gives the window in which a term occurred in a matrix.	tokenWindowOccurence
Show top features	top_features
Apply rsyntax transformations	transform_rsyntax
Get a list of tqueries for extracting who did what	udpipe_clause_tqueries
Get a list of tqueries for extracting quotes	udpipe_quote_tqueries
Simplify tokenIndex created with the udpipe parser	udpipe_simplify
Get a list of tqueries for finding candidates for span quotes.	udpipe_spanquote_tqueries
Create a tCorpus using udpipe	udpipe_tcorpus udpipe_tcorpus.character udpipe_tcorpus.corpus udpipe_tcorpus.data.frame udpipe_tcorpus.factor
Reconstruct original texts	untokenize
