Package: corpustools 0.5.1

corpustools: Managing, Querying and Analyzing Tokenized Text

Provides text analysis in R, focusing on the use of a tokenized text format. In this format, the positions of tokens are maintained, and each token can be annotated (e.g., part-of-speech tags, dependency relations). Prominent features include advanced Lucene-like querying for specific tokens or contexts (e.g., documents, sentences), similarity statistics for words and documents, exporting to DTM for compatibility with many text analysis packages, and the possibility to reconstruct original text from tokens to facilitate interpretation.

Authors: Kasper Welbers and Wouter van Atteveldt

corpustools_0.5.1.tar.gz
corpustools_0.5.1.zip (r-4.5) | corpustools_0.5.1.zip (r-4.4) | corpustools_0.5.1.zip (r-4.3)
corpustools_0.5.1.tgz (r-4.4-x86_64) | corpustools_0.5.1.tgz (r-4.4-arm64) | corpustools_0.5.1.tgz (r-4.3-x86_64) | corpustools_0.5.1.tgz (r-4.3-arm64)
corpustools_0.5.1.tar.gz (r-4.5-noble) | corpustools_0.5.1.tar.gz (r-4.4-noble)
corpustools_0.5.1.tgz (r-4.4-emscripten) | corpustools_0.5.1.tgz (r-4.3-emscripten)
corpustools.pdf | corpustools.html
corpustools/json (API)
NEWS

# Install 'corpustools' in R:
install.packages('corpustools', repos = c('https://kasperwelbers.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/kasperwelbers/corpustools/issues

Uses libs:
  • c++: GNU Standard C++ Library v3

7.37 score 29 stars 1 packages 174 scripts 1.5k downloads 50 exports 43 dependencies

Last updated 2 months ago from: 7610e29d57. Checks: OK: 1, NOTE: 8. Indexed: yes.

Target | Result | Date
Doc / Vignettes | OK | Nov 04 2024
R-4.5-win-x86_64 | NOTE | Nov 04 2024
R-4.5-linux-x86_64 | NOTE | Nov 04 2024
R-4.4-win-x86_64 | NOTE | Nov 04 2024
R-4.4-mac-x86_64 | NOTE | Nov 04 2024
R-4.4-mac-aarch64 | NOTE | Nov 04 2024
R-4.3-win-x86_64 | NOTE | Nov 04 2024
R-4.3-mac-x86_64 | NOTE | Nov 04 2024
R-4.3-mac-aarch64 | NOTE | Nov 04 2024

Exports: agg_label, agg_tcorpus, aggregate_rsyntax, as.tcorpus, backbone_filter, browse_hits, browse_texts, compare_corpus, compare_documents, compare_subset, count_tcorpus, create_tcorpus, docfreq_filter, dtm_wordcloud, ego_semnet, export_span_annotations, feature_associations, feature_stats, fold_rsyntax, freq_filter, get_dfm, get_dtm, get_kwic, get_stopwords, laplace, melt_quanteda_dict, merge_tcorpora, plot_semnet, plot_words, preprocess_tokens, refresh_tcorpus, search_contexts, search_dictionary, search_features, semnet, semnet_window, set_network_attributes, show_udpipe_models, subset_query, tc_plot_tree, tCorpus, tokens_to_tcorpus, top_features, transform_rsyntax, udpipe_clause_tqueries, udpipe_quote_tqueries, udpipe_simplify, udpipe_spanquote_tqueries, udpipe_tcorpus, untokenize

Dependencies: base64enc, cli, colorspace, cpp11, data.table, digest, farver, fastmatch, glue, igraph, ISOcodes, jsonlite, labeling, lattice, lifecycle, magrittr, Matrix, munsell, pbapply, pkgconfig, png, quanteda, R6, RColorBrewer, Rcpp, RcppEigen, RcppProgress, rlang, RNewsflow, rsyntax, scales, SnowballC, stopwords, stringi, tidyselect, tokenbrowser, udpipe, vctrs, viridisLite, withr, wordcloud, xml2, yaml

corpustools: Managing, Querying and Analyzing Tokenized Text

Rendered from corpustools.Rmd using knitr::rmarkdown on Nov 04 2024.

Last update: 2021-05-25
Started: 2019-08-15

Readme and manuals

Help Manual

Help page | Topics
Choose and add multitoken strings based on multitoken categories | add_multitoken_label
Helper function for aggregate_rsyntax | agg_label
Aggregate the tokens data | agg_tcorpus
Aggregate rsyntax annotations | aggregate_rsyntax
Force an object to be a tCorpus class | as.tcorpus
Force an object to be a tCorpus class | as.tcorpus.default
Force an object to be a tCorpus class | as.tcorpus.tCorpus
Extract the backbone of a network. | backbone_filter
View hits in a browser | browse_hits
Create and view a full text browser | browse_texts
Vectorized computation of chi^2 statistic for a 2x2 crosstab containing the values [a, b] [c, d] | calc_chi2
Compare tCorpus vocabulary to that of another (reference) tCorpus | compare_corpus
Calculate the similarity of documents | compare_documents
Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus | compare_subset
coreNLP example sentences | corenlp_tokens
Count results of search hits, or of a given feature in tokens | count_tcorpus
Create a tCorpus | create_tcorpus create_tcorpus.character create_tcorpus.corpus create_tcorpus.data.frame create_tcorpus.factor
Support function for subset method | docfreq_filter
Compare two document term matrices | dtm_compare
Plot a word cloud from a dtm | dtm_wordcloud
Create an ego network | ego_semnet
Export span annotations | export_span_annotations
Get common nearby features given a query or query hits | feature_associations
Feature statistics | feature_stats
Fold rsyntax annotations | fold_rsyntax
Support function for subset method | freq_filter
Create a document term matrix. | get_dfm get_dtm
Compute global feature positions | get_global_i
Get keyword-in-context (KWIC) strings | get_kwic
Get a character vector of stopwords | get_stopwords
Laplace (i.e. add constant) smoothing | laplace
Convert a quanteda dictionary to a long data.table format | melt_quanteda_dict
Merge tCorpus objects | merge_tcorpora
Visualize a semnet network | plot_semnet
Plot a wordcloud with words ordered and coloured according to a dimension (x) | plot_words
S3 plot for contextHits class | plot.contextHits
Visualize feature associations | plot.featureAssociations
S3 plot for featureHits class | plot.featureHits
Visualize vocabularyComparison | plot.vocabularyComparison
Preprocess tokens in a character vector | preprocess_tokens
S3 print for contextHits class | print.contextHits
S3 print for featureHits class | print.featureHits
S3 print for tCorpus class | print.tCorpus
Refresh a tCorpus object using the current version of corpustools | refresh_tcorpus
Check if package with given version exists | require_package
Search for documents or sentences using Boolean queries | search_contexts
Dictionary lookup | search_dictionary
Find tokens using a Lucene-like search query | search_features
Create a semantic network based on the co-occurrence of tokens in documents | semnet
Create a semantic network based on the co-occurrence of tokens in token windows | semnet_window
Set some default network attributes for pretty plotting | set_network_attributes
Simple Good Turing smoothing | sgt
Show the names of udpipe models | show_udpipe_models
State of the Union addresses | sotu_texts
Basic stopword lists | stopwords_list
Subset tCorpus token data using a query | subset_query
S3 subset for tCorpus class | subset.tCorpus
S3 summary for contextHits class | summary.contextHits
S3 summary for featureHits class | summary.featureHits
Summary of a tCorpus object | summary.tCorpus
Visualize a dependency tree | tc_plot_tree
A tCorpus with a small sample of sotu paragraphs parsed with udpipe | tc_sotu_udpipe
tCorpus: a corpus class for tokenized texts | tCorpus tcorpus
Corpus comparison | tCorpus_compare
Creating a tCorpus | tCorpus_create
Methods and functions for viewing, modifying and subsetting tCorpus data | tCorpus_data
Document similarity | tCorpus_docsim
Preprocessing, subsetting and analyzing features | tCorpus_features
Modify tCorpus by reference | tCorpus_modify_by_reference
Use Boolean queries to analyze the tCorpus | tCorpus_querying
Feature co-occurrence based semantic network analysis | tCorpus_semnet
Topic modeling | tCorpus_topmod
Annotate tokens based on rsyntax queries | annotate_rsyntax tCorpus$annotate_rsyntax
Dictionary lookup | code_dictionary tCorpus$code_dictionary
Code features in a tCorpus based on a search string | code_features tCorpus$code_features
Get a context vector | context tCorpus$context
Deduplicate documents | deduplicate tCorpus$deduplicate
Delete column from the data and meta data | delete_columns delete_meta_columns tCorpus$delete_columns tCorpus$delete_meta_columns
Cast the "feats" column in UDpipe tokens to columns | feats_to_columms tCorpus$feats_to_columns
Filter features | feature_subset tCorpus$feature_subset
Fold rsyntax annotations | tCorpus$fold_rsyntax
Access the data from a tCorpus | get get_meta tCorpus$get tCorpus$get_meta
Estimate an LDA topic model | lda_fit tCorpus$lda_fit
Merge the token and meta data.tables of a tCorpus with another data.frame | merge merge_meta tCorpus$merge
Preprocess feature | preprocess tCorpus$preprocess
Replace tokens with dictionary match | replace_dictionary tCorpus$replace_dictionary
Recode features in a tCorpus based on a search string | search_recode tCorpus$search_recode
Modify the token and meta data.tables of a tCorpus | set set_meta tCorpus$set tCorpus$set_meta
Change levels of factor columns | set_levels set_meta_levels tCorpus$set_levels tCorpus$set_meta_levels
Change column names of data and meta data | set_meta_name set_name tCorpus$set_meta_name tCorpus$set_name
Subset a tCorpus | subset subset_meta tCorpus$subset tCorpus$subset_meta
Subset tCorpus token data using a query | tCorpus$subset_query
Add columns indicating who did what | tCorpus$udpipe_clauses udpipe_clauses
Add columns indicating who said what | tCorpus$udpipe_quotes udpipe_quotes
Create a tcorpus based on tokens (i.e. preprocessed texts) | tokens_to_tcorpus
Gives the window in which a term occurred in a matrix. | tokenWindowOccurence
Show top features | top_features
Apply rsyntax transformations | transform_rsyntax
Get a list of tqueries for extracting who did what | udpipe_clause_tqueries
Get a list of tqueries for extracting quotes | udpipe_quote_tqueries
Simplify tokenIndex created with the udpipe parser | udpipe_simplify
Get a list of tqueries for finding candidates for span quotes. | udpipe_spanquote_tqueries
Create a tCorpus using udpipe | udpipe_tcorpus udpipe_tcorpus.character udpipe_tcorpus.corpus udpipe_tcorpus.data.frame udpipe_tcorpus.factor
Reconstruct original texts | untokenize