Package 'RNewsflow'

Title: Tools for Comparing Text Messages Across Time and Media
Description: A collection of tools for measuring the similarity of text messages and tracing the flow of messages over time and across media.
Authors: Kasper Welbers & Wouter van Atteveldt
Maintainer: Kasper Welbers <kasperwelbers@gmail.com>
License: GPL-3
Version: 1.2.8
Built: 2025-02-27 03:31:45 UTC
Source: https://github.com/kasperwelbers/rnewsflow


Create a document similarity network

Description

This function can be used to structure the output of the compare_documents function as an igraph network.

Usage

as_document_network(el)

Arguments

el

An RNewsflow_edgelist object, as created with compare_documents.

Value

A network/graph in the igraph class

Examples

dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
el = compare_documents(dtm, date_var='date', hour_window = c(0.1, 36))

g = as_document_network(el)
g

Compare the documents in a dtm

Description

This function calculates document similarity scores using a vector space approach. Its most important benefit is that it includes options for limiting the number of comparisons that need to be made and for filtering the results, which are efficiently implemented in a custom inner product calculation. This makes it possible to compare a huge number of documents, especially in cases where only documents within a given time window need to be compared.

Usage

compare_documents(
  dtm,
  dtm_y = NULL,
  date_var = NULL,
  hour_window = c(-24, 24),
  group_var = NULL,
  measure = c("cosine", "overlap_pct", "overlap", "dot_product", "softcosine",
    "cp_lookup", "cp_lookup_norm"),
  tf_idf = F,
  min_similarity = 0,
  n_topsim = NULL,
  only_complete_window = T,
  copy_meta = T,
  backbone_p = 1,
  simmat = NULL,
  simmat_thres = NULL,
  batchsize = 1000,
  verbose = FALSE
)

Arguments

dtm

A quanteda dfm. Note that it is common to first weight the dtm(s) before calculating document similarity. For this you can use quanteda's dfm_tfidf and dfm_weight.

dtm_y

Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm_y.

date_var

Optionally, the name of the column in docvars that specifies the document date. The values should be of type POSIXct, or coercible with as.POSIXct. If given, the hour_window argument is used to only compare documents within a time window.

hour_window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. It is possible to specify time windows down to the level of seconds by using fractions (hours / 60 / 60).

group_var

Optionally, the name of the column in docvars that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical soft cosine measure "softcosine" (experimental). The regular dot product ("dot_product") is also supported.

tf_idf

If TRUE, weight the dtm (and dtm_y) by term frequency - inverse document frequency. For more control over weighting, we recommend using quanteda's dfm_tfidf or dfm_weight on dtm and dtm_y.

min_similarity

A threshold for similarity. Lower values are deleted. For all available similarity measures, zero means no similarity.

n_topsim

An alternative or additional sort of threshold for similarity. Only keep the [n_topsim] highest similarity scores for x. Can return more than [n_topsim] similarity scores in the case of duplicate similarities.

only_complete_window

If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x.

copy_meta

If TRUE, copy the dtm docvars to the from_meta and to_meta data.tables

backbone_p

Apply backbone filtering with a "disparity" filter (see Serrano et al., DOI: 10.1073/pnas.0808904106). It is different from the original disparity filter algorithm in that it only looks at outward edges. Also, the outward degree k is measured as all possible edges (within a window), not just the non-zero edges.

simmat

If softcosine is used, a symmetrical matrix with the similarity scores of terms. If NULL, the cosine similarity of terms in dtm will be used

simmat_thres

A large, dense simmat can lead to memory problems and slow down computation. A pragmatic (though not mathematically pure) solution is to use a threshold to prune small similarities.

batchsize

For internal use (testing)

verbose

If TRUE, report progress

Details

By default, the function performs a regular tcrossprod of the dtm (with itself or with dtm_y). The following parameters can be set to limit comparisons and filter output:

  • If 'date_var' is specified, the given hour_window is used to only compare documents within the specified time distance.

  • If the 'group_var' is specified, only documents for which the group is identical will be compared.

  • With the 'min_similarity' argument, the output can be filtered with a minimum similarity threshold. For the inner product of two DTMs, the size of the output matrix is often the main bottleneck for comparing many documents, because it grows with the product of the number of documents in the two DTMs (quadratically when a DTM is compared with itself). Even a low similarity threshold can greatly reduce the size of the output.

  • As an alternative or additional filter, you can use the 'n_topsim' argument to limit the results for each row in dtm to the n_topsim highest similarity scores.
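The window and threshold filters listed above can be illustrated with a small base R sketch. It does not call RNewsflow: the toy matrix `m`, the dates and all variable names are assumptions made for illustration. Note also that the sketch filters a full similarity matrix after the fact, whereas the package applies the filters inside its inner product calculation so that the full matrix is never created.

```r
# Toy document-term matrix: 4 documents (rows), 3 terms (columns)
m <- matrix(c(1, 1, 0,
              1, 1, 0,
              0, 1, 1,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("a", "b", "c")))
date <- as.POSIXct(c("2010-01-01 00:00", "2010-01-01 12:00",
                     "2010-01-03 00:00", "2010-01-01 06:00"), tz = "UTC")

# Cosine similarity: L2-normalize the rows, then take the tcrossprod
m_norm <- m / sqrt(rowSums(m^2))
sim <- tcrossprod(m_norm)

# Hour difference: date of the column document minus date of the row document
hours <- as.numeric(date) / 3600
hourdiff <- outer(hours, hours, function(x, y) y - x)

# Keep only pairs inside the time window and at or above the threshold
hour_window <- c(0.1, 36)
min_similarity <- 0.6
keep <- hourdiff >= hour_window[1] & hourdiff <= hour_window[2] &
  sim >= min_similarity

edges <- data.frame(from = rownames(sim)[row(sim)[keep]],
                    to = colnames(sim)[col(sim)[keep]],
                    weight = sim[keep],
                    hourdiff = hourdiff[keep],
                    stringsAsFactors = FALSE)
edges
```

Of the 16 cells in the full matrix, only three document pairs survive both filters, which is the kind of reduction that makes large comparisons feasible.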

Margin attributes are also included in the output, in the from_meta and to_meta data.tables (see below). If copy_meta = TRUE, the dtm docvars are also included in from_meta and to_meta.

Margin attributes are added to the meta data. The reason for including them is that some values that are normally available in a similarity matrix are missing if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min/max or top_n filter is used, we don't know the true row sums (and thus row means). The meta data therefore includes "row_n", "row_sum", "col_n", and "col_sum". In addition, there are "lag_n" and "lag_sum". This is a special case where row_n and row_sum are calculated only for matches where the column date < row date (lag). This can be used for more refined calculations of edge probabilities before and after a row document.
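As a rough illustration of what these margin attributes represent, the base R sketch below derives them from a toy similarity matrix and a matching matrix of hour differences. The matrices and the `in_window` and `lag` helpers are hypothetical, and the exact way the package computes these values may differ; the point is only that the margins are taken over all compared pairs, before any similarity threshold is applied.

```r
# Toy similarity matrix (rows = documents in dtm, columns = documents in dtm_y)
sim <- matrix(c(0.9, 0.2, 0.0,
                0.1, 0.8, 0.4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("x1", "x2"), c("y1", "y2", "y3")))

# Hour difference: column date minus row date
hourdiff <- matrix(c( -5, 10, 100,
                     -30, -2,   3),
                   nrow = 2, byrow = TRUE, dimnames = dimnames(sim))

in_window <- hourdiff >= -24 & hourdiff <= 24  # pairs that are actually compared
lag <- in_window & hourdiff < 0                # column document dated before row document

from_meta <- data.frame(
  document_id = rownames(sim),
  row_n   = rowSums(in_window),        # number of columns each row was compared to
  row_sum = rowSums(sim * in_window),  # sum of similarities over compared columns
  lag_n   = rowSums(lag),
  lag_sum = rowSums(sim * lag),
  stringsAsFactors = FALSE
)

# With row_n and row_sum available, the true row mean can still be recovered
from_meta$row_mean <- from_meta$row_sum / from_meta$row_n
from_meta
```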

Value

An S3 object of class RNewsflow_edgelist, which is a list with the edgelist, from_meta and to_meta data.tables.

Examples

dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
el = compare_documents(dtm, date_var='date', hour_window = c(0.1, 36))


d = data.frame(text = c('a b c d e', 
                        'e f g h i j k',
                        'a b c'),
               date = as.POSIXct(c('2010-01-01','2010-01-01','2012-01-01')), 
               stringsAsFactors=FALSE)
corp = quanteda::corpus(d, text_field='text')
dtm = quanteda::tokens(corp) |>
  quanteda::dfm()

g = compare_documents(dtm)
g

g = compare_documents(dtm, measure = 'overlap_pct')
g
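To make the difference between a symmetrical and an asymmetrical measure concrete, the following base R sketch scores the first and third toy documents by hand. It does not call RNewsflow. The overlap_pct formula used here (the sum of element-wise minima divided by the sum of the document's own term scores) is an assumption based on the description of the measure argument, not a statement about the package internals.

```r
# Term counts for 'a b c d e' (x) and 'a b c' (y) over the vocabulary a..e
x <- c(a = 1, b = 1, c = 1, d = 1, e = 1)
y <- c(a = 1, b = 1, c = 1, d = 0, e = 0)

# Cosine similarity: identical in both directions
cosine <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

# Assumed overlap_pct: share of u's term scores that also occur in v
overlap_pct <- function(u, v) sum(pmin(u, v)) / sum(u)

cosine(x, y)       # same value as cosine(y, x)
overlap_pct(x, y)  # 3 of the 5 terms in x occur in y
overlap_pct(y, x)  # all 3 terms in y occur in x
```

With an asymmetrical measure, the direction of the comparison matters: the short document is fully contained in the long one, but not the other way around.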

Create a document similarity network

Description

Combines document similarity data (d) with document meta data (meta) into an igraph network/graph.

Usage

create_document_network(
  d,
  meta,
  id_var = "document_id",
  date_var = "date",
  min_similarity = NA
)

Arguments

d

A data.frame with three columns that represents an edgelist with weight values. The first and second column represent the names/ids of the 'from' and 'to' documents/vertices. The third column represents the similarity score. Column names are ignored.

meta

A data.frame where rows are documents and columns are document meta information. Should at least contain 2 columns: the document name/id and date. The name/id column should match the document names/ids of the edgelist, and its label is specified in the 'id_var' argument. The date column should be interpretable with as.POSIXct, and its label is specified in the 'date_var' argument.

id_var

The label for the document name/id column in the 'meta' data.frame. Default is "document_id"

date_var

The label for the document date column in the 'meta' data.frame. Default is "date".

min_similarity

For convenience, ignore all edges where the weight is below 'min_similarity'.

Details

This function is mainly offered to mimic the output of the as_document_network function when using imported document similarity data. This way the functions for transforming, aggregating and visualizing the document similarity data can be used.
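The following base R sketch shows, under stated assumptions, the kind of preparation that combining an imported edgelist with document meta data involves. It does not build an igraph object and does not call RNewsflow. The `hourdiff` column (date of the 'to' document minus date of the 'from' document, in hours) is an assumed edge attribute used here for illustration.

```r
# Imported edgelist: from, to, similarity (column names are arbitrary)
d <- data.frame(x = c(1, 1, 2),
                y = c(2, 3, 3),
                v = c(0.3, 0.4, 0.7))

meta <- data.frame(document_id = 1:3,
                   date = as.POSIXct("2010-01-01 12:00:00", tz = "UTC") +
                     c(0, 2, 5) * 3600)

# Look up the date of each 'from' and 'to' document by its id
from_date <- meta$date[match(d[[1]], meta$document_id)]
to_date   <- meta$date[match(d[[2]], meta$document_id)]

# Assumed edge attribute: hours between the two documents of each pair
d$hourdiff <- as.numeric(difftime(to_date, from_date, units = "hours"))

# Equivalent of min_similarity: drop edges with a weight below the threshold
min_similarity <- 0.35
d_filtered <- d[d[[3]] >= min_similarity, ]
d_filtered
```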

Value

A network/graph in the igraph class

Examples

d = data.frame(x = c(1,1,1,2,2,3),
               y = c(2,3,5,4,5,6),
               v = c(0.3,0.4,0.7,0.5,0.2,0.9))

meta = data.frame(document_id = 1:8,
                  date = seq.POSIXt(from = as.POSIXct('2010-01-01 12:00:00'), 
                         by='hour', length.out = 8),
                  medium = c(rep('Newspapers', 4), rep('Blog', 4)))

g = create_document_network(d, meta)

igraph::get.data.frame(g, 'both')
igraph::plot.igraph(g)

Automatically infer queries from combinations of terms in a dtm

Description

This function was designed for the task of matching short event descriptions to news articles, but can more generally be used for document matching tasks. Note, however, that memory use grows rapidly with the number of unique terms in the dtm (up to quadratically if all terms are combined, see Details), which makes it less suitable for matching larger documents. This only applies to the dtm, not the ref_dtm. Thus, if your goal is to match smaller documents such as event descriptions to news, this function might be useful.

Usage

create_queries(
  dtm,
  ref_dtm = NULL,
  min_docfreq = 2,
  max_docprob = 0.01,
  weight = c("tfidf", "binary"),
  norm_weight = c("max", "doc_max", "dtm_max", "none"),
  min_obs_exp = NA,
  union_sim_thres = NA,
  combine_all = T,
  only_dtm_combs = T,
  use_dtm_and_ref = F,
  verbose = F
)

Arguments

dtm

A quanteda dfm

ref_dtm

Optionally, another quanteda dfm. If given, the ref_dtm will be used to calculate the docfreq/docprob scores.

min_docfreq

The minimum frequency for terms or combinations of terms

max_docprob

The maximum probability (document frequency / N) for terms or combinations of terms

weight

Determine how to weight the queries (if ref_dtm is used, uses the idf of the ref_dtm, or of both the dtm and ref_dtm if use_dtm_and_ref is TRUE). Default is "tfidf", which uses common tf-idf weighting (actually just idf, since scores are binary). "binary" only indicates whether a query does or does not occur.

norm_weight

Normalize the weight score so that the highest value is 1. If "max" is used, max is the highest possible value. "doc_max" uses the highest value within each document, and "dtm_max" uses the highest observed value in the dtm.

min_obs_exp

The minimum ratio of the observed and expected frequency of a term combination

union_sim_thres

If given, a number between 0 and 1, used as the cosine similarity threshold for combining clusters of terms

combine_all

If TRUE (the default in the usage above), combine all terms. If FALSE, terms that are included as unigrams (i.e. that are within the min_docfreq and max_docprob) are not combined with other terms.

only_dtm_combs

Only include term combinations that occur in dtm. This makes sense (and saves a lot of memory) if you are only interested in asymmetric similarity measures based on the query.

use_dtm_and_ref

If a ref_dtm is used, the weight is computed based only on the document frequencies in the ref_dtm. If use_dtm_and_ref is set to TRUE, both the dtm and ref_dtm are used.

verbose

If TRUE, report progress

Details

The main purpose of the function is that it intersects the terms in a dtm to increase sparsity. This can improve certain document matching tasks, but at the cost of creating a bigger dtm. If all terms are combined, this would be a quadratic increase of columns. However, only term combinations that occur in dtm (not ref_dtm) will be used. This is not a problem as long as the similarity of the documents in dtm to documents in dtm_y is calculated as an asymmetric similarity measure (i.e. in which the sum of terms in dtm_y is not used).

To emphasize that this feature preparation step is geared towards the task of 'looking up' documents, we use the terminology of a 'query'. The output of the function is a list of two dtms: query_dtm and ref_dtm. Both dtms have the exact same columns, which contain the query terms. The values in query_dtm are by default tfidf weighted, and the values in ref_dtm are binary.

Several options are given to only create term combinations that are informative. Firstly, a minimum and maximum document frequency of term combinations can be defined. Secondly, a minimum observed/expected ratio can be given. The expected probability of a combination of term A and term B is the joint probability the terms would have if they occurred independently, P(A) * P(B). If the observed probability is not higher, the combination is not more informative than chance. Thirdly, before intersecting terms, one can first cluster very similar terms together as single columns to reduce the number of possible combinations.
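The observed/expected ratio can be worked through by hand on a tiny binary matrix. The base R sketch below is an illustration only: the toy matrix, the threshold value and the ' & ' naming of combinations are assumptions, and the expected probability is taken to be the product of the two term probabilities (independence).

```r
# Binary toy dtm: 6 documents (rows), 3 terms (columns)
m <- matrix(c(1, 1, 0,
              1, 1, 0,
              1, 0, 1,
              0, 0, 1,
              1, 1, 1,
              0, 0, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("A", "B", "C")))
n <- nrow(m)

p_term <- colSums(m) / n        # document probability of each term
p_obs  <- crossprod(m) / n      # observed probability of each term pair
p_exp  <- outer(p_term, p_term) # expected probability under independence
obs_exp <- p_obs / p_exp

# Keep only combinations that occur more often than chance by some margin
min_obs_exp <- 1.2
pairs <- which(upper.tri(obs_exp) & obs_exp >= min_obs_exp, arr.ind = TRUE)
combos <- paste(rownames(obs_exp)[pairs[, 1]],
                colnames(obs_exp)[pairs[, 2]], sep = " & ")
combos
```

Here A and B co-occur in 3 of 6 documents while independence would predict 1 in 3, so only that combination passes the threshold.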

Value

a list with a query dtm and ref_dtm. Designed for use in compare_documents using the special 'query_lookup' measure

Examples

q = create_queries(rnewsflow_dfm, min_docfreq = 2, union_sim_thres = 0.9, 
                    max_docprob = 0.05, verbose = FALSE)
head(colnames(q$query_dtm), 100)

Delete duplicate (or similar) documents from a document term matrix

Description

Delete duplicate (or similar) documents from a document term matrix. Duplicates are defined by having high content similarity, occurring within a given time distance and being published by the same source.

Usage

delete_duplicates(
  dtm,
  date_var = NULL,
  hour_window = c(-24, 24),
  group_var = NULL,
  measure = c("cosine", "overlap_pct"),
  similarity = 1,
  keep = "first",
  tf_idf = FALSE,
  dup_csv = NULL,
  verbose = F
)

Arguments

dtm

A quanteda dfm.

date_var

The name of the column in docvars(dtm) that specifies the document date. The values should be of type POSIXlt or POSIXct

hour_window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.

group_var

Optionally, column name in docvars(dtm) that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity) and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).

similarity

A threshold for similarity. Documents whose similarity is equal to or higher than this threshold are deleted.

keep

A character string indicating whether to keep the 'first' or 'last' published document of each set of duplicates.

tf_idf

If TRUE, weight the dtm with tf_idf before comparing documents. The original (non-weighted) DTM is returned.

dup_csv

Optionally, a path for writing a csv file with the duplicates edgelist. For each duplicate pair it is noted if "from" or "to" is the duplicate, or if "both" are duplicates (of other documents)

verbose

If TRUE, report progress

Details

Note that this can also be used to delete "updates" of articles (e.g., on news sites, news agencies). This should be considered if the temporal order of publications is relevant for the analysis.
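The three conditions in the definition of a duplicate (high similarity, close in time, same source) can be sketched in base R as follows. The data, the column names and the way the later document is picked are assumptions for illustration; the sketch does not call RNewsflow and only mirrors the behaviour described for keep = "first".

```r
docs <- data.frame(id = c("a1", "a2", "b1", "a3"),
                   source = c("A", "A", "B", "A"),
                   date = as.POSIXct("2010-01-01 00:00:00", tz = "UTC") +
                     c(0, 2, 3, 50) * 3600,
                   stringsAsFactors = FALSE)

# Candidate pairs with toy similarity scores
pairs <- data.frame(from = c("a1", "a1", "a1"),
                    to   = c("a2", "b1", "a3"),
                    sim  = c(1, 1, 1),
                    stringsAsFactors = FALSE)

i <- match(pairs$from, docs$id)
j <- match(pairs$to, docs$id)
hourdiff <- as.numeric(difftime(docs$date[j], docs$date[i], units = "hours"))

# A pair is a duplicate only if all three conditions hold
similarity <- 1
hour_window <- c(-24, 24)
is_dup <- pairs$sim >= similarity &
  docs$source[i] == docs$source[j] &
  hourdiff >= hour_window[1] & hourdiff <= hour_window[2]

# keep = "first": of each duplicate pair, drop the document published later
later <- ifelse(hourdiff >= 0, pairs$to, pairs$from)
kept <- docs[!docs$id %in% unique(later[is_dup]), ]
kept
```

Only a2 is removed: b1 comes from another source and a3 falls outside the time window, so neither counts as a duplicate of a1.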

Value

A dtm with the duplicate documents deleted

Examples

## example with very low similarity threshold (normally not recommended!)
dtm2 = delete_duplicates(rnewsflow_dfm, similarity = 0.5, keep='first', tf_idf = TRUE)

A wrapper for plot.igraph for visualizing directed networks.

Description

This is a convenience function for visualizing directed networks with edge labels using plot.igraph. It was designed specifically for visualizing aggregated document similarity networks in the RNewsflow package, but works with any network in the igraph class.

Usage

directed_network_plot(
  g,
  weight_var = "from.Vprop",
  weight_thres = NULL,
  delete_isolates = FALSE,
  vertex.size = 30,
  vertex.color = "lightblue",
  vertex.label.color = "black",
  vertex.label.cex = 0.7,
  edge.color = "grey",
  show.edge.labels = TRUE,
  edge.label.color = "black",
  edge.label.cex = 0.6,
  edge.arrow.size = 1,
  layout = igraph::layout.davidson.harel,
  ...
)

Arguments

g

A network/graph in the igraph class

weight_var

The name of the edge attribute that is used as the edge weight.

weight_thres

A threshold for weight. Edges below the threshold are ignored

delete_isolates

If TRUE, isolates (i.e. vertices without edges) are ignored.

vertex.size

The size of the vertices/nodes. Defaults to 30. Can be a vector with values per vertex.

vertex.color

Color of vertices/nodes. Default is lightblue. Can be a vector with values per vertex.

vertex.label.color

Color of labels for vertices/nodes. Defaults to black. Can be a vector with values per vertex.

vertex.label.cex

Size of the labels for vertices/nodes. Defaults to 0.7. Can be a vector with values per vertex.

edge.color

Color of the edges. Defaults to grey. Can be a vector with values per edge.

show.edge.labels

Logical. Should edge labels be displayed? Default is TRUE.

edge.label.color

Color of the edge labels. Defaults to black. Can be a vector with values per edge.

edge.label.cex

Size of the edge labels. Defaults to 0.6. Can be a vector with values per edge.

edge.arrow.size

Size of the edge arrows. Defaults to 1. Can only be set globally (igraph might update this at some point)

layout

The igraph layout used to plot the network. Defaults to layout.davidson.harel

...

Arguments to be passed to the plot.igraph function.

Value

Nothing

Examples

data(docnet)
aggdocnet = network_aggregate(docnet, by='source')
directed_network_plot(aggdocnet, weight_var = 'to.Vprop', weight_thres = 0.2)

Document similarity network for one news agency, and the print and online editions of two newspapers

Description

Document similarity network for one news agency, and the print and online editions of two newspapers

Format

docnet: A network/graph in the igraph class as created with create_document_network or newsflow_compare.


Visualize (a subcomponent of) the document similarity network

Description

Visualize (a subcomponent of) the document similarity network

Usage

document_network_plot(
  g,
  date_attribute = "date",
  source_attribute = "source",
  subcomp_i = NULL,
  dtm = NULL,
  sources = NULL,
  only_outer_date = FALSE,
  date_format = "%Y-%m-%d %H:%M",
  margins = c(5, 8, 1, 13),
  isolate_color = NULL,
  source_loops = TRUE,
  ...
)

Arguments

g

A document similarity network, as created with newsflow_compare or create_document_network

date_attribute

The label of the vertex/document date attribute. Default is "date"

source_attribute

The label of the vertex/document source attribute. Default is "source"

subcomp_i

Optional. If an integer is given, the network is decomposed into subcomponents (i.e. unconnected components) and only the i-th component is visualized.

dtm

Optional. If a document-term matrix that contains the documents in g is given, a wordcloud with the most common words of the network is added.

sources

Optional. Use a character vector to select only certain sources

only_outer_date

If TRUE, only the labels for the first and last date are reported on the x-axis

date_format

The date format of the date labels (see format.POSIXct)

margins

The margins of the network plot. The four values represent bottom, left, top and right margin.

isolate_color

Optional. Set a custom color for isolates

source_loops

If set to FALSE, all edges between vertices/documents of the same source are ignored.

...

Additional arguments for the network plotting function plot.igraph

Value

Nothing.

Examples

docnet = docnet
dtm = rnewsflow_dfm

docnet_comps = igraph::decompose.graph(docnet) # get subcomponents

# subcomponent 1
document_network_plot(docnet_comps[[1]]) 

# subcomponent 2 with wordcloud
document_network_plot(docnet_comps[[2]], dtm=dtm) 

# subcomponent 3 with additional arguments passed to plot.igraph 
document_network_plot(docnet_comps[[3]], dtm=dtm, vertex.color='red')

Filter edges from the document similarity network based on hour difference

Description

The 'filter_window' function can be used to filter the document pairs (i.e. edges) using the 'hour_window' parameter, which works identically to the 'hour_window' parameter in the 'newsflow_compare' function. In addition, the 'from_vertices' and 'to_vertices' parameters can be used to select the vertices (i.e. documents) for which this filter is applied.

Usage

filter_window(g, hour_window, to_vertices = NULL, from_vertices = NULL)

Arguments

g

A document similarity network, as created with newsflow_compare or create_document_network

hour_window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.

to_vertices

A filter to select the vertices 'to' which an edge is filtered. For example, if 'V(g)$sourcetype == "newspaper"' is used, then the hour_window filter is only applied for edges 'to' newspaper documents (specifically, where the sourcetype attribute is "newspaper").

from_vertices

A filter to select the vertices 'from' which an edge is filtered. Works identically to 'to_vertices'.

Details

It is recommended to use the show_window function to verify whether the hour windows are correct according to the assumptions and focus of the study.
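A minimal base R sketch of the filtering logic, assuming each edge carries an hour difference and that the vertex selection applies to the 'to' side only. The edge table, the `to_sourcetype` lookup and the variable names are hypothetical; the sketch works on a plain data.frame rather than an igraph object.

```r
edges <- data.frame(from = c("n1", "n2", "n3", "n4"),
                    to   = c("p1", "p1", "o1", "o1"),
                    hourdiff = c(2, 20, 2, 20),
                    stringsAsFactors = FALSE)

# Hypothetical vertex attribute of the 'to' documents
to_sourcetype <- c(p1 = "Print NP", o1 = "Online NP")

hour_window <- c(6, 36)

# The window is only enforced for edges 'to' the selected vertices
selected  <- to_sourcetype[edges$to] == "Print NP"
in_window <- edges$hourdiff >= hour_window[1] & edges$hourdiff <= hour_window[2]

filtered <- edges[!selected | in_window, ]
filtered
```

The edge from n1 to p1 is dropped because it points to a print newspaper document and lies outside the window; the edges to o1 are untouched even though one of them also lies outside it.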

Value

A network/graph in the igraph class

Examples

data(docnet)
show_window(docnet, to_attribute = 'source') # before filtering

docnet = filter_window(docnet, hour_window = c(0.1,24))

docnet = filter_window(docnet, hour_window = c(6,36), 
                       to_vertices = igraph::V(docnet)$sourcetype == 'Print NP')

show_window(docnet, to_attribute = 'sourcetype') # after filtering per sourcetype
show_window(docnet, to_attribute = 'source') # after filtering per source

View term scores for a given document

Description

View term scores for a given document

Usage

get_doc_terms(dtm, docname = NULL, doc_i = NULL)

Arguments

dtm

A quanteda dfm

docname

Name of the document to select

doc_i

Alternatively, select the document by index

Value

A named vector with terms (names) and scores

Examples

get_doc_terms(rnewsflow_dfm, doc_i=1)

View overlapping terms for a given pair of documents

Description

View overlapping terms for a given pair of documents

Usage

get_overlap_terms(dtm, doc.x, doc.y, dtm.y = dtm)

Arguments

dtm

A quanteda dfm

doc.x

The name of the first document in dtm

doc.y

The name of the second document in dtm (or dtm.y)

dtm.y

Optionally, a second dtm (for when the documents occur in separate dtm's)

Value

A character vector

Examples

get_overlap_terms(rnewsflow_dfm, 
                  quanteda::docnames(rnewsflow_dfm)[1],
                  quanteda::docnames(rnewsflow_dfm)[5])

Inspect effects of thresholds on matches over time

Description

If it can be assumed that matches should only occur within a given time range (e.g., event data should match news items after the event occurred), a low-effort validation can be obtained by looking at whether the matches only occur within this time range. This function plots the percentage of matches within a given time range (hourdiff) for different thresholds of the weight column. This can be used to determine a good threshold.

Usage

hourdiff_range_thresholds(
  g,
  breaks = 20,
  hourdiff_range = c(0, Inf),
  min_weight = NA,
  min_hourdiff = NA,
  max_hourdiff = NA
)

Arguments

g

The output of newsflow_compare (either as "igraph" or "edgelist")

breaks

The number of breaks for the weight threshold

hourdiff_range

The time period (hourdiff range) in which the match 'should' occur.

min_weight

Optionally, filter out all values below the given weight

min_hourdiff

The lowest possible hourdiff value. This is used to estimate noise. If not specified, it will be estimated based on the data.

max_hourdiff

The highest possible hourdiff value.

Value

Nothing. The function only produces a plot.
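The quantity described above can be sketched in base R as follows. This is an illustration under assumptions, not the function's implementation: the toy edge table and the fixed thresholds are made up, and the noise estimate based on min_hourdiff and max_hourdiff is left out.

```r
# Toy matches: similarity weight and hour difference per document pair
edges <- data.frame(weight   = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                    hourdiff = c(-10,   5,  -3,  12,  30,  -1,   8,  20))

hourdiff_range <- c(0, Inf)          # matches 'should' occur after the event
thresholds <- c(0.2, 0.4, 0.6, 0.8)  # candidate weight thresholds

# Percentage of remaining matches that fall inside the expected range
pct_in_range <- sapply(thresholds, function(th) {
  e <- edges[edges$weight >= th, ]
  100 * mean(e$hourdiff >= hourdiff_range[1] & e$hourdiff <= hourdiff_range[2])
})

data.frame(threshold = thresholds, pct_in_range = pct_in_range)
```

In this toy data the share of matches inside the range rises as the threshold goes up, which is the pattern one would look for when choosing a threshold.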


Aggregate the edges of a network by vertex attributes

Description

This function offers a versatile way to aggregate the edges of a network based on the vertex attributes. Although it was designed specifically for document similarity networks, it can be used for any network in the igraph class.

Usage

network_aggregate(
  g,
  by = NULL,
  by_from = by,
  by_to = by,
  edge_attribute = "weight",
  agg_FUN = mean,
  return_df = FALSE,
  keep_isolates = T
)

Arguments

g

A network/graph in the igraph class

by

A character string indicating the vertex attributes by which the edges will be aggregated.

by_from

Optionally, specify different vertex attributes to aggregate the 'from' side of edges

by_to

Optionally, specify different vertex attributes to aggregate the 'to' side of edges

edge_attribute

Select an edge attribute to aggregate using the function specified in 'agg_FUN'. Defaults to 'weight'.

agg_FUN

The function used to aggregate the edge attribute

return_df

Optional. If TRUE, the results are returned as a data.frame. This can in particular be convenient if by_from and by_to are used.

keep_isolates

If TRUE, also return scores for isolates

Details

The first argument is the network (in the 'igraph' class). The second argument, for the 'by' parameter, is a character vector to indicate one or more vertex attributes based on which the edges are aggregated. Optionally, the 'by' parameter can also be specified separately for 'by_from' and 'by_to'.

By default, the function returns the aggregated network as an igraph class. The edges in the aggregated network have five standard attributes. The 'edges' attribute counts the number of edges from the 'from' group to the 'to' group. The 'from.V' attribute shows the number of vertices in the 'from' group that matched with a vertex in the 'to' group. The 'from.Vprop' attribute shows this as the proportion of all vertices in the 'from' group. The 'to.V' and 'to.Vprop' attributes show the same for the 'to' group.

In addition, one of the edge attributes of the original network can be aggregated with a given function. These are specified in the 'edge_attribute' and 'agg_FUN' parameters.
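The five standard attributes and the aggregated edge attribute can be reproduced by hand on a small example. The base R sketch below works on plain data.frames instead of an igraph object, and the vertex and edge tables as well as the `agg.weight` column name are assumptions for illustration.

```r
vertices <- data.frame(name = c("a1", "a2", "a3", "b1", "b2"),
                       source = c("A", "A", "A", "B", "B"),
                       stringsAsFactors = FALSE)
edges <- data.frame(from = c("a1", "a1", "a2"),
                    to   = c("b1", "b2", "b1"),
                    weight = c(0.5, 0.7, 0.9),
                    stringsAsFactors = FALSE)

# Attach the grouping attribute to both sides of each edge
edges$from_group <- vertices$source[match(edges$from, vertices$name)]
edges$to_group   <- vertices$source[match(edges$to, vertices$name)]
n_in_group <- sapply(split(vertices$name, vertices$source), length)

# One row per (from group, to group) combination that has edges
groups <- split(edges, list(edges$from_group, edges$to_group), drop = TRUE)
agg <- do.call(rbind, lapply(groups, function(e) {
  data.frame(from = e$from_group[1],
             to = e$to_group[1],
             edges = nrow(e),
             from.V = length(unique(e$from)),
             from.Vprop = length(unique(e$from)) / n_in_group[[e$from_group[1]]],
             to.V = length(unique(e$to)),
             to.Vprop = length(unique(e$to)) / n_in_group[[e$to_group[1]]],
             agg.weight = mean(e$weight),  # illustrative name; agg_FUN = mean
             stringsAsFactors = FALSE)
}))
agg
```

Two of the three documents in group A match a document in group B, so from.Vprop is 2/3, while both documents in group B are matched, so to.Vprop is 1.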

Value

A network/graph in the igraph class, or a data.frame if return_df is TRUE.

Examples

data(docnet)
aggdocnet = network_aggregate(docnet, by='sourcetype')
igraph::get.data.frame(aggdocnet, 'both')

aggdocdf = network_aggregate(docnet, by_from='sourcetype', by_to='source', return_df = TRUE)
head(aggdocdf)

Create a network of document similarities over time

Description

This is a wrapper for the compare_documents function, specialised for the case of analyzing documents over time. The difference is that using date_var is mandatory, and the output is returned as an igraph network (using as_document_network).

Usage

newsflow_compare(
  dtm,
  dtm_y = NULL,
  date_var = "date",
  hour_window = c(-24, 24),
  group_var = NULL,
  measure = c("cosine", "overlap_pct", "overlap", "dot_product", "softcosine"),
  tf_idf = F,
  min_similarity = 0,
  n_topsim = NULL,
  only_complete_window = T,
  ...
)

Arguments

dtm

A quanteda dfm. Note that it is common to first weight the dtm(s) before calculating document similarity. For this you can use quanteda's dfm_tfidf and dfm_weight.

dtm_y

Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm_y.

date_var

The name of the column in meta that specifies the document date. Default is "date". The values should be of type POSIXct, or coercible with as.POSIXct. The hour_window argument is used to only compare documents within a time window around this date.

hour_window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. It is possible to specify time windows down to the level of seconds by using fractions (hours / 60 / 60).

group_var

Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical soft cosine measure "softcosine" (experimental). The regular dot product ("dot_product") is also supported.

tf_idf

If TRUE, weight the dtm (and dtm_y) by term frequency - inverse document frequency. For more control over weighting, we recommend using quanteda's dfm_tfidf or dfm_weight on dtm and dtm_y.

min_similarity

A threshold for similarity. Lower values are deleted. For all available similarity measures, zero means no similarity.

n_topsim

An alternative or additional sort of threshold for similarity. Only keep the [n_topsim] highest similarity scores for x. Can return more than [n_topsim] similarity scores in the case of duplicate similarities.

only_complete_window

If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x.

...

Other arguments passed to compare_documents.

Value

An igraph network.

Examples

dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
g = newsflow_compare(dtm, date_var='date', hour_window = c(0.1, 36))

Transform document network so that each document only matches the earliest dated matching document

Description

Transforms the network so that a document only has an edge to the earliest dated document it matches within the specified time window.

Usage

only_first_match(g)

Arguments

g

A document similarity network, as created with newsflow_compare or create_document_network

Details

If there are multiple earliest dated documents (that is, having the same publication date) then edges to all earliest dated documents are kept.
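The tie rule described above can be sketched in base R. The edge table and its column names (`doc`, `match`, `match_date`) are hypothetical, and the sketch deliberately avoids igraph edge directions: it only shows that, per document, every match sharing the earliest date is kept.

```r
edges <- data.frame(doc   = c("x", "x", "x", "y"),
                    match = c("m1", "m2", "m3", "m3"),
                    match_date = as.POSIXct("2010-01-01 00:00:00", tz = "UTC") +
                      c(1, 1, 5, 5) * 3600,
                    stringsAsFactors = FALSE)

# Earliest match date per document, repeated for each of its edges
match_time <- as.numeric(edges$match_date)
earliest <- ave(match_time, edges$doc, FUN = min)

# Keep every edge whose match shares that earliest date (ties are kept)
first_only <- edges[match_time == earliest, ]
first_only
```

Document x keeps both m1 and m2 because they share the earliest date, and loses its edge to the later m3.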

Value

A network/graph in the igraph class

Examples

data(docnet)

subcomp1 = igraph::decompose.graph(docnet)[[2]]
subcomp2 = only_first_match(subcomp1)

igraph::get.data.frame(subcomp1)
igraph::get.data.frame(subcomp2)

graphics::par(mfrow=c(2,1))
document_network_plot(subcomp1, main='All matches')
document_network_plot(subcomp2, main='Only first match')
graphics::par(mfrow=c(1,1))

quanteda dfm for RNewsflow vignette demo

Description

quanteda dfm for RNewsflow vignette demo

Usage

rnewsflow_dfm

Format

dfm


Show time window of document pairs

Description

This function aggregates the edges for all combinations of attributes specified in 'from_attribute' and 'to_attribute', and shows the minimum and maximum hour difference for each combination.

Usage

show_window(g, to_attribute = NULL, from_attribute = NULL)

Arguments

g

A document similarity network, as created with newsflow_compare or create_document_network

to_attribute

The vertex attribute to aggregate the 'to' group of the edges

from_attribute

The vertex attribute to aggregate the 'from' group of the edges

Details

The filter_window function can be used to filter edges that fall outside of the intended time window.

Value

A data.frame showing the left and right edges of the window for each unique group.

Examples

data(docnet)
show_window(docnet, to_attribute = 'source')
show_window(docnet, to_attribute = 'sourcetype')
show_window(docnet, to_attribute = 'sourcetype', from_attribute = 'sourcetype')
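
The aggregation that show_window reports can be approximated with base R on an ordinary data frame. The sketch below is an assumption-laden illustration: the column names from_source, to_source and hourdiff are hypothetical, and the package's own output columns may be named differently.

```r
# Hypothetical edges: hour difference per edge plus the source of the
# 'from' and 'to' documents (column names are illustrative only).
edges <- data.frame(
  from_source = c("paper", "paper", "online", "online", "paper"),
  to_source   = c("online", "online", "paper", "paper", "paper"),
  hourdiff    = c(0.5, 12, 3, 30, 24)
)

# Minimum and maximum hour difference for each from/to combination.
window <- aggregate(hourdiff ~ from_source + to_source, data = edges,
                    FUN = function(x) c(min = min(x), max = max(x)))

# aggregate() stores the two statistics in a matrix column; flatten it
# into regular columns named hourdiff.min and hourdiff.max.
window <- do.call(data.frame, window)
window
```

A combination whose minimum or maximum falls outside the intended window is a candidate for filtering with filter_window.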

tcrossprod with benefits, for people that like parameters

Description

This function (including the underlying cpp function batched_tcrossprod_cpp) is the workhorse of the RNewsflow package. It has unnervingly many arguments for a tcrossprod because it needs to be able to do many things efficiently. While it's mostly a backend function, we expose it because it has applications outside of RNewsflow, but we make no excuses for the fact that readability is very much sacrificed here for the convenience of being able to keep adding features that we need for RNewsflow.

Usage

tcrossprod_sparse(
  m,
  m2 = NULL,
  min_value = NULL,
  max_value = NULL,
  only_upper = F,
  diag = T,
  top_n = NULL,
  rowsum_div = F,
  max_p = 1,
  pvalue = c("disparity", "normal", "lognormal", "nz_normal", "nz_lognormal"),
  normalize = c("none", "l2", "softl2"),
  crossfun = c("prod", "min", "softprod", "maxproduct", "lookup", "cp_lookup",
    "cp_lookup_norm"),
  group = NULL,
  group2 = NULL,
  date = NULL,
  date2 = NULL,
  lwindow = -1,
  rwindow = 1,
  date_unit = c("days", "hours", "minutes", "seconds"),
  simmat = NULL,
  simmat_thres = NULL,
  row_attr = F,
  col_attr = F,
  lag_attr = F,
  batchsize = 1000,
  verbose = F
)

Arguments

m

A CsparseMatrix

m2

A CsparseMatrix

min_value

Optionally, a numerical value, specifying the threshold for including a score in the output.

max_value

Optionally, a numerical value for the upper limit for including a score in the output.

only_upper

If true, only the upper triangle of the matrix is returned. Only possible for symmetrical output (m and m2 have same number of columns)

diag

If false, the diagonal of the matrix is not returned. Only possible for symmetrical output (m and m2 have same number of columns)

top_n

An integer, specifying the top number of strongest similarities per row. So, for each row in m at most top_n scores are returned.

rowsum_div

If true, divide crossproduct by column sums of m. (this has to happen within the loop for min_value and top_n filtering).

max_p

A threshold for the maximum p value.

pvalue

If max_p < 1, edges are removed based on a p value. For each document in dtm, a p value is calculated over its outward edges. Default is the p-value based on uniform distribution, akin to a "disparity" filter (see Serrano et al., DOI: 10.1073/pnas.0808904106) but without filtering on inward edges.

normalize

Normalize rows by a given norm score (before calculating similarity). Default is 'none' (no normalization). 'l2' is the l2 norm (use in combination with 'prod' crossfun for cosine similarity). 'softl2' is the adaptation of l2 for soft similarity (use in combination with 'softprod' crossfun for soft cosine).

crossfun

The function used in the vector operations. Normally this is the "prod", for product (dot product). Here we also allow the "min", for minimum value. We use this in our document overlap_pct score. In addition, there is the (experimental) softprod, that can be used in combination with softl2 normalization to get the soft cosine similarity. The "maxproduct" is a special case used in the query_lookup measure, that uses product but only returns the score of the strongest matching term. The "cp_lookup" and "cp_lookup_norm" are special cases for conditional probability sensitive lookup.

group

Optionally, a character vector that specifies a group (e.g., source) for each row in m. If given, only pairs of rows with the same group are calculated.

group2

If m2 and group are used, group2 has to be used to specify the groups for the rows in m2 (otherwise group will be ignored)

date

Optionally, a POSIXct vector (or a vector that can be converted to as.POSIXct) that specifies a date for each row in m. If given, only pairs of rows within a given date range (see lwindow, rwindow and date_unit) are calculated.

date2

If m2 and date are used, date2 has to be used to specify the date for the rows in m2 (otherwise date will be ignored)

lwindow

If date (and date2) are used, lwindow determines the left side of the date window. e.g. -10 means that rows are only matched with rows for which date is within 10 [date_units] before.

rwindow

Like lwindow, but for the right side. E.g., an lwindow of -1 and an rwindow of 1, with date_unit set to "days", means that rows are only matched if their dates are within a distance of 1 day.

date_unit

The date unit used in lwindow and rwindow. Supports "days", "hours", "minutes" and "seconds". Note that this refers to the time distance between two rows ("days" doesn't refer to calendar days, but to a time of 24 hours).

simmat

If softcos is used, a symmetric matrix with terms that indicates the similarity of terms (i.e. adjacency matrix). If NULL, a cosine similarity matrix will be created on the go

simmat_thres

If softcos is used, a threshold for the term similarity.

row_attr

If TRUE, add the "row_n" and "row_sum" elements to the "margin" attribute.

col_attr

Like row_attr, but adding "col_n" and "col_sum" to the "margin" attribute.

lag_attr

If TRUE, adds "lag_n" and "lag_sum" to the "margin" attribute. These are the margin scores for rows, where the date of the column is before (lag) the date of the row. Only possible if date argument is given.

batchsize

If group and/or date are used, size of batches.

verbose

If TRUE, report progress

Details

Enables limiting row combinations to within specified groups and date windows, and filters results that do not pass the threshold on the fly. To achieve this, options for similarity measures are included in the function. For example, to get the cosine similarity, you can normalize with "l2" and use the "prod" (product) function for the crossfun argument.

This function is called by the document comparison functions (newsflow_compare, delete_duplicates). We only expose it here for additional flexibility, and because it could be useful outside of the purpose of this package.

The output matrix also has an attribute "margin", which contains margin scores (e.g., row_sum) if the row_attr or col_attr arguments are used. The reason for including this is that some values that are normally available in the output of a cross product are broken if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min/max or top_n filter is used, we don't know the true row sums (and thus row means).
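
The cosine similarity recipe mentioned above (l2 normalization followed by an ordinary product) and the reason for the "margin" attribute can be illustrated with a small dense matrix in base R. This is a sketch of the underlying arithmetic only; it does not call tcrossprod_sparse, and the matrix m and the 0.5 threshold are made up for the example.

```r
# Four toy documents (rows) over four terms (columns).
m <- rbind(d1 = c(1, 2, 0, 0),
           d2 = c(2, 4, 0, 0),
           d3 = c(0, 0, 3, 1),
           d4 = c(1, 0, 1, 0))

# "l2" normalization: divide every row by its Euclidean length.
m_l2 <- m / sqrt(rowSums(m^2))

# "prod" cross function: ordinary inner products between rows,
# which on l2-normalized rows gives the cosine similarity.
cos_sim <- tcrossprod(m_l2)

# Filtering in the spirit of min_value: drop scores below a threshold.
filtered <- cos_sim
filtered[filtered < 0.5] <- 0

# Row sums before and after filtering differ, which is why margin
# scores have to be recorded separately when filters are applied.
rowSums(cos_sim)
rowSums(filtered)
```

Here d1 and d2 point in the same direction and get a cosine of 1, while the weak d1-d4 similarity disappears after filtering, so the filtered row sums no longer reflect the full comparison.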

Value

A CsparseMatrix

Examples

set.seed(1)
m = Matrix::rsparsematrix(5,10,0.5)
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = TRUE)
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0.2, only_upper = TRUE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE, top_n = 1)
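
The date window controlled by lwindow, rwindow and date_unit can be pictured as a logical mask over row pairs. The base R sketch below is illustrative and makes one assumption that the documentation does not state: both window boundaries are treated as inclusive. The dates are invented.

```r
date <- as.POSIXct(c("2020-01-01 12:00", "2020-01-02 11:00",
                     "2020-01-02 13:00", "2020-01-05 00:00"), tz = "UTC")
lwindow <- -1
rwindow <- 1

# Time distance from row i to row j in seconds, then in units of
# 24 hours (date_unit "days" is elapsed time, not calendar days).
secs <- outer(as.numeric(date), as.numeric(date), function(i, j) j - i)
days <- secs / (24 * 60 * 60)

# Assumption: both window boundaries are inclusive.
in_window <- days >= lwindow & days <= rwindow
in_window
```

Rows 1 and 2 are 23 hours apart and fall inside the window, whereas rows 1 and 3 are 25 hours apart and fall outside it, even though both pairs span consecutive calendar days.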

Find terms with similar spelling

Description

A quick, language agnostic way for finding terms with similar spelling. Calculates similarity as the percentage of a term's bigrams or trigrams that also occur in the other term. The percentage has to be above the given threshold for both terms (unless allow_asym = T).

Usage

term_char_sim(
  voc,
  type = c("tri", "bi"),
  min_overlap = 2/3,
  max_diff = 4,
  pad = F,
  as_lower = T,
  same_start = 1,
  drop_non_alpha = T,
  min_length = 5,
  allow_asym = F,
  verbose = T
)

Arguments

voc

A character vector that gives the vocabulary (e.g., colnames of a dtm)

type

Either "bi" (bigrams) or "tri" (trigrams)

min_overlap

The minimal overlap percentage. Works together with max_diff to determine required overlap

max_diff

The maximum number of bi/tri-grams that may differ

pad

If True, pad the left side (ls) and right side (rs) of bi/tri-grams. So, trigrams for "pad" would be: "ls_ls_p", "ls_p_a", "p_a_d", "a_d_rs", "d_rs_rs".

as_lower

If True, ignore case

same_start

Should terms start with the same character(s)? Given as a number for the number of same characters. (also greatly speeds up calculation)

drop_non_alpha

If True, ignore non alpha terms (e.g., numbers, punctuation). They will appear in the output matrix, but only with zeros.

min_length

The minimum number of characters in a term. Terms with fewer characters are ignored. They will appear in the output matrix, but only with zeros.

allow_asym

If True, the match only needs to be true for at least one term. In practice, this means that "America" would match perfectly with "Southern-America".

verbose

If True, report progress

Value

A similarity matrix in the CsparseMatrix format

Examples

dfm = quanteda::tokens(c('That guy Gadaffi','Do you mean Kadaffi?',
                         'Nah more like Gadaffel','What Gargamel?')) |>
  quanteda::dfm()
simmat = term_char_sim(colnames(dfm), same_start=0)
term_union(dfm, simmat, verbose = FALSE)
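
The overlap measure from the Description can be worked through by hand for a single pair of terms. The following base R sketch uses two hypothetical helpers, char_ngrams and overlap, and assumes unpadded, lower-cased trigrams with duplicates counted once; the package may differ in those details.

```r
# Character n-grams of a single term (no padding, lower case).
char_ngrams <- function(term, n = 3) {
  term <- tolower(term)
  if (nchar(term) < n) return(character(0))
  starts <- seq_len(nchar(term) - n + 1)
  unique(substring(term, starts, starts + n - 1))
}

# Share of the n-grams of `a` that also occur in `b`.
overlap <- function(a, b, n = 3) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  if (length(ga) == 0) return(0)
  mean(ga %in% gb)
}

# Symmetric case: four of the five trigrams are shared in each direction.
overlap("gadaffi", "kadaffi")
overlap("kadaffi", "gadaffi")

# Asymmetric case: all trigrams of the short term occur in the long one,
# but not the other way around.
overlap("america", "southern-america")
overlap("southern-america", "america")
```

With the default min_overlap of 2/3, the first pair passes in both directions, while the second pair passes in only one direction and would therefore need allow_asym to count as a match.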

Calculate statistics for term occurrence across days

Description

Calculate statistics for term occurrence across days

Usage

term_day_dist(dtm, meta = NULL, date.var = "date")

Arguments

dtm

A quanteda dfm. Alternatively, a DocumentTermMatrix from the tm package can be used, but then the meta parameter needs to be specified manually

meta

If dtm is a quanteda dfm, docvars(dtm) is used by default (meta is NULL) to obtain the meta data. Otherwise, the meta data.frame has to be given by the user, with the rows of the meta data.frame matching the rows of the dtm (i.e. each row is a document)

date.var

The name of the meta column specifying the document date. Default is "date". The values should be of type POSIXlt or POSIXct

Value

A data.frame with statistics for each term.

Examples

tdd = term_day_dist(rnewsflow_dfm, date.var='date')
head(tdd)
tail(tdd)

Experimental: Convert dtm scores to a term innovation score, based on changes in term use over time

Description

For each term in m, the usage before and after the document date is compared (with a chi2 test) to see whether usage increased.

Usage

term_innovation(
  m,
  date,
  m2 = NULL,
  date2 = NULL,
  lwindow = -7,
  rwindow = 7,
  date_unit = c("days", "hours", "minutes", "seconds"),
  min_chi = 5.024,
  min_ratio = 2,
  smooth = 1
)

Arguments

m

A CsparseMatrix

date

A character vector that specifies a date for each row in m. If given, only pairs of rows within a given date range (see lwindow, rwindow and date_unit) are calculated.

m2

Optionally, use a different matrix for calculating the innovation scores. For example, if m is a DTM of press releases, m2 can be a DTM of news articles, to see if term usage increased in the news after the press release.

date2

If m2 is used, date2 has to be used to specify the date for the rows in m2 (otherwise date will be ignored)

lwindow

If date (and date2) are used, lwindow determines the left side of the date window. e.g. -10 means that rows are only matched with rows for which date is within 10 [date_units] before.

rwindow

Like lwindow, but for the right side. E.g., an lwindow of -1 and an rwindow of 1, with date_unit set to "days", means that rows are only matched if their dates are within a distance of 1 day.

date_unit

The date unit used in lwindow and rwindow. Supports "days", "hours", "minutes" and "seconds". Note that this refers to the time distance between two rows ("days" doesn't refer to calendar days, but to a time of 24 hours).

min_chi

The minimum chi-square value

min_ratio

The minimum ratio (rwindow score / lwindow score)

smooth

The smoothing factor (prevents -Inf/Inf ratio)

Value

A CsparseMatrix
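
The roles of min_chi, min_ratio and smooth can be made concrete with a single term and invented counts. The exact statistic used by term_innovation is not spelled out in this manual, so the sketch below is an assumption: it uses a standard chi-square test on a 2x2 table without continuity correction, and a simple additive smoothing of the usage rates. The default min_chi of 5.024 does correspond to the 0.975 quantile of the chi-square distribution with one degree of freedom.

```r
# Hypothetical counts for one term: documents containing the term and
# all documents, in the window before and after the document date.
before_term <- 3;  before_total <- 200
after_term  <- 18; after_total  <- 220
smooth <- 1

# Smoothed usage rates and their ratio (smoothing avoids division by zero).
rate_before <- (before_term + smooth) / (before_total + smooth)
rate_after  <- (after_term  + smooth) / (after_total  + smooth)
ratio <- rate_after / rate_before

# Standard chi-square statistic for the 2x2 table (assumed, see above).
tab <- matrix(c(after_term,  after_total  - after_term,
                before_term, before_total - before_term),
              nrow = 2, byrow = TRUE)
chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

# Flag the term only if both thresholds are met.
is_innovative <- chi2 > 5.024 && ratio > 2
c(ratio = ratio, chi2 = chi2)

round(qchisq(0.975, df = 1), 3)   # 5.024
```

Requiring both a minimum chi-square value and a minimum ratio guards against flagging terms whose increase is statistically detectable but small in size.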


Combine terms in a dtm

Description

Given a dtm and a similarity (adjacency) matrix, create a new column for each nonzero cell in the similarity matrix. For the term combinations (everything except the diagonal) the column names will be pasted together with a "&" separator (read as AND)

Usage

term_intersect(dtm, simmat, as_dfm = T, verbose = F, sep = " & ", par = NA)

Arguments

dtm

A quanteda dfm or a CsparseMatrix.

simmat

A similarity matrix in CsparseMatrix format. For instance, created with term_char_sim

as_dfm

If True, return as quanteda dfm

verbose

If True, report progress

sep

The separator used for pasting the terms

par

If TRUE, add parentheses to colnames before combining. This is mainly for internal use, as it allows specification if OR (term_union) and AND (term_intersect) operations are combined. If NA, this is based on whether parentheses are present.

Value

A CsparseMatrix or quanteda dfm


Combine terms in a dtm

Description

Given a dtm and a similarity (adjacency) matrix, group clusters of similar terms (simmat > 0) into a single column. Column names will be concatenated, with a "|" separator (read as OR)

Usage

term_union(dtm, simmat, as_dfm = T, verbose = F, sep = "|", par = NA)

Arguments

dtm

A quanteda dfm or a CsparseMatrix.

simmat

A similarity matrix in CsparseMatrix format. For instance, created with term_char_sim

as_dfm

If True, return as quanteda dfm

verbose

If True, report progress

sep

The separator used for pasting the terms

par

If TRUE, add parentheses to colnames before combining. This is mainly for internal use, as it allows specification if OR (term_union) and AND (term_intersect) operations are combined. If NA, this is based on whether parentheses are present.

Value

A CsparseMatrix or quanteda dfm

Examples

dfm = quanteda::tokens(c('That guy Gadaffi','Do you mean Kadaffi?',
                         'Nah more like Gadaffel','Not Kadaffel?')) |>
  quanteda::dfm()
simmat = term_char_sim(colnames(dfm), same_start=0)
term_union(dfm, simmat, verbose = FALSE)
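
What grouping clusters of similar terms into a single column amounts to can be shown on a tiny dense matrix in base R. This sketch does not call term_union, and it rests on two assumptions that are not confirmed by this manual: clusters are taken to be the connected components of the adjacency matrix, and the scores of merged columns are summed. The matrix dtm and the adjacency matrix adj are invented.

```r
# Toy dtm (documents x terms) and a symmetric term adjacency matrix.
dtm <- rbind(doc1 = c(gadaffi = 1, kadaffi = 0, gadaffel = 0, guy = 1),
             doc2 = c(gadaffi = 0, kadaffi = 1, gadaffel = 0, guy = 0),
             doc3 = c(gadaffi = 0, kadaffi = 0, gadaffel = 2, guy = 0))
terms <- colnames(dtm)
adj <- matrix(FALSE, 4, 4, dimnames = list(terms, terms))
adj["gadaffi", "kadaffi"]  <- adj["kadaffi", "gadaffi"]  <- TRUE
adj["gadaffi", "gadaffel"] <- adj["gadaffel", "gadaffi"] <- TRUE
diag(adj) <- TRUE

# Find clusters by propagating the smallest index through linked terms
# until nothing changes (connected components of the adjacency matrix).
label <- seq_along(terms)
repeat {
  new_label <- unname(apply(adj, 1, function(linked) min(label[linked])))
  if (all(new_label == label)) break
  label <- new_label
}

# Merge each cluster into one column (assumption: scores are summed);
# the column names of a cluster are joined with "|".
groups <- split(seq_along(terms), label)
merged <- sapply(groups, function(j) rowSums(dtm[, j, drop = FALSE]))
colnames(merged) <- vapply(groups, function(j) paste(terms[j], collapse = "|"),
                           character(1), USE.NAMES = FALSE)
merged
```

Note that kadaffi and gadaffel end up in the same column even though they are not directly linked, because both are linked to gadaffi.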