Title: | Tools for Comparing Text Messages Across Time and Media |
---|---|
Description: | A collection of tools for measuring the similarity of text messages and tracing the flow of messages over time and across media. |
Authors: | Kasper Welbers & Wouter van Atteveldt |
Maintainer: | Kasper Welbers <[email protected]> |
License: | GPL-3 |
Version: | 1.2.8 |
Built: | 2025-02-27 03:31:45 UTC |
Source: | https://github.com/kasperwelbers/rnewsflow |
This function can be used to structure the output of the compare_documents function as an igraph network.
as_document_network(el)
el |
An RNewsflow_edgelist object, as created with compare_documents. |
A network/graph in the igraph class
dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
el = compare_documents(dtm, date_var='date', hour_window = c(0.1, 36))
g = as_document_network(el)
g
This function calculates document similarity scores using a vector space approach. The most important benefit is that it includes options for limiting the number of comparisons that need to be made and for filtering the results, which are efficiently implemented in a custom inner product calculation. This makes it possible to compare a huge number of documents, especially in cases where only documents within a given time window need to be compared.
compare_documents( dtm, dtm_y = NULL, date_var = NULL, hour_window = c(-24, 24), group_var = NULL, measure = c("cosine", "overlap_pct", "overlap", "dot_product", "softcosine", "cp_lookup", "cp_lookup_norm"), tf_idf = F, min_similarity = 0, n_topsim = NULL, only_complete_window = T, copy_meta = T, backbone_p = 1, simmat = NULL, simmat_thres = NULL, batchsize = 1000, verbose = FALSE )
dtm |
A quanteda dfm. Note that it is common to first weight the dtm(s) before calculating document similarity. For this you can use quanteda's dfm_tfidf and dfm_weight. |
dtm_y |
Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm_y. |
date_var |
Optionally, the name of the column in docvars that specifies the document date. The values should be of type POSIXct, or coercible with as.POSIXct. If given, the hour_window argument is used to only compare documents within a time window. |
hour_window |
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. It is possible to specify time windows down to the level of seconds by using fractions (hours / 60 / 60). |
group_var |
Optionally, the name of the column in docvars that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared. |
measure |
The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical soft cosine measure "softcosine" (experimental). The regular dot product ("dot_product") is also supported. |
tf_idf |
If TRUE, weigh the dtm (and dtm_y) by term frequency - inverse document frequency. For more control over weighting, we recommend using quanteda's dfm_tfidf or dfm_weight on dtm and dtm_y. |
min_similarity |
A threshold for similarity. Lower values are deleted. For all available similarity measures, zero means no similarity. |
n_topsim |
An alternative or additional sort of threshold for similarity. Only keep the [n_topsim] highest similarity scores for x. Can return more than [n_topsim] similarity scores in the case of duplicate similarities. |
only_complete_window |
If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x. |
copy_meta |
If TRUE, copy the dtm docvars to the from_meta and to_meta data.tables |
backbone_p |
Apply backbone filtering with a "disparity" filter (see Serrano et al., DOI: 10.1073/pnas.0808904106). It is different from the original disparity filter algorithm in that it only looks at outward edges. Also, the outward degree k is measured as all possible edges (within a window), not just the non-zero edges. |
simmat |
If softcosine is used, a symmetrical matrix with the similarity scores of terms. If NULL, the cosine similarity of terms in dtm will be used |
simmat_thres |
A large, dense simmat can lead to memory problems and slow down computation. A pragmatic (though not mathematically pure) solution is to use a threshold to prune small similarities. |
batchsize |
For internal use (testing) |
verbose |
If TRUE, report progress |
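As a rough illustration of the difference between the symmetrical "cosine" measure and the asymmetrical "overlap_pct" measure, the sketch below computes both in base R for three toy documents. It does not call the package. The formula used for overlap_pct (the summed scores of a document's terms that also occur in the other document, divided by that document's total score) is an assumption based on the description of the measure argument, and the names docs, m, cosine and overlap_pct are illustrative only.

```r
# Toy term counts; 'docs', 'm' and the formulas below are illustrative only
docs = c(d1 = "a b c d e", d2 = "e f g h i j k", d3 = "a b c")
tokens = strsplit(docs, " ", fixed = TRUE)
vocab = sort(unique(unlist(tokens)))
m = t(sapply(tokens, function(tok) tabulate(match(tok, vocab), nbins = length(vocab))))
colnames(m) = vocab

# Cosine similarity: inner product divided by the product of the vector norms
norms = sqrt(rowSums(m^2))
cosine = tcrossprod(m) / outer(norms, norms)

# Assumed overlap_pct: share of a row document's term scores that also occur
# in the column document (asymmetrical, so [i, j] can differ from [j, i])
overlap_pct = (m %*% t((m > 0) * 1)) / rowSums(m)

round(cosine, 2)
round(overlap_pct, 2)
```

In this toy case d3 is fully contained in d1, so the overlap from d3 to d1 is 1 while the overlap from d1 to d3 is 0.6; the cosine similarity is the same in both directions.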
By default, the function performs a regular tcrossprod of the dtm (with itself or with dtm_y). The following parameters can be set to limit comparisons and filter output:
If 'date_var' is specified, the given hour_window is used to only compare documents within the specified time distance.
If 'group_var' is specified, only documents for which the group is identical will be compared.
With the 'min_similarity' argument, the output can be filtered with a minimum similarity threshold. For the inner product of two DTMs the size of the output matrix is often the main bottleneck for comparing many documents, because it generally increases quadratically with the number of documents in the DTMs. Even a low similarity threshold can greatly reduce the size of the output.
As an alternative or additional filter, you can use 'n_topsim' to limit the results for each row in dtm to the highest similarity scores.
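To make the effect of the date and group filters concrete, here is a minimal base R sketch that marks which document pairs would still be compared. It is not the package's implementation (which avoids building full matrices); the inclusive window bounds and the names dates, group, in_window and allowed are assumptions made for illustration.

```r
# Illustrative document dates and groups (not package data)
dates = as.POSIXct(c("2010-01-01 00:00", "2010-01-01 12:00", "2010-01-03 00:00"), tz = "UTC")
group = c("A", "A", "B")
hour_window = c(-10, 36)

# Hour difference of each column document relative to each row document
hours = as.numeric(dates) / 3600
hourdiff = outer(hours, hours, function(x, y) y - x)

# Assumption: both window bounds are inclusive
in_window = hourdiff >= hour_window[1] & hourdiff <= hour_window[2]
same_group = outer(group, group, "==")

allowed = in_window & same_group
sum(in_window)                   # pairs left by the time window alone
sum(allowed)                     # pairs left when the group filter is added
sum(allowed) / length(allowed)   # share of all possible pairs
```

Because the window is not symmetrical, the second document falls within the window of the first (12 hours later), but the first does not fall within the window of the second (12 hours earlier).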
Margin attributes are also included in the output in the from_meta and to_meta data.tables (see details). If copy_meta = TRUE, the dtm docvars are also included in from_meta and to_meta.
Margin attributes are added to the meta data. The reason for including this is that some values that are normally available in a similarity matrix are missing if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min_similarity or n_topsim filter is used, we don't know the true row sums (and thus row means). The meta data therefore includes the "row_n", "row_sum", "col_n", and "col_sum". In addition, there are "lag_n" and "lag_sum". This is a special case where row_n and row_sum are calculated for only matches where the column date < row date (lag). This can be used for more refined calculations of edge probabilities before and after a row document.
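The following base R sketch shows why the margins cannot be recovered from a filtered edgelist. It uses a small hand-made similarity matrix; how the package counts row_n and row_sum internally is assumed here (all comparisons that were made, before filtering), and filtered_sum is a hypothetical name for what would remain visible after the threshold.

```r
# Illustrative similarity matrix: rows are documents in dtm,
# columns are the documents they were compared to
sim = matrix(c(0.9, 0.2, 0.0,
               0.1, 0.6, 0.3),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("x1", "x2"), c("y1", "y2", "y3")))
min_similarity = 0.5

# Margins over all comparisons that were made, before filtering (assumed)
row_n = rep(ncol(sim), nrow(sim))
row_sum = rowSums(sim)

# What remains visible in the edgelist after the threshold
filtered = sim * (sim >= min_similarity)
filtered_sum = rowSums(filtered)

data.frame(row_n, row_sum, filtered_sum, true_row_mean = row_sum / row_n)
```

The row sums of the filtered scores are lower than the true row sums, which is why the unfiltered margins are stored separately in the meta data.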
An S3 class for RNewsflow_edgelist, which is a list with the edgelist, from_meta and to_meta data.tables.
dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
el = compare_documents(dtm, date_var='date', hour_window = c(0.1, 36))

d = data.frame(text = c('a b c d e', 'e f g h i j k', 'a b c'),
               date = as.POSIXct(c('2010-01-01','2010-01-01','2012-01-01')),
               stringsAsFactors=FALSE)
corp = quanteda::corpus(d, text_field='text')
dtm = quanteda::tokens(corp) |> quanteda::dfm()
g = compare_documents(dtm)
g
g = compare_documents(dtm, measure = 'overlap_pct')
g
Combines document similarity data (d) with document meta data (meta) into an igraph network/graph.
create_document_network( d, meta, id_var = "document_id", date_var = "date", min_similarity = NA )
d |
A data.frame with three columns that represents an edgelist with weight values. The first and second column represent the names/ids of the 'from' and 'to' documents/vertices. The third column represents the similarity score. Column names are ignored. |
meta |
A data.frame where rows are documents and columns are document meta information. Should at least contain 2 columns: the document name/id and date. The name/id column should match the document names/ids of the edgelist, and its label is specified in the 'id_var' argument. The date column should be interpretable with as.POSIXct, and its label is specified in the 'date_var' argument. |
id_var |
The label for the document name/id column in the 'meta' data.frame. Default is "document_id". |
date_var |
The label for the document date column in the 'meta' data.frame. Default is "date". |
min_similarity |
For convenience, ignore all edges where the weight is below 'min_similarity'. |
This function is mainly offered to mimic the output of the as_document_network function when using imported document similarity data. This way the functions for transforming, aggregating and visualizing the document similarity data can be used.
A network/graph in the igraph class
d = data.frame(x = c(1,1,1,2,2,3),
               y = c(2,3,5,4,5,6),
               v = c(0.3,0.4,0.7,0.5,0.2,0.9))
meta = data.frame(document_id = 1:8,
                  date = seq.POSIXt(from = as.POSIXct('2010-01-01 12:00:00'),
                                    by='hour', length.out = 8),
                  medium = c(rep('Newspapers', 4), rep('Blog', 4)))
g = create_document_network(d, meta)
igraph::get.data.frame(g, 'both')
igraph::plot.igraph(g)
This function was designed for the task of matching short event descriptions to news articles, but can more generally be used for document matching tasks. However, it should be noted that it will require exponentially more memory for dtms with more unique terms, which is why it is less suitable for matching larger documents. This only applies to the dtm, not the ref_dtm. Thus, if your goal is to match smaller documents such as event descriptions to news, this function might be useful.
create_queries( dtm, ref_dtm = NULL, min_docfreq = 2, max_docprob = 0.01, weight = c("tfidf", "binary"), norm_weight = c("max", "doc_max", "dtm_max", "none"), min_obs_exp = NA, union_sim_thres = NA, combine_all = T, only_dtm_combs = T, use_dtm_and_ref = F, verbose = F )
dtm |
A quanteda dfm |
ref_dtm |
Optionally, another quanteda dfm. If given, the ref_dtm will be used to calculate the docfreq/docprob scores. |
min_docfreq |
The minimum frequency for terms or combinations of terms |
max_docprob |
The maximum probability (document frequency / N) for terms or combinations of terms |
weight |
Determine how to weight the queries (if ref_dtm is used, uses the idf of the ref_dtm, or of both the dtm and ref_dtm if use_dtm_and_ref is TRUE). "tfidf" (the first option in the usage, and thus the default) uses common tf-idf weighting (actually just idf, since scores are binary). "binary" only indicates whether a term does or does not occur. |
norm_weight |
Normalize the weight score so that the highest value is 1. If "max" is used, max is the highest possible value. "doc_max" uses the highest value within each document, and "dtm_max" uses the highest observed value in the dtm. |
min_obs_exp |
The minimum ratio of the observed and expected frequency of a term combination |
union_sim_thres |
If given, a number between 0 and 1, used as the cosine similarity threshold for combining clusters of terms |
combine_all |
If TRUE (the default in the usage), combine all terms. If FALSE, terms that are included as unigrams (i.e. that are within the min_docfreq and max_docprob) are not combined with other terms. |
only_dtm_combs |
Only include term combinations that occur in dtm. This makes sense (and saves a lot of memory) if you are only interested in asymmetric similarity measures based on the query. |
use_dtm_and_ref |
If a ref_dtm is used, the weight is computed based only on the document frequencies in the ref_dtm. If use_dtm_and_ref is set to TRUE, both the dtm and ref_dtm are used. |
verbose |
If TRUE, report progress |
The main purpose of the function is that it intersects the terms in a dtm to increase sparsity. This can improve certain document matching tasks, but at the cost of creating a bigger dtm. If all terms are combined this would be a quadratic increase of columns. However, only term combinations that occur in dtm (not ref_dtm) will be used. This is not a problem as long as the similarity of the documents in dtm to documents in dtm_y is calculated as an asymmetric similarity measure (i.e. in which the sum of terms in dtm_y is not used).
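The idea of intersecting terms can be sketched in base R as follows. This is only an illustration on a small binary matrix, not how create_queries is implemented; the matrix m, the object comb and the "a & b" column naming are hypothetical.

```r
# Illustrative binary document-term matrix (3 documents, 4 terms)
m = matrix(c(1, 1, 0, 1,
             1, 0, 1, 0,
             1, 1, 1, 0),
           nrow = 3, byrow = TRUE,
           dimnames = list(paste0("doc", 1:3), c("a", "b", "c", "d")))

# Every pair of terms becomes a candidate column: a term combination occurs
# in a document only if both terms occur in it
pairs = combn(colnames(m), 2)
comb = apply(pairs, 2, function(p) m[, p[1]] * m[, p[2]])
colnames(comb) = apply(pairs, 2, paste, collapse = " & ")

ncol(comb)        # number of candidate combinations, choose(4, 2)
mean(m > 0)       # density of the original matrix
mean(comb > 0)    # density of the combination matrix

# Combinations that never occur can be dropped
comb[, colSums(comb) > 0, drop = FALSE]
```

The combination matrix has more columns than the original but a lower density, which is the trade-off between sparsity and size described above.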
To emphasize that this feature preparation step is geared towards the task of 'looking up' documents, we use the terminology of a 'query'. The output of the function is a list of two dtms: query_dtm and ref_dtm. Both dtms have the exact same columns that contain the query terms. The values in query_dtm are by default tfidf weighted, and the values in ref_dtm are binary.
Several options are given to only create term combinations that are informative. Firstly, a minimum and maximum document frequency of term combinations can be defined. Secondly, a minimum observed/expected ratio can be given. The expected probability of a combination of term A and term B is their joint probability under independence, i.e. P(A) * P(B). If the observed probability is not higher, the combination is not more informative than chance. Thirdly, before intersecting terms, one can first cluster very similar terms together as single columns to reduce the number of possible combinations.
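As a small worked example of the observed/expected ratio, the base R sketch below uses two made-up binary term vectors. It assumes that the expected probability is the product of the two marginal probabilities; the exact computation inside create_queries may differ.

```r
# Illustrative binary occurrences of two terms across 10 documents
A = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
B = c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

p_A = mean(A)             # P(A)
p_B = mean(B)             # P(B)
observed = mean(A * B)    # observed P(A and B)
expected = p_A * p_B      # assumed expectation under independence

obs_exp = observed / expected
obs_exp
```

A ratio above 1 indicates that the two terms co-occur more often than would be expected by chance, so a combination like this one would pass a min_obs_exp threshold of 1.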
A list with a query dtm and ref_dtm. Designed for use in compare_documents using the special 'query_lookup' measure.
q = create_queries(rnewsflow_dfm, min_docfreq = 2, union_sim_thres = 0.9,
                   max_docprob = 0.05, verbose = FALSE)
head(colnames(q$query_dtm), 100)
Delete duplicate (or similar) documents from a document term matrix. Duplicates are defined by: having high content similarity, occurring within a given time distance and being published by the same source.
delete_duplicates( dtm, date_var = NULL, hour_window = c(-24, 24), group_var = NULL, measure = c("cosine", "overlap_pct"), similarity = 1, keep = "first", tf_idf = FALSE, dup_csv = NULL, verbose = F )
dtm |
A quanteda dfm. |
date_var |
The name of the column in docvars(dtm) that specifies the document date. The values should be of type POSIXlt or POSIXct |
hour_window |
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. |
group_var |
Optionally, column name in docvars(dtm) that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared. |
measure |
The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document). |
similarity |
A threshold for similarity. Documents of which the similarity is equal or higher are deleted. |
keep |
A character indicating whether to keep the 'first' or 'last' published of duplicate documents. |
tf_idf |
If TRUE, weight the dtm with tf_idf before comparing documents. The original (non-weighted) DTM is returned. |
dup_csv |
Optionally, a path for writing a csv file with the duplicates edgelist. For each duplicate pair it is noted if "from" or "to" is the duplicate, or if "both" are duplicates (of other documents) |
verbose |
If TRUE, report progress |
Note that this can also be used to delete "updates" of articles (e.g., on news sites, news agencies). This should be considered if the temporal order of publications is relevant for the analysis.
A dtm with the duplicate documents deleted
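The selection logic can be sketched in base R as follows, under the assumption that for each pair at or above the similarity threshold the later (or, with keep = "last", the earlier) document is dropped. The data and the names pairs, docs, dup and remaining are made up for illustration, and the pairs are assumed to already be restricted to the same source and time window.

```r
# Illustrative pairs of similar documents from the same source
pairs = data.frame(
  from = c("d1", "d1", "d3"),
  to = c("d2", "d4", "d4"),
  similarity = c(1.00, 0.40, 0.95),
  stringsAsFactors = FALSE
)
docs = data.frame(
  id = c("d1", "d2", "d3", "d4"),
  date = as.POSIXct(c("2010-01-01 08:00", "2010-01-01 09:00",
                      "2010-01-01 10:00", "2010-01-01 11:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
threshold = 0.9
keep = "first"

# Pairs at or above the threshold count as duplicates
dup = pairs[pairs$similarity >= threshold, ]
from_date = docs$date[match(dup$from, docs$id)]
to_date = docs$date[match(dup$to, docs$id)]

# Within each duplicate pair, decide which document is dropped
later = ifelse(to_date >= from_date, dup$to, dup$from)
earlier = ifelse(to_date >= from_date, dup$from, dup$to)
drop = if (keep == "first") later else earlier

remaining = setdiff(docs$id, drop)
remaining
```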
## example with very low similarity threshold (normally not recommended!)
dtm2 = delete_duplicates(rnewsflow_dfm, similarity = 0.5, keep='first', tf_idf = TRUE)
This is a convenience function for visualizing directed networks with edge labels using plot.igraph. It was designed specifically for visualizing aggregated document similarity networks in the RNewsflow package, but works with any network in the igraph class.
directed_network_plot( g, weight_var = "from.Vprop", weight_thres = NULL, delete_isolates = FALSE, vertex.size = 30, vertex.color = "lightblue", vertex.label.color = "black", vertex.label.cex = 0.7, edge.color = "grey", show.edge.labels = TRUE, edge.label.color = "black", edge.label.cex = 0.6, edge.arrow.size = 1, layout = igraph::layout.davidson.harel, ... )
g |
A network/graph in the igraph class |
weight_var |
The edge attribute that is used to specify the edges |
weight_thres |
A threshold for weight. Edges below the threshold are ignored |
delete_isolates |
If TRUE, isolates (i.e. vertices without edges) are ignored. |
vertex.size |
The size of the vertices/nodes. Defaults to 30. Can be a vector with values per vertex. |
vertex.color |
Color of vertices/nodes. Default is lightblue. Can be a vector with values per vertex. |
vertex.label.color |
Color of labels for vertices/nodes. Defaults to black. Can be a vector with values per vertex. |
vertex.label.cex |
Size of the labels for vertices/nodes. Defaults to 0.7. Can be a vector with values per vertex. |
edge.color |
Color of the edges. Defaults to grey. Can be a vector with values per edge. |
show.edge.labels |
Logical. Should edge labels be displayed? Default is TRUE. |
edge.label.color |
Color of the edge labels. Defaults to black. Can be a vector with values per edge. |
edge.label.cex |
Size of the edge labels. Defaults to 0.6. Can be a vector with values per edge. |
edge.arrow.size |
Size of the edge arrows. Defaults to 1. Can only be set globally (igraph might update this at some point) |
layout |
The igraph layout used to plot the network. Defaults to layout.davidson.harel |
... |
Arguments to be passed to the plot.igraph function. |
Nothing
data(docnet)
aggdocnet = network_aggregate(docnet, by='source')
directed_network_plot(aggdocnet, weight_var = 'to.Vprop', weight_thres = 0.2)
Document similarity network for one news agency, and the print and online editions of two newspapers
docnet: A network/graph in the igraph class as created with create_document_network or newsflow_compare.
Visualize (a subcomponent of) the document similarity network.
document_network_plot( g, date_attribute = "date", source_attribute = "source", subcomp_i = NULL, dtm = NULL, sources = NULL, only_outer_date = FALSE, date_format = "%Y-%m-%d %H:%M", margins = c(5, 8, 1, 13), isolate_color = NULL, source_loops = TRUE, ... )
g |
A document similarity network, as created with newsflow_compare or create_document_network |
date_attribute |
The label of the vertex/document date attribute. Default is "date" |
source_attribute |
The label of the vertex/document source attribute. Default is "source" |
subcomp_i |
Optional. If an integer is given, the network is decomposed into subcomponents (i.e. unconnected components) and only the i-th component is visualized. |
dtm |
Optional. If a document-term matrix that contains the documents in g is given, a wordcloud with the most common words of the network is added. |
sources |
Optional. Use a character vector to select only certain sources |
only_outer_date |
If TRUE, only the labels for the first and last date are reported on the x-axis |
date_format |
The date format of the date labels (see format.POSIXct) |
margins |
The margins of the network plot. The four values represent bottom, left, top and right margin. |
isolate_color |
Optional. Set a custom color for isolates |
source_loops |
If set to FALSE, all edges between vertices/documents of the same source are ignored. |
... |
Additional arguments for the network plotting function plot.igraph |
Nothing.
docnet = docnet
dtm = rnewsflow_dfm
docnet_comps = igraph::decompose.graph(docnet) # get subcomponents

# subcomponent 1
document_network_plot(docnet_comps[[1]])

# subcomponent 2 with wordcloud
document_network_plot(docnet_comps[[2]], dtm=dtm)

# subcomponent 3 with additional arguments passed to plot.igraph
document_network_plot(docnet_comps[[3]], dtm=dtm, vertex.color='red')
The 'filter_window' function can be used to filter the document pairs (i.e. edges) using the 'hour_window' parameter, which works identically to the 'hour_window' parameter in the 'newsflow_compare' function. In addition, the 'from_vertices' and 'to_vertices' parameters can be used to select the vertices (i.e. documents) for which this filter is applied.
filter_window(g, hour_window, to_vertices = NULL, from_vertices = NULL)
g |
A document similarity network, as created with newsflow_compare or create_document_network |
hour_window |
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. |
to_vertices |
A filter to select the vertices 'to' which an edge is filtered. For example, if 'V(g)$sourcetype == "newspaper"' is used, then the hour_window filter is only applied for edges 'to' newspaper documents (specifically, where the sourcetype attribute is "newspaper"). |
from_vertices |
A filter to select the vertices 'from' which an edge is filtered. Works identically to 'to_vertices'. |
It is recommended to use the show_window function to verify whether the hour windows are correct according to the assumptions and focus of the study.
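The selective filtering can be pictured with a plain edgelist in base R. This is a sketch under assumptions, not the package code: it assumes that edges whose 'to' vertex is not selected are left untouched and that the window bounds are inclusive. The data.frame edges and its column names are hypothetical.

```r
# Illustrative edges: hourdiff is the number of hours between the
# 'from' and the 'to' document
edges = data.frame(
  from = c("a", "b", "c", "d"),
  to = c("p1", "p2", "o1", "o2"),
  to_sourcetype = c("Print NP", "Print NP", "Online NP", "Online NP"),
  hourdiff = c(3, 12, 3, 40),
  stringsAsFactors = FALSE
)
hour_window = c(6, 36)

# Only edges 'to' the selected vertices are subject to the window
selected = edges$to_sourcetype == "Print NP"
in_window = edges$hourdiff >= hour_window[1] & edges$hourdiff <= hour_window[2]

filtered = edges[!selected | in_window, ]
filtered
```

Here only the edge to p1 is removed: it points to a selected vertex and lies outside the window, whereas the edges to the online documents are kept regardless of their hour difference.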
A network/graph in the igraph class
data(docnet)
show_window(docnet, to_attribute = 'source') # before filtering

docnet = filter_window(docnet, hour_window = c(0.1,24))
docnet = filter_window(docnet, hour_window = c(6,36),
                       to_vertices = igraph::V(docnet)$sourcetype == 'Print NP')

show_window(docnet, to_attribute = 'sourcetype') # after filtering per sourcetype
show_window(docnet, to_attribute = 'source') # after filtering per source
View term scores for a given document
get_doc_terms(dtm, docname = NULL, doc_i = NULL)
dtm |
A quanteda dfm |
docname |
Name of the document to select |
doc_i |
Alternatively, select the document by index |
A named vector with terms (names) and scores
get_doc_terms(rnewsflow_dfm, doc_i=1)
View overlapping terms for a given pair of documents
get_overlap_terms(dtm, doc.x, doc.y, dtm.y = dtm)
dtm |
A quanteda dfm |
doc.x |
The name of the first document in dtm |
doc.y |
The name of the second document in dtm (or dtm.y) |
dtm.y |
Optionally, a second dtm (for when the documents occur in separate dtm's) |
A character vector
get_overlap_terms(rnewsflow_dfm,
                  quanteda::docnames(rnewsflow_dfm)[1],
                  quanteda::docnames(rnewsflow_dfm)[5])
If it can be assumed that matches should only occur within a given time range (e.g., event data should match news items after the event occurred), a low-effort validation can be obtained by looking at whether the matches only occur within this time range. This function plots the percentage of matches within a given time range (hourdiff) for different thresholds of the weight column. This can be used to determine a good threshold.
hourdiff_range_thresholds( g, breaks = 20, hourdiff_range = c(0, Inf), min_weight = NA, min_hourdiff = NA, max_hourdiff = NA )
g |
The output of newsflow_compare (either as "igraph" or "edgelist") |
breaks |
The number of breaks for the weight threshold |
hourdiff_range |
The time period (hourdiff range) in which the match 'should' occur. |
min_weight |
Optionally, filter out all values below the given weight |
min_hourdiff |
The lowest possible hourdiff value. This is used to estimate noise. If not specified, it will be estimated based on the data. |
max_hourdiff |
The highest possible hourdiff value. |
Nothing... just plots
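The quantity that is plotted can be approximated by hand. The base R sketch below computes, for a few weight thresholds, the percentage of remaining matches that fall inside the hourdiff range. The data are made up for illustration, the noise estimation based on min_hourdiff and max_hourdiff is left out, and the names matches, thresholds and pct_in_range are hypothetical.

```r
# Illustrative matches with a similarity weight and an hour difference
matches = data.frame(
  weight = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
  hourdiff = c(-30, 5, -12, -20, 2, 3, 10, 1)
)
hourdiff_range = c(0, Inf)
thresholds = c(0, 0.25, 0.45, 0.65, 0.85)

# For each threshold, the percentage of remaining matches inside the range
in_range = matches$hourdiff >= hourdiff_range[1] & matches$hourdiff <= hourdiff_range[2]
pct_in_range = sapply(thresholds, function(th) {
  keep = matches$weight >= th
  100 * mean(in_range[keep])
})

data.frame(threshold = thresholds, pct_in_range)
```

In this toy data the share of matches inside the range rises as the threshold increases, which is the pattern one would look for when choosing a threshold.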
This function offers a versatile way to aggregate the edges of a network based on the vertex attributes. Although it was designed specifically for document similarity networks, it can be used for any network in the igraph class.
network_aggregate( g, by = NULL, by_from = by, by_to = by, edge_attribute = "weight", agg_FUN = mean, return_df = FALSE, keep_isolates = T )
g |
A network/graph in the igraph class |
by |
A character string indicating the vertex attributes by which the edges will be aggregated. |
by_from |
Optionally, specify different vertex attributes to aggregate the 'from' side of edges |
by_to |
Optionally, specify different vertex attributes to aggregate the 'to' side of edges |
edge_attribute |
Select an edge attribute to aggregate using the function specified in 'agg_FUN'. Defaults to 'weight'. |
agg_FUN |
The function used to aggregate the edge attribute |
return_df |
Optional. If TRUE, the results are returned as a data.frame. This can in particular be convenient if by_from and by_to are used. |
keep_isolates |
If TRUE, also return scores for isolates |
The first argument is the network (in the 'igraph' class). The second argument, for the 'by' parameter, is a character vector to indicate one or more vertex attributes based on which the edges are aggregated. Optionally, the 'by' parameter can also be specified separately for 'by_from' and 'by_to'.
By default, the function returns the aggregated network as an igraph class. The edges in the aggregated network have five standard attributes. The 'edges' attribute counts the number of edges from the 'from' group to the 'to' group. The 'from.V' attribute shows the number of vertices in the 'from' group that matched with a vertex in the 'to' group. The 'from.Vprop' attribute shows this as the proportion of all vertices in the 'from' group. The 'to.V' and 'to.Vprop' attributes show the same for the 'to' group.
In addition, one of the edge attributes of the original network can be aggregated with a given function. These are specified in the 'edge_attribute' and 'agg_FUN' parameters.
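To show how the five standard attributes relate to the underlying edges, here is a simplified base R sketch for a single pair of groups. It follows the description above rather than the package code; the data and the column name agg.weight for the aggregated edge attribute are hypothetical.

```r
# Illustrative vertices (documents with a source) and edges between them
vertices = data.frame(
  name = c("d1", "d2", "d3", "d4", "d5"),
  source = c("Agency", "Agency", "Agency", "Paper", "Paper"),
  stringsAsFactors = FALSE
)
edges = data.frame(
  from = c("d1", "d1", "d2"),
  to = c("d4", "d5", "d4"),
  weight = c(0.8, 0.6, 0.4),
  stringsAsFactors = FALSE
)

# Group sizes for the single group pair present here (Agency -> Paper)
n_from = sum(vertices$source == "Agency")
n_to = sum(vertices$source == "Paper")

agg = data.frame(
  from = "Agency", to = "Paper",
  edges = nrow(edges),                                  # number of edges
  from.V = length(unique(edges$from)),                  # matched 'from' vertices
  from.Vprop = length(unique(edges$from)) / n_from,     # as proportion of group
  to.V = length(unique(edges$to)),                      # matched 'to' vertices
  to.Vprop = length(unique(edges$to)) / n_to,
  agg.weight = mean(edges$weight)                       # agg_FUN = mean
)
agg
```

Two of the three agency documents have at least one match, so from.Vprop is 2/3, while both newspaper documents are matched, so to.Vprop is 1.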
A network/graph in the igraph class, or a data.frame if return_df is TRUE.
data(docnet)
aggdocnet = network_aggregate(docnet, by='sourcetype')
igraph::get.data.frame(aggdocnet, 'both')

aggdocdf = network_aggregate(docnet, by_from='sourcetype', by_to='source', return_df = TRUE)
head(aggdocdf)
This is a wrapper for the compare_documents function, specialised for the case of analyzing documents over time. The difference is that using date_var is mandatory, and the output is returned as an igraph network (using as_document_network).
newsflow_compare( dtm, dtm_y = NULL, date_var = "date", hour_window = c(-24, 24), group_var = NULL, measure = c("cosine", "overlap_pct", "overlap", "dot_product", "softcosine"), tf_idf = F, min_similarity = 0, n_topsim = NULL, only_complete_window = T, ... )
dtm |
A quanteda dfm. Note that it is common to first weight the dtm(s) before calculating document similarity. For this you can use quanteda's dfm_tfidf and dfm_weight. |
dtm_y |
Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm_y. |
date_var |
The name of the column in meta that specifies the document date. Default is "date". The values should be of type POSIXct, or coercible with as.POSIXct. The hour_window argument is used to only compare documents within a time window. |
hour_window |
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. It is possible to specify time windows down to the level of seconds by using fractions (hours / 60 / 60). |
group_var |
Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared. |
measure |
The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical soft cosine measure "softcosine" (experimental). The regular dot product ("dot_product") is also supported. |
tf_idf |
If TRUE, weigh the dtm (and dtm_y) by term frequency - inverse document frequency. For more control over weighting, we recommend using quanteda's dfm_tfidf or dfm_weight on dtm and dtm_y. |
min_similarity |
A threshold for similarity. Lower values are deleted. For all available similarity measures, zero means no similarity. |
n_topsim |
An alternative or additional sort of threshold for similarity. Only keep the [n_topsim] highest similarity scores for x. Can return more than [n_topsim] similarity scores in the case of duplicate similarities. |
only_complete_window |
If True, only compare articles (x) of which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x. |
... |
Other arguments passed to |
An igraph network.
dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
el = newsflow_compare(dtm, date_var='date', hour_window = c(0.1, 36))
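To make the hour_window argument more concrete, the following base R sketch mimics the window logic on four hypothetical publication times. It does not call RNewsflow; the variable names and the inclusive window bounds are assumptions made for illustration only.

```r
# Hypothetical publication times for four documents
date <- as.POSIXct(c("2024-01-01 08:00:00", "2024-01-01 08:00:20",
                     "2024-01-01 20:00:00", "2024-01-03 09:00:00"), tz = "UTC")
secs <- as.numeric(date)

# Hour difference of document y (columns) relative to document x (rows)
hourdiff <- outer(secs, secs, function(x, y) (y - x) / 3600)

# c(-10, 36): compare x to documents from 10 hours before to 36 hours after
hour_window <- c(-10, 36)
in_window <- hourdiff >= hour_window[1] & hourdiff <= hour_window[2]
diag(in_window) <- FALSE   # ignore the comparison of a document with itself

# A window at the level of seconds: 0 to 30 seconds, expressed in hours
sec_window <- c(0, 30) / 60 / 60
in_sec <- hourdiff >= sec_window[1] & hourdiff <= sec_window[2]
diag(in_sec) <- FALSE

in_window
in_sec
```

In this sketch the window is asymmetric: the first document is compared to the third (12 hours later), but the third is not compared to the first (12 hours earlier falls outside the left side of -10).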
Transforms the network so that a document only has an edge to the earliest dated document it matches within the specified time window.
only_first_match(g)
g |
A document similarity network, as created with newsflow_compare or create_document_network |
If there are multiple earliest dated documents (that is, having the same publication date) then edges to all earliest dated documents are kept.
A network/graph in the igraph class
data(docnet)
subcomp1 = igraph::decompose.graph(docnet)[[2]]
subcomp2 = only_first_match(subcomp1)
igraph::get.data.frame(subcomp1)
igraph::get.data.frame(subcomp2)
graphics::par(mfrow=c(2,1))
document_network_plot(subcomp1, main='All matches')
document_network_plot(subcomp2, main='Only first match')
graphics::par(mfrow=c(1,1))
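The selection rule, including the handling of ties, can be illustrated on a plain edge list. This is a minimal base R sketch, not the package's implementation; the data.frame and its column names (doc, match, match_date) are hypothetical.

```r
# Hypothetical edge list: each row says that `doc` matches an earlier document `match`
edges <- data.frame(
  doc        = c("d", "d", "d", "e", "e"),
  match      = c("a", "b", "c", "a", "b"),
  match_date = as.POSIXct(c("2024-01-01 08:00:00", "2024-01-01 08:00:00",
                            "2024-01-01 12:00:00", "2024-01-01 08:00:00",
                            "2024-01-01 09:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# For every doc, find the earliest date among its matches
match_secs <- as.numeric(edges$match_date)
earliest <- ave(match_secs, edges$doc, FUN = min)

# Keep only edges to the earliest dated match; ties are all kept
first_only <- edges[match_secs == earliest, ]
first_only
```

Document "d" keeps two edges here because "a" and "b" share the same earliest date, while "e" keeps only its edge to "a".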
quanteda dfm for RNewsflow vignette demo
rnewsflow_dfm
dfm
This function aggregates the edges for all combinations of attributes specified in 'from_attribute' and 'to_attribute', and shows the minimum and maximum hour difference for each combination.
show_window(g, to_attribute = NULL, from_attribute = NULL)
g |
A document similarity network, as created with newsflow_compare or create_document_network |
to_attribute |
The vertex attribute to aggregate the 'to' group of the edges |
from_attribute |
The vertex attribute to aggregate the 'from' group of the edges |
The filter_window function can be used to filter edges that fall outside of the intended time window.
A data.frame showing the left and right edges of the window for each unique group.
data(docnet)
show_window(docnet, to_attribute = 'source')
show_window(docnet, to_attribute = 'sourcetype')
show_window(docnet, to_attribute = 'sourcetype', from_attribute = 'sourcetype')
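The kind of aggregation described above can be sketched in base R on a hypothetical edge list. The column names (from_source, to_source, hourdiff) and the output layout are assumptions for illustration and may differ from what show_window returns.

```r
# Hypothetical edges with the source of both documents and their hour difference
edges <- data.frame(
  from_source = c("paper", "paper", "online", "online"),
  to_source   = c("online", "online", "paper", "paper"),
  hourdiff    = c(0.5, 30, 2, 12),
  stringsAsFactors = FALSE
)

# Minimum and maximum hour difference for each from/to combination
window_left  <- aggregate(hourdiff ~ from_source + to_source, data = edges, FUN = min)
window_right <- aggregate(hourdiff ~ from_source + to_source, data = edges, FUN = max)
window <- merge(window_left, window_right,
                by = c("from_source", "to_source"),
                suffixes = c("_min", "_max"))
window
```

A table like this makes it easy to spot groups whose observed window is wider than intended before filtering edges.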
This function (including the underlying cpp function batched_tcrossprod_cpp) is the workhorse of the RNewsflow package. It has unnervingly many arguments for a tcrossprod because it needs to be able to do many things efficiently. While it's mostly a backend function, we expose it because it has applications outside of RNewsflow, but we make no excuses for the fact that readability is very much sacrificed here for the convenience of being able to keep adding features that we need for RNewsflow.
tcrossprod_sparse( m, m2 = NULL, min_value = NULL, max_value = NULL, only_upper = F, diag = T, top_n = NULL, rowsum_div = F, max_p = 1, pvalue = c("disparity", "normal", "lognormal", "nz_normal", "nz_lognormal"), normalize = c("none", "l2", "softl2"), crossfun = c("prod", "min", "softprod", "maxproduct", "lookup", "cp_lookup", "cp_lookup_norm"), group = NULL, group2 = NULL, date = NULL, date2 = NULL, lwindow = -1, rwindow = 1, date_unit = c("days", "hours", "minutes", "seconds"), simmat = NULL, simmat_thres = NULL, row_attr = F, col_attr = F, lag_attr = F, batchsize = 1000, verbose = F )
m |
A CsparseMatrix |
m2 |
A CsparseMatrix |
min_value |
Optionally, a numerical value, specifying the threshold for including a score in the output. |
max_value |
Optionally, a numerical value for the upper limit for including a score in the output. |
only_upper |
If true, only the upper triangle of the matrix is returned. Only possible for symmetrical output (m and m2 have same number of columns) |
diag |
If false, the diagonal of the matrix is not returned. Only possible for symmetrical output (m and m2 have same number of columns) |
top_n |
An integer, specifying the top number of strongest similarities per row. So, for each row in m at most top_n scores are returned. |
rowsum_div |
If true, divide crossproduct by column sums of m. (this has to happen within the loop for min_value and top_n filtering). |
max_p |
A threshold for the maximum p value. |
pvalue |
If max_p < 1, edges are removed based on a p value. For each document in dtm, a p value is calculated over its outward edges. Default is the p-value based on uniform distribution, akin to a "disparity" filter (see Serrano et al., DOI: 10.1073/pnas.0808904106) but without filtering on inward edges. |
normalize |
Normalize rows by a given norm score (before calculating similarity). Default is 'none' (no normalization). 'l2' is the l2 norm (use in combination with 'prod' crossfun for cosine similarity). 'softl2' is the adaptation of l2 for soft similarity (use in combination with 'softprod' crossfun for soft cosine). |
crossfun |
The function used in the vector operations. Normally this is the "prod", for product (dot product). Here we also allow the "min", for minimum value. We use this in our document overlap_pct score. In addition, there is the (experimental) softprod, that can be used in combination with softl2 normalization to get the soft cosine similarity. The "maxproduct" is a special case used in the query_lookup measure, that uses product but only returns the score of the strongest matching term. The "cp_lookup" and "cp_lookup_norm" are special cases for conditional probability sensitive lookup. |
group |
Optionally, a character vector that specifies a group (e.g., source) for each row in m. If given, only pairs of rows with the same group are calculated. |
group2 |
If m2 and group are used, group2 has to be used to specify the groups for the rows in m2 (otherwise group will be ignored) |
date |
Optionally, a POSIXct vector (or a vector that can be converted to as.POSIXct) that specifies a date for each row in m. If given, only pairs of rows within a given date range (see lwindow, rwindow and date_unit) are calculated. |
date2 |
If m2 and date are used, date2 has to be used to specify the date for the rows in m2 (otherwise date will be ignored) |
lwindow |
If date (and date2) are used, lwindow determines the left side of the date window. e.g. -10 means that rows are only matched with rows for which date is within 10 [date_units] before. |
rwindow |
Like lwindow, but for the right side. For example, an lwindow of -1 and an rwindow of 1, with date_unit set to "days", means that rows are only matched if their dates are within 1 day of each other. |
date_unit |
The date unit used in lwindow and rwindow. Supports "days", "hours", "minutes" and "seconds". Note that this refers to the time distance between two rows ("days" doesn't refer to calendar days, but to a time of 24 hours). |
simmat |
If softcos is used, a symmetric matrix with terms that indicates the similarity of terms (i.e. adjacency matrix). If NULL, a cosine similarity matrix will be created on the go |
simmat_thres |
If softcos is used, a threshold for the term similarity. |
row_attr |
If TRUE, add the "row_n" and "row_sum" elements to the "margin" attribute. |
col_attr |
Like row_attr, but adding "col_n" and "col_sum" to the "margin" attribute. |
lag_attr |
If TRUE, adds "lag_n" and "lag_sum" to the "margin" attribute. These are the margin scores for rows, where the date of the column is before (lag) the date of the row. Only possible if date argument is given. |
batchsize |
If group and/or date are used, size of batches. |
verbose |
If TRUE, report progress |
Enables limiting row combinations to within specified groups and date windows, and filters results that do not pass the threshold on the fly. To achieve this, options for similarity measures are included in the function. For example, to get the cosine similarity, you can normalize with "l2" and use the "prod" (product) function for the
This function is called by the document comparison functions (newsflow_compare, delete_duplicates). We only expose it here for additional flexibility, and because it could be useful outside of the purpose of this package.
The output matrix also has an attribute "margin", which contains margin scores (e.g., row_sum) if the row_attr or col_attr arguments are used. The reason for including this is that some values that are normally available in the output of a cross product are broken if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min/max or top_n filter is used, we don't know the true row sums (and thus row means).
A CsparseMatrix
set.seed(1)
m = Matrix::rsparsematrix(5,10,0.5)
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = TRUE)
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0.2, only_upper = TRUE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE, top_n = 1)
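The relation between "l2" normalization, the "prod" crossfun and cosine similarity can be checked with the Matrix package alone. The sketch below is only an illustration under that assumption: it uses a small hypothetical matrix, does not call tcrossprod_sparse, and applies the threshold after the cross product rather than on the fly.

```r
library(Matrix)

# Three hypothetical documents (rows) over four terms (columns)
m <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                  j = c(1, 2, 1, 2, 3),
                  x = c(1, 2, 2, 4, 5),
                  dims = c(3, 4))

# "l2" normalization: divide every row by its Euclidean length
row_norm <- sqrt(rowSums(m^2))
m_l2 <- Diagonal(x = 1 / row_norm) %*% m

# "prod": the dot product of every pair of (normalized) rows is the cosine similarity
cosine <- as.matrix(tcrossprod(m_l2))

# Threshold filtering, here applied afterwards instead of inside the loop
min_value <- 0.5
cosine[cosine < min_value] <- 0
cosine
```

Rows 1 and 2 point in the same direction and therefore get a cosine of 1, while row 3 shares no terms with them and is filtered out.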
A quick, language-agnostic way of finding terms with similar spelling. Calculates similarity as the percentage of a term's bigrams or trigrams that also occur in the other term. The percentage has to be above the given threshold for both terms (unless allow_asym = T).
term_char_sim( voc, type = c("tri", "bi"), min_overlap = 2/3, max_diff = 4, pad = F, as_lower = T, same_start = 1, drop_non_alpha = T, min_length = 5, allow_asym = F, verbose = T )
voc |
A character vector that gives the vocabulary (e.g., colnames of a dtm) |
type |
Either "bi" (bigrams) or "tri" (trigrams) |
min_overlap |
The minimal overlap percentage. Works together with max_diff to determine required overlap |
max_diff |
The maximum number of bi/tri-grams that is different |
pad |
If True, pad the left side (ls) and right side (rs) of bi/tri-grams. So, trigrams for "pad" would be: "ls_ls_p", "ls_p_a", "p_a_d", "a_d_rs", "d_rs_rs". |
as_lower |
If True, ignore case |
same_start |
Should terms start with the same character(s)? Given as a number for the number of same characters. (also greatly speeds up calculation) |
drop_non_alpha |
If True, ignore non alpha terms (e.g., numbers, punctuation). They will appear in the output matrix, but only with zeros. |
min_length |
The minimum number of characters in a term. Terms with fewer characters are ignored. They will appear in the output matrix, but only with zeros. |
allow_asym |
If True, the match only needs to be true for at least one term. In practice, this means that "America" would match perfectly with "Southern-America". |
verbose |
If True, report progress |
A similarity matrix in the CsparseMatrix format
dfm = quanteda::tokens(c('That guy Gadaffi','Do you mean Kadaffi?', 'Nah more like Gadaffel','What Gargamel?')) |> quanteda::dfm()
simmat = term_char_sim(colnames(dfm), same_start=0)
term_union(dfm, simmat, verbose = FALSE)
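The overlap score itself is simple enough to sketch in base R. The helper functions char_ngrams and overlap_pct below are hypothetical names, and counting each distinct trigram once without padding is an assumption; term_char_sim may differ in such details.

```r
# Distinct character n-grams of a term (lowercased, no padding)
char_ngrams <- function(term, n = 3) {
  term <- tolower(term)
  if (nchar(term) < n) return(character(0))
  starts <- seq_len(nchar(term) - n + 1)
  unique(substring(term, starts, starts + n - 1))
}

# Share of the n-grams of `a` that also occur in `b` (asymmetrical)
overlap_pct <- function(a, b, n = 3) {
  mean(char_ngrams(a, n) %in% char_ngrams(b, n))
}

overlap_pct("Gadaffi", "Kadaffi")           # 4 of 5 trigrams are shared
overlap_pct("Kadaffi", "Gadaffi")           # same in the other direction
overlap_pct("America", "Southern-America")  # every trigram of "America" is found
overlap_pct("Southern-America", "America")  # but not the other way around
```

With min_overlap = 2/3, the first pair passes in both directions, whereas the "America" pair only passes in one direction, which is the case that allow_asym is meant to accept.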
Calculate statistics for term occurrence across days
term_day_dist(dtm, meta = NULL, date.var = "date")
dtm |
A quanteda dfm. Alternatively, a DocumentTermMatrix from the tm package can be used, but then the meta parameter needs to be specified manually |
meta |
If dtm is a quanteda dfm, docvars(meta) is used by default (meta is NULL) to obtain the meta data. Otherwise, the meta data.frame has to be given by the user, with the rows of the meta data.frame matching the rows of the dtm (i.e. each row is a document) |
date.var |
The name of the meta column specifying the document date. Default is "date". The values should be of type POSIXlt or POSIXct |
A data.frame with statistics for each term.
tdd = term_day_dist(rnewsflow_dfm, date.var='date')
head(tdd)
tail(tdd)
For each term in m, the usage before and after the document date is compared (with a chi2 test) to see whether usage increased.
term_innovation( m, date, m2 = NULL, date2 = NULL, lwindow = -7, rwindow = 7, date_unit = c("days", "hours", "minutes", "seconds"), min_chi = 5.024, min_ratio = 2, smooth = 1 )
m |
A CsparseMatrix |
date |
A character vector that specifies a date for each row in m. If given, only pairs of rows within a given date range (see lwindow, rwindow and date_unit) are calculated. |
m2 |
Optionally, use a different matrix for calculating the innovation scores. For example, if m is a DTM of press releases, m2 can be a DTM of news articles, to see if term usage increased in the news after the press release. |
date2 |
If m2 is used, date2 has to be used to specify the date for the rows in m2 (otherwise date will be ignored) |
lwindow |
If date (and date2) are used, lwindow determines the left side of the date window. e.g. -10 means that rows are only matched with rows for which date is within 10 [date_units] before. |
rwindow |
Like lwindow, but for the right side. For example, an lwindow of -1 and an rwindow of 1, with date_unit set to "days", means that rows are only matched if their dates are within 1 day of each other. |
date_unit |
The date unit used in lwindow and rwindow. Supports "days", "hours", "minutes" and "seconds". Note that this refers to the time distance between two rows ("days" doesn't refer to calendar days, but to a time of 24 hours). |
min_chi |
The minimum chi-square value |
min_ratio |
The minimum ratio (rwindow score / lwindow score) |
smooth |
The smoothing factor (prevents -Inf/Inf ratio) |
A CsparseMatrix
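As a rough illustration of how min_chi, min_ratio and smooth could interact, the base R sketch below tests a single term with hypothetical counts. The 2x2 table layout and the way smoothing enters the ratio are assumptions and are not taken from the package source. The default min_chi of 5.024 is close to the chi-square critical value for p = 0.025 with one degree of freedom.

```r
# Critical value behind the default min_chi (approximately 5.024)
qchisq(0.975, df = 1)

# Hypothetical counts: uses of one term and of all other terms,
# in the window before (lwindow) and after (rwindow) a document
term_before <- 2;  other_before <- 998
term_after  <- 20; other_after  <- 980

# Chi-square test on the 2x2 table (columns: before, after)
tab <- matrix(c(term_before, other_before,
                term_after,  other_after), nrow = 2)
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Smoothed ratio of usage after versus before (assumed form);
# adding `smooth` avoids dividing by zero when a term was not used before
smooth <- 1
ratio <- (term_after + smooth) / (term_before + smooth)

# The term counts as an innovation if both thresholds are passed
is_innovation <- chi2 >= 5.024 && ratio >= 2
is_innovation
```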
Given a dtm and a similarity (adjacency) matrix, create a new column for each nonzero cell in the similarity matrix. For the term combinations (everything except the diagonal) the column names will be pasted together with a "&" separator (read as AND)
term_intersect(dtm, simmat, as_dfm = T, verbose = F, sep = " & ", par = NA)
dtm |
A quanteda dfm or a CsparseMatrix. |
simmat |
A similarity matrix in CsparseMatrix format. For instance, created with term_char_sim |
as_dfm |
If True, return as quanteda dfm |
verbose |
If True, report progress |
sep |
The separator used for pasting the terms |
par |
If TRUE, add parentheses to colnames before combining. This is mainly for internal use, as it allows specification if OR (term_union) and AND (term_intersect) operations are combined. If NA, this is based on whether parentheses are present. |
A CsparseMatrix or quanteda dfm
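The idea of an AND column can be sketched with a small dense matrix in base R. Everything here is hypothetical: the example terms, the binary 0/1 scoring of the combined column, and the restriction to off-diagonal cells are assumptions, and term_intersect itself works on sparse matrices.

```r
# Hypothetical document-term matrix (3 documents, 3 terms)
dtm <- matrix(c(1, 0, 2,
                1, 1, 0,
                0, 3, 1), nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("tax", "cut", "budget")))

# Hypothetical adjacency matrix: which term pairs should be combined
simmat <- matrix(0, 3, 3, dimnames = list(colnames(dtm), colnames(dtm)))
simmat["tax", "cut"] <- 1
simmat["tax", "budget"] <- 1

# One new column per nonzero (off-diagonal) cell: 1 if both terms occur in the document
pairs <- which(simmat > 0, arr.ind = TRUE)
combined <- vapply(seq_len(nrow(pairs)), function(k) {
  as.numeric(dtm[, pairs[k, "row"]] > 0 & dtm[, pairs[k, "col"]] > 0)
}, FUN.VALUE = numeric(nrow(dtm)))

rownames(combined) <- rownames(dtm)
colnames(combined) <- paste(colnames(dtm)[pairs[, "row"]],
                            colnames(dtm)[pairs[, "col"]], sep = " & ")
combined
```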
Given a dtm and a similarity (adjacency) matrix, group clusters of similar terms (simmat > 0) into a single column. Column names will be concatenated, with a "|" separator (read as OR)
term_union(dtm, simmat, as_dfm = T, verbose = F, sep = "|", par = NA)
dtm |
A quanteda dfm or a CsparseMatrix. |
simmat |
A similarity matrix in CsparseMatrix format. For instance, created with term_char_sim |
as_dfm |
If True, return as quanteda dfm |
verbose |
If True, report progress |
sep |
The separator used for pasting the terms |
par |
If TRUE, add parentheses to colnames before combining. This is mainly for internal use, as it allows specification if OR (term_union) and AND (term_intersect) operations are combined. If NA, this is based on whether parentheses are present. |
A CsparseMatrix or quanteda dfm
dfm = quanteda::tokens(c('That guy Gadaffi','Do you mean Kadaffi?', 'Nah more like Gadaffel','Not Kadaffel?')) |> quanteda::dfm()
simmat = term_char_sim(colnames(dfm), same_start=0)
term_union(dfm, simmat, verbose = FALSE)
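The effect of merging a cluster of similar terms into one OR column can be sketched in base R. In this sketch the cluster membership is given directly as a hypothetical vector instead of being derived from simmat, and summing the term counts within a cluster is an assumption about how the merged column is scored.

```r
# Hypothetical document-term matrix
dtm <- matrix(c(1, 0, 1,
                0, 1, 0,
                2, 1, 0), nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("gadaffi", "kadaffi", "guy")))

# Hypothetical cluster id per term (terms with simmat > 0 end up in the same cluster)
cluster <- c(gadaffi = 1, kadaffi = 1, guy = 2)

# Sum the columns that belong to the same cluster
merged <- t(rowsum(t(dtm), group = cluster))

# Name each merged column by pasting its terms with the "|" separator
colnames(merged) <- as.vector(tapply(names(cluster), cluster, paste, collapse = "|"))
merged
```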