Import citation data
Finding communities
Ranking communities
- Inter-field ranking
- Intra-field ranking
Save results

This vignette performs and documents the analysis for the visualisation Measuring influence in scientific fields, available online. Unfortunately the raw data is not freely available, so only the methodology may be published for now.

Looking for the D3.js visualisation code? See http://selbydavid.com/influence/influence.js

We will use the following packages.

library(BradleyTerryScalable) # 0.1.0
library(dplyr)
library(ggplot2); theme_set(theme_bw())
library(ggrepel)
library(igraph)
library(qvcalc)
library(tidyr)

Import citation data

Citation data has been provided by Thomson Reuters/Clarivate Analytics. It includes data from Journal Citation Reports (JCR) year 2015.

JCR_raw <- read.table('data/2015_Cited.txt', stringsAsFactors = FALSE,
                      col.names = c('Cited', 'Citing', 'AllYears', 2014:2005, 'Earlier'))

We have 14 columns but are only interested in the identities of the citing and cited journals and the number of citations issued in 2015 to publications from the preceding 10 years (2005–2014). Any pairs of journals with no citations in this period may be omitted.

JCR <-
  JCR_raw %>%
  transmute(Citing, Cited, weight = AllYears - Earlier) %>%
  filter(weight > 0)

We have arranged a data frame into a format suitable for conversion to an igraph object: three columns, corresponding to “source”, “destination” and edge weight.

G <- graph_from_data_frame(JCR)

Journals that give out zero citations or receive none themselves are not really interesting and can cause problems by making the network not to be strongly connected. We will only keep the main strongly-connected component of the graph.

components <- components(G, 'strong')
G <- delete_vertices(G, components$membership != which.max(components$csize))

Finding communities

There are two different techniques we might use to discover communities in the citation network: the directed Louvain algorithm and the map equation.

Respective authors have published software to run each algorithm, but they do not use a common interface; each requires input data to be in a slightly different format. Thus our next task is to format our citation graph into files appropriate for each programme. This is a bit annoying — hopefully in the future we can unify this into a single interface.

Formatting citation data

The following function converts an igraph object into pajek format and writes it to a file. The Louvain algorithm doesn’t require the vertices and arcs to be declared, but the Infomap algorithm does, so we have a slightly different format for each procedure.

write_pajek <- function(g, file = stdout(), louvain = FALSE) {
  nvertices <- igraph::vcount(g)
  output <- paste('*Vertices', nvertices)
  output <- c(output, paste0(1:nvertices, ' "', V(g)$name, '"'))
  output <- c(output, "*Arcs")
  edges <- igraph::ends(g, 1:igraph::ecount(g))
  # Convert vertex names to indices
  edges <- apply(edges, 2, function(x) match(c(x), V(g)$name))
  edges <- apply(edges, 1, paste, collapse = ' ')
  if (louvain) # Only requires `src dest weight`
    output <- NULL
  output <- c(output, paste(edges, E(g)$weight))
  if (is.character(file)) {
    file <- file(file)
    on.exit(close(file))
  }
  writeLines(output, file)
}

Using this function, we write the citations data to disk.

write_pajek(G, 'data/JCR_infomap.net')
write_pajek(G, 'data/JCR_louvain.txt', louvain = TRUE)

Run Louvain algorithm

Now we can run the directed Louvain algorithm. This requires an extra step: first we have to convert from plaintext to the Louvain algorithm’s dedicated binary format. This input is then passed to the Louvain algorithm proper.

# Convert to binary
system('bin/convert -i data/JCR_louvain.txt -o data/JCR_louvain.bin -w data/JCR_louvain.weights')

# Run algorithm
system('./bin/community.exe data/JCR_louvain.bin -w data/JCR_louvain.weights -l -1',
       intern = TRUE, ignore.stderr = TRUE) -> lv_output

The output ‘tree’ in lv_output should now be processed into a suitable data frame. As our input file was indexed from one but C++ indexes from zero, the community.exe programme has introduced a disjoint node, 0 which lives in its own category. We need to omit this. (Although the convert.exe has an option -r that supposedly renumbers nodes from zero, it doesn’t seem to work.)

louvain <- lapply(lv_output, strsplit, ' ') %>%
  unlist() %>% as.numeric() %>%
  matrix(byrow = TRUE, ncol = 2) %>%
  as.data.frame() %>%
  setNames(c('node', 'community')) %>%
  filter(node != 0) %>% # remove nonexistent 0 node, added as an artifact by C++.
  mutate(level = cumsum(node == 1)) # every time node counter resets, go to next level

A useful check: does our output have the same number of journals as the input did?

Tests passed

stopifnot(vcount(G) == nrow(filter(louvain, level == 1)))

There are 122 communities in level 1, 14 communities in level 2, 10 in level 3 and 10 in level 4. The latter corresponds to the maximum modularity solution. The following code walks down the hierarchy and finds the level 4 community to which each original node has been assigned.

louvain_communities <- louvain %>%
  inner_join(filter(louvain, level == 2) %>%
               select(-level, community2 = community),
             by = c('community' = 'node')) %>%
  inner_join(filter(louvain, level == 3) %>%
               select(-level, community3 = community),
             by = c('community2' = 'node')) %>%
  inner_join(filter(louvain, level == 4) %>%
               select(-level, community4 = community),
             by = c('community3' = 'node')) %>%
  filter(level == 1) %>% select(-level)

Now we need to match the communities up with the original journal names. A table of journal abbreviations and full names has been downloaded from the Journal Citation Reports web site.

JCR_abbreviations <- read.csv('data/2015_JCR_abbreviations.csv',
                              skip = 1,
                              stringsAsFactors = FALSE)[, 2:3]

The raw journal names are in a mix of all caps and title case, so we will use some tools to get everything looking a bit more presentable. The function capitaliseJournalTitle() converts a character string to Title Case.

capitaliseJournalTitle <- function(y) {
  y <- tolower(y)
  unname(
    sapply(y, function(x) {
    x <- gsub('-', ': ', x)
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
  })
  )
}

Now, we join the directed Louvain output with the abbreviated journal titles, which are joined in turn with the full journal titles (converted to Title Case). The resulting data can be written to disk.

louvain_named <- louvain_communities %>%
  mutate(journal = V(G)$name[node]) %>%
  left_join(JCR_abbreviations, by = c('journal' = 'JCR.Abbreviated.Title')) %>%
  mutate(Full.Journal.Title = capitaliseJournalTitle(Full.Journal.Title)) %>%
  distinct(Full.Journal.Title, .keep_all = TRUE) # as left_join makes duplicates

louvain_named %>%
  select(community1 = community,
         community2:community4,
         journal_abbr = journal,
         journal_full = Full.Journal.Title) %>%
  write.csv('output/louvain_communities.csv', row.names = FALSE)

Run Infomap algorithm

Now, we can run the Infomap algorithm and save the output as follows. Adding -2 optimises for a two-level partition (i.e. community, node) without a deeper hierarchy. The -d argument indicates the network has directed edges. The -k argument takes self-loops into account (as they are ignored by default).

We can set the --num-trials with -N. This means we run the outer algorithm N times and pick the run with the best solution. We can set the --seed with -s (for reproducibility).

The programme doesn’t let you specify the output filename; only the folder. To avoid having a convoluted folder structure, we will run the algorithm, saving the file as the same name each time to output/, immediately renaming it using file.rename() to something more sensible.

infomap_output_files <- paste0('output/infomap_',
                               c('undirected', 'directed', 'directed_self'),
                               '.tree')

if (any(!file.exists(infomap_output_files))) { # Only run if not run already
  # Undirected
  system('bin/Infomap data/JCR_infomap.net output -2u -N100 -s2018')
  file.rename('output/JCR_infomap.tree', infomap_output_files[1])
  
  # Directed, ignore self-loops
  system('bin/Infomap data/JCR_infomap.net output -2d -N100 -s2018')
  file.rename('output/JCR_infomap.tree', infomap_output_files[2])
  
  # Directed, consider self-loops
  system('bin/Infomap data/JCR_infomap.net output -2dk -N100 -s2018')
  file.rename('output/JCR_infomap.tree', infomap_output_files[3])
}

stopifnot(!file.exists('output/JCR_infomap.tree')) # check nothing left behind

info_undirected <- read.table(infomap_output_files[1],
                              col.names = c('path', 'flow', 'name', 'node'))
info_directed <- read.table(infomap_output_files[2],
                            col.names = c('path', 'flow', 'name', 'node'))
info_directed_self <- read.table(infomap_output_files[3],
                                 col.names = c('path', 'flow', 'name', 'node'))

Let’s see what the output looks like.

head(info_undirected)

path	flow	name	node
1:1	0.0197050	PLOS ONE	73
1:2	0.0076432	P NATL ACAD SCI USA	283
1:3	0.0060845	SCI REP-UK	46
1:4	0.0060106	NATURE	36
1:5	0.0053352	SCIENCE	307
1:6	0.0039210	J BIOL CHEM	559

The flow column gives us a journal-level PageRank ranking of the network for free. Notice the top few journals are Nature, PNAS, Science, PLoS One et al.

The path is hierarchical, representing the community and location within that community. This can be separated out into columns.

We have three different configurations: undirected, directed and directed with self-loops. The first one is included mainly for completeness, but to compare properly with Louvain we will assume we want one of the directed variants.

Since the hierarchical structure is stored as a string in the path column, our first task is to separate this into multiple columns.

info_undirected <- info_undirected %>%
  separate(path,
           paste0('level', 1:2),
           sep = ':',
           convert = TRUE)
info_directed <- info_directed %>%
  separate(path,
           paste0('level', 1:2),
           sep = ':',
           convert = TRUE)
info_directed_self <- info_directed_self %>%
  separate(path,
           paste0('level', 1:2),
           sep = ':',
           convert = TRUE)
head(info_undirected)

level1	level2	flow	name	node
1	1	0.0197050	PLOS ONE	73
1	2	0.0076432	P NATL ACAD SCI USA	283
1	3	0.0060845	SCI REP-UK	46
1	4	0.0060106	NATURE	36
1	5	0.0053352	SCIENCE	307
1	6	0.0039210	J BIOL CHEM	559

Undirected Infomap returns 70 communities, directed Infomap returns 69 communities and directed Infomap with self-loops returns 450 communities. It looks like we should go for the directed Infomap algorithm, ignoring self-loops.

As with the Louvain algorithm, we need to bind the community membership structure with journal titles.

infomap_communities <- info_directed %>%
  left_join(JCR_abbreviations, by = c(name = 'JCR.Abbreviated.Title')) %>%
  mutate(Full.Journal.Title = capitaliseJournalTitle(Full.Journal.Title)) %>%
  distinct(Full.Journal.Title, .keep_all = TRUE)

## Warning: Column `name`/`JCR.Abbreviated.Title` joining factor and character
## vector, coercing into character vector

A quick unit test to check that all the journals appear in the abbreviation translation table:

Tests passed

stopifnot(all(info_directed$name %in% infomap_communities$name))

Then save the data to disk.

infomap_communities %>%
  select(community = level1,
         journal_abbr = name,
         journal_full = Full.Journal.Title) %>%
  write.csv('output/infomap_communities.csv',
            row.names = FALSE)

Super-journals

To rank communities against one another it is useful to combine communities into ‘super-journals’ and study the citations between them.

Let’s (optionally) save the Infomap citations to disk:

infomap_superjournals <- JCR %>%
  inner_join(infomap_communities %>%
               select(from = level1, name),
             by = c(Citing = 'name')) %>%
  inner_join(infomap_communities %>%
               select(to = level1, name),
             by = c(Cited = 'name')) %>%
  group_by(from, to) %>%
  summarise(weight = sum(weight)) %>%
  ungroup()

# write.csv(infomap_superjournals,
#           'output/citations_infomap.csv',
#           row.names = FALSE)

With a quick unit test that we have the right number of communities:

Tests passed

ncomms <- length(unique(infomap_communities$level1))
stopifnot(length(unique(infomap_superjournals$from)) == ncomms)
stopifnot(length(unique(infomap_superjournals$to)) == ncomms)

And the Louvain citations:

louvain_superjournals <- JCR %>%
  inner_join(louvain_named %>% select(from = community, journal), by = c(Citing = 'journal')) %>%
  inner_join(louvain_named %>% select(to = community, journal), by = c(Cited = 'journal')) %>%
  group_by(from, to) %>%
  summarise(weight = sum(weight)) %>%
  ungroup()

# write.csv(louvain_superjournals,
#           'output/citations_louvain.csv',
#           row.names = FALSE)

Ranking communities

Now we have aggregated the communities into ‘super-journals’, let’s rank them. The remainder of this document is concerned with Infomap communities only.

We use the quasi-Stigler model described in Varin et al. (2016).

Inter-field ranking

Stigler_superjournals <- 
  infomap_superjournals %>%
  xtabs(weight ~ to + from, data = .) %>%
  scrooge::BradleyTerry()

Stigler_superjournal_scores <-
  Stigler_superjournals %>%
  coef() %>%
  c(0, .) %>%
  setNames(1:ncomms) %>%
  (function(x) x - mean(x))

One way to calculate approximate confidence intervals is with quasivariances, for which we shall use the qvcalc package.

covariance_matrix <- matrix(0, ncomms, ncomms)
covariance_matrix[-1, -1] <- vcov(Stigler_superjournals)

## Warning in summary.glm(object, ...): observations with zero weight not used
## for calculating dispersion

interfield_quasivariances <- qvcalc(covariance_matrix,
                                    estimates = Stigler_superjournal_scores)

Intra-field ranking

First we need to filter the Journal Citation Reports down to the strongly-connected core.

JCR_core <- JCR %>%
  filter(Citing %in% V(G)$name,
         Cited %in% V(G)$name)

Then we bind the individual journals in the JCR data to their respective communities.

JCR_Infomap <- JCR_core %>%
    left_join( infomap_communities %>%
    select(Citing = name, fromCommunity = level1) ) %>%
  left_join( infomap_communities %>%
    select(Cited = name, toCommunity = level1) ) %>%
  select(Cited, Citing, weight, fromCommunity, toCommunity)

## Joining, by = "Citing"

## Joining, by = "Cited"

Then we can fit a Bradley–Terry model to each community. This uses Ella Kaye’s BradleyTerryScalable package.

The following function attempts to fit a quasi-Stigler model to each of the 69 communities.

tryBradleyTerry <- function(d, player0 = TRUE) {
  union(pull(d, Cited), pull(d, Citing)) %>%
    unique %>% length -> nUniqueJournals
  if ( nUniqueJournals < 2 )
    return(NA)
  tryCatch(
    btfit(btdata(d), a = 1),
    error = function(e) e
  )
}

Once each model has been fitted (or an error returned, in the case of singleton communities), we can then extract the model coefficients (else return NA).

tryBTscores <- function(model) {
  if (!inherits(model, 'btfit'))
    return (NA)
  tryCatch({
    if ( is.list(coef(model)) )
      unname(coef(model)) # removes spurious prefixes like '1.PHILOS PUBLIC AFF'
        # which can occur if network does not appear to be fully connected
    else
      coef(model)
    },
    error = function(e) e)
}

Player 0

Algorithms such as PageRank have a damping factor that ensures ergodicity when networks may not be well-connected. This is a concern for us, too. How can we make sure that the quasi-Stigler model fitted to each community is stable?

By introducing a ‘player 0’ or ‘journal 0’, who wins and loses against every other player a constant number of times. This effectively adds shrinkage to the model; any journals with a very small number of citations given or received are likely to see their scores shrink towards the mean (i.e. 0).

For a feeling of familiarity with PageRank, but not for any particularly theoretically-justified reason, we will use the parameter 0.15, so that each player zero cites and is cited \[0.15 \times \frac{\sum_{i=1}^n \sum_{j=1}^n c_{ij}} {2n}\] times, where \(c_{ij}\) is the number of citations from journal \(i\) to journal \(j\), and \(n\) is the total number of journals in that community (excluding the 0^th journal, of course).

player0_weights <- JCR_Infomap %>%
  filter(fromCommunity == toCommunity) %>% # Within-field citations only
  group_by(community = fromCommunity) %>%
  select(-fromCommunity, -toCommunity) %>%
  summarise(weight = .15 * sum(weight) / length(union(Cited, Citing)) / 2)

inPlayer0Weights <- seq_len(max(JCR_Infomap$fromCommunity)) %in% player0_weights$community
if (!all(inPlayer0Weights))
  warning('Community ', which(!inPlayer0Weights), ' not in `player0_weights`')

## Warning: Community 69 not in `player0_weights`

player0_grid <- expand.grid(
    Cited = 'Player 0',
    Citing = union(JCR_Infomap$Cited, JCR_Infomap$Citing)
  ) %>%
  left_join(select(infomap_communities, community = level1, name),
            by = c(Citing = 'name')) %>%
  rbind(select(., Cited = Citing, Citing = Cited, community)) %>%
  left_join(player0_weights, by = 'community')

## Warning: Column `Citing`/`name` joining factor and character vector,
## coercing into character vector

‘All others’

Especially interesting is the introduction of a super-journal called ‘All others’, representing the rest of the network. The magnitude of the shift in journal ranking this induces can be used as some measure of how inter-disciplinary (or insular) a journal is. We won’t bother with a ‘player 0’ in this case, since the ‘All others’ journal will effectively play a similar role.

It is safe to ignore the number of self-citations that ‘All others’ has, since they will cancel out in the model anyway.

other_weights <- JCR_Infomap %>% 
  filter(fromCommunity != toCommunity) %>%
  group_by(Cited) %>%
  summarise(Citing = 'Other fields',
            weight = sum(weight),
            toCommunity = unique(toCommunity),
            fromCommunity = toCommunity) %>%
  bind_rows(
    JCR_Infomap %>%
      filter(fromCommunity != toCommunity) %>%
      group_by(Citing) %>%
      summarise(Cited = 'Other fields',
                weight = sum(weight),
                fromCommunity = unique(fromCommunity),
                toCommunity = fromCommunity)
  )

Fitting quasi-Stigler models

Now we fit the models.

Stigler_models <- JCR_Infomap %>%
  bind_rows(other_weights) %>%
  # Intra-community citations only:
  filter(fromCommunity == toCommunity) %>% 
  group_by(community = fromCommunity) %>%
   # The two community cols are now redundant:
  select(-fromCommunity, -toCommunity) %>%
  # Add `player 0` to each community:
  bind_rows(player0_grid) %>%
  # Fit quasi-Stigler models with/without player 0
  do(Stigler = select(., -community) %>%
       filter(Cited != 'Player 0',
              Citing != 'Player 0',
              Cited != 'Other fields',
              Citing != 'Other fields') %>%
       tryBradleyTerry,
     Stigler0 = select(., -community) %>%
        filter(Cited != 'Other fields',
               Citing != 'Other fields') %>%
       tryBradleyTerry,
     StiglerXF = select(., -community) %>% # XF = 'external fields'
       filter(Cited != 'Player 0',
              Citing != 'Player 0') %>%
       tryBradleyTerry) %>%
  mutate(BTscores = list(tryBTscores(Stigler)),
         BTscores0 = list(tryBTscores(Stigler0)),
         BTscoresXF = list(tryBTscores(StiglerXF)))

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

Quasi-variances

See Firth and de Menezes (2004). ‘Quasi-variances’. Biometrika, 91, 1, pp. 65–80. The qvcalc R package is not well optimised for large datasets. We can get around this by writing our own quasi-variance loss functions in C++ and minimising using BFGS.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double logloss2 (NumericVector q, NumericMatrix B) {
  int N = B.nrow();
  double v, e, loss = 0.0;
  
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < i; ++j) {
      v = B(i, i) + B(j, j) - 2 * B(i, j); // Variance of simple contrast
      e = (q[i] + q[j]) / v; // Relative error
      loss += log(e) * log(e); // Squared log loss
    }
  }
  
  return loss;
}

// [[Rcpp::export]]
NumericVector Dlogloss2 (NumericVector q, NumericMatrix B) {
  int N = B.nrow();
  double v, e;
  NumericVector grad(N);
  
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < i; ++j) {
      v = B(i, i) + B(j, j) - 2 * B(i, j);
      e = (q[i] + q[j]) / v;
      grad[i] += 2 * log(e) / (q[i] + q[j]);
    }
  }
  
  return grad;
}

With the functions defined, let’s optimise:

tryQV <- function(model) {
  if (!inherits(model, 'btfit'))
    return(NA)
  vCov <- as.matrix(vcov(model))
  q0 <- diag(vCov)
  qvars <- optim(q0, logloss2, method = 'BFGS', gr = Dlogloss2, B = vCov)$par
  return (sqrt(qvars)) # But sometimes quasi-variances are negative!
}

tryQV2 <- function(model) {
  # Return standard errors of differences with 'Other fields'
  # as a reference point. Because quasi-variances unstable.
  # Var(mu1 - mu2) = Var(mu1) + Var(mu2) - 2 * Cov(mu1, mu2)
  # where Var('Other fields') = Cov('Other fields', .) = 0
  if (!inherits(model, 'btfit'))
    return(NA)
  vCov <- vcov(model, ref = 'Other fields')
  SEs <- sqrt(Matrix::diag(vCov))
  return(SEs)
}

Stigler_models <- Stigler_models %>%
  mutate(quasiSE0 = list(tryQV(Stigler0)),
         quasiSEother = list(tryQV2(StiglerXF)))

Quasi-variances don’t work so well at approximating the comparison intervals when the ‘Other fields’ superjournal is included. Therefore for the ‘wider influence’ case, we use tryQV2, which calculates the variances of comparisons relative to the ‘Other fields’ superjournal. This means our ‘wider influence’ comparison intervals are only valid for comparisons with the ‘Other fields’ journal; not between regular journals. At least we don’t end up with negative quasi-variances!

Parse models

Now we need to do a bit of housekeeping. Where communities contained a single journal, no model has been fitted and instead NA has been returned. Also, some communities are not strongly connected, in which case the dangling nodes have been omitted without coefficients. This doesn’t really apply to the damped case, because all the nodes should have been artificially connected by player 0.

# Export scores
Stigler_scores <-
  Stigler_models %>%
  rowwise() %>%
  mutate(BTscores = list(unlist(BTscores))) %>%
  pull(BTscores) %>% unlist() %>%
  data_frame(journal = names(.),
             influence = .)

Stigler_scores0 <-
  Stigler_models %>%
  pull(BTscores0) %>% unlist() %>%
  data_frame(journal = names(.),
             influence = .)

# Quasi variances
Stigler_qvars0 <-
  Stigler_models %>%
  pull(quasiSE0) %>% unlist() %>%
  data_frame(journal = names(.),
             quasiSE = .)

Parsing the ‘all other’ journals requires some extra work, because it is a journal which has the same name in every community.

Stigler_scores_other <-
  Stigler_models %>%
  pull(BTscoresXF) %>% unlist() %>%
  data_frame(journal = names(.),
             influence2 = .)

Stigler_qvars_other <-
  Stigler_models %>%
  pull(quasiSEother) %>% unlist() %>%
  data_frame(journal = Stigler_scores_other$journal,
             quasiSE2 = .)

# For binding to the end of the results
external_influence <- Stigler_scores_other %>%
  filter(journal == 'Other fields') %>%
  cbind(Stigler_qvars_other %>%
          filter(journal == 'Other fields') %>%
          select(-journal)) %>%
  mutate(memberOf = row_number())

What effect does player 0 have on scores?

Stigler_scores0 %>%
  left_join(Stigler_scores, by = 'journal', suffix = c('.Player0', '.mle')) %>%
  ggplot(aes(influence.mle, influence.Player0)) +
    geom_point(alpha = .5) +
    labs(x = 'Ordinary Stigler-model export scores',
         y = 'Damped with \'Player 0\'',
         caption = 'Damping set to .15*N/d/2')

## Warning: Removed 117 rows containing missing values (geom_point).

And the ‘All others’ journals?

Stigler_scores_other %>%
  left_join(Stigler_scores, by = 'journal') %>%
  ggplot(aes(influence, influence2)) +
    geom_point(alpha = .5) +
    labs(x = 'Within-field Stigler-model export scores',
         y = 'Including inter-field citations')

## Warning: Removed 115 rows containing missing values (geom_point).

Save results

Though our data are arguably hierarchical, we will produce a ‘tidy’ structure, where each row describes the super-journal to which that journal belongs, and where every super-journal belongs to a pseudo-super-journal 0.

This file will be saved to csv so it can be read into d3.js, or other visualisation software.

interfield <- data_frame(
  journal = names(Stigler_superjournal_scores),
  influence = Stigler_superjournal_scores,
  quasiSE = interfield_quasivariances$qvframe$quasiSE,
  memberOf = 0L
)

intrafield <- infomap_communities %>%
  select(journal = name,
         memberOf = level1,
         full_name = Full.Journal.Title) %>% # = NA for superjournals
  left_join(Stigler_scores0, by = 'journal') %>%
  left_join(Stigler_qvars0, by = 'journal') %>%
  left_join(Stigler_scores_other, by = 'journal') %>%
  left_join(Stigler_qvars_other, by = 'journal') %>%
  bind_rows(external_influence)

if (nmissing <- sum(is.na(intrafield$influence))) {
  warning(nmissing, ' journal/s have no influence score\n',
          sum(is.na(intrafield$quasiSE)), ' journal/s have no quasi-variance\n')
}

## Warning: 70 journal/s have no influence score
## 70 journal/s have no quasi-variance

output <- bind_rows(interfield, intrafield) %>%
  group_by(memberOf) %>%
  mutate(rank = dense_rank(desc(influence)),
         rank2 = dense_rank(desc(influence2))) %>%
  ungroup()
  
write.csv(output, 'docs/data/rankings.csv', row.names = FALSE)

Measuring influence in scientific fields

David Selby

David Firth

2 March 2018

Import citation data

Finding communities

Formatting citation data

Run Louvain algorithm

Run Infomap algorithm

Super-journals

Ranking communities

Inter-field ranking

Intra-field ranking

Player 0

‘All others’

Fitting quasi-Stigler models

Quasi-variances

Parse models

Save results