rostrum.blog - Assign abstracts to sifters

tl;dr

A quick and dirty R function to assign abstracts to sifters.

Assign me up

I hacked together a function to assign submitted conference abstracts to sifters for assessment. At its simplest you give it two dataframes: one with a row per abstract, one with a row per sifter. You receive back a list, one element per sifter, with the abstracts they’ve been assigned.

There were several criteria that complicated things. The function:

tries to make assignment counts equal between sifters where possible
results in each abstract being assigned n times, to assure fairness in assessment
ensures each sifter receives a unique set of abstracts
prevents the sifter seeing their own abstracts, if they submitted any
prevents the sifter seeing abstracts by authors with the same affiliation as the sifter, if relevant
respects a maximum assignment ‘cap’, if the sifter has one
allows the user to set a maximum number of iterations to prevent infinite looping (may be possible under certain conditions)

Process

The function itself is split into three main parts:

Setup of variables and counters to be used in the iterations.
A repeat loop that will keep assigning abstracts to the set of sifters until the total abstract pool is exhausted and the various criteria are met.
Within the repeat loop, a for loop that iterates over each sifter to assign them an abstract from their pool of viable abstracts.

Within the for loop are three main steps:

Find the pool of abstracts available to the sifter given various criteria (if any).
Select randomly an abstract from the pool and assign it to the sifter.
Increment the assignment counter for the selected abstract.

The for loop will go to the next sifter if the pool of abstracts for the current sifter is zero. The repeat loop will break if all of the abstracts have been assigned n number of times, according to the assignment_cap argument. It will also break if the number of iterations given by max_iterations has been met.

Definition

The function was developed quickly, is not optimised, is not fully tested and has no defensive programming. But it fulfilled the requirements for the task. I’m recording it here for posterity.

I’ve added some comments and tried to make variable names informative. The abstracts_df and sifters_df inputs are dataframes that have columns for the name and affiliation, along with a maximum-assignment cap column in the sifters_df.

assign_abstracts <- function(
    abstracts_df,
    sifters_df,
    assignment_cap = 2, 
    max_iterations = 1000
) {
  
  # Set up named vectors
  abstracts <- with(abstracts_df, setNames(affiliation, name))
  sifters <- with(sifters_df, setNames(affiliation, name))
  sifter_caps <- with(sifters_df, setNames(cap, name))
  sifter_caps <- sifter_caps[!is.na(sifter_caps)]
  
  # Set up starting variables
  n_abstracts <- length(abstracts)
  seq_abstracts <- seq_len(n_abstracts)
  sifter_assignments <- setNames(vector("list", length(sifters)), names(sifters))
  assignment_counts <- rep(0, n_abstracts)
  iter <- 0
  
  repeat {
    
    for (name in names(sifter_assignments)) {
      
      # 1. Find the pool of abstracts available to this sifter (if any)
      
      # a. Check if sifter cap has been met
      sifter_has_cap <- name %in% names(sifter_caps)
      if (sifter_has_cap) {
        sifter_cap <- sifter_caps[[name]]
        sifter_assignment_count <- length(sifter_assignments[[name]])
      }
      if (sifter_has_cap && sifter_assignment_count == sifter_cap) next
      
      # b. Add abstracts to pool if they have <n assignments
      abstracts_under_cap <- which(assignment_counts < assignment_cap)
      if (length(abstracts_under_cap) == 0) next
      
      # c. Remove abstracts that are already assigned to this sifter
      already_assigned_to_sifter <- sifter_assignments[[name]]
      abstracts_available <- 
        abstracts_under_cap[!abstracts_under_cap %in% already_assigned_to_sifter]
      if (length(abstracts_available) == 0) next
      
      # d. Remove abstracts by the named sifter
      abstracts_by_sifter <- which(name == names(abstracts[abstracts_available]))
      if (length(abstracts_by_sifter) > 0) {
        abstracts_available <- abstracts_available[-abstracts_by_sifter]
      }
      if (length(abstracts_available) == 0) next
      
      # e. Remove abstracts with the same affiliation as the sifter
      sifter_affiliation <- unname(sifters[name])
      abstracts_by_same_affiliation <- 
        which(sifter_affiliation == unname(abstracts[abstracts_available]))
      if (length(abstracts_by_same_affiliation) > 0) {
        abstracts_available <- 
          abstracts_available[-abstracts_by_same_affiliation]
      }
      if (length(abstracts_available) == 0) next
      
      # 2. Select randomly from pool and assign to sifter
      abstract_selected <- .resample(abstracts_available, 1)
      sifter_assignments[[name]] <- 
        c(sifter_assignments[[name]], abstract_selected)
      
      # 3. Increment count for sampled abstract
      assignment_counts[abstract_selected] <- 
        assignment_counts[abstract_selected] + 1
      if (all(assignment_counts == assignment_cap)) break
      
    }
    
    # Reorder so sifter with fewest assignments gets next assignment first
    sifter_assignments <- sifter_assignments[order(lengths(sifter_assignments))]
    
    iter <- iter + 1
    
    if (all(assignment_counts == assignment_cap)) break
    
    if (iter == max_iterations) {
      message("max_iterations reached")
      break
    }
    
  }
  
  sifter_assignments <- lapply(sifter_assignments, sort)
  sifter_assignments[order(names(sifter_assignments))]
  
}

.resample <- function(x, ...) x[sample.int(length(x), ...)]

Of course, it’s too big and should be broken into smaller functions, particularly each of the steps in the for loop. Also, you feed in dataframes, but these are converted immediately to named vectors for processing. In part this reflects the ease of handling named vectors, but is also a legacy of when the requirements were far simpler. The requirements grew more complicated over time, so it became a Frankenfunction

Note also the bespoke .resample() function because sample() operates differently depending on whether you give it a vector or a single value¹. If only one abstract is left in the pool, e.g. abstract number 13, then sample(13) won’t output 13, it will actually output a value from 1 to 13.

Example

Demo data

Let’s create some fake data using the {charlatan} package. Let’s imagine we have some sifters and their affiliations. One sifter only has time to do 10 assessments, so they have an assignment cap value.

set.seed(1)

n_sifters <- 5
sifter_companies <- charlatan::ch_company(n_sifters)
sifter_names <- charlatan::ch_name(n_sifters)

(sifters_df <- data.frame(
  name = sifter_names,
  affiliation = sifter_companies,
  cap = c(10, rep(NA_real_, n_sifters - 1))
))

              name                        affiliation cap
1 Justen Powlowski                         Paucek Inc  10
2 Jon Blick-Erdman                    Ziemann-Ziemann  NA
3   Cannon Hegmann                        Wyman-Wyman  NA
4  Nichelle Schoen Daugherty, Daugherty and Daugherty  NA
5   Earley Monahan                        Walsh-Walsh  NA

Now let’s create some fake abstracts, again with names and affiliations. The abstract titles here are just random species names, so let’s pretend we’re at a taxonomists’ conference or something. Let’s make it so the sifters have each submitted an abstract of their own and that there’s at least one other submission from their organisation.

Of course, your abstract dataset is likely to contain additional information, like the actual text of the abstract and other details like the author’s geographic location and talk-type preference (poster, plenary, etc). If you’ve used an online survey service then you can usually download a CSV of the results or connect to their API to get the data.

total_abstracts <- 30

abstracts_df <- data.frame(
  name = c(sifter_names, charlatan::ch_name(total_abstracts - n_sifters)),
  affiliation = c(
    rep(sifter_companies, 2),
    charlatan::ch_company(total_abstracts - (2 * n_sifters))
  ),
  title = charlatan::ch_taxonomic_species(total_abstracts)
)

abstracts_df <- abstracts_df[sample(nrow(abstracts_df)), ]  # shuffle
row.names(abstracts_df) <- NULL

head(abstracts_df)

                  name                        affiliation                 title
1     Jon Blick-Erdman                    Ziemann-Ziemann   Coniogramme euantha
2       Tavaris Reilly                        Wyman-Wyman       Tristemon egena
3 Shavonne Ziemann PhD            Satterfield-Satterfield  Pichleria majungense
4      Nichelle Schoen Daugherty, Daugherty and Daugherty      Zelkova vigilans
5        Durrell Mertz                Jaskolski-Jaskolski Papuechites denutatum
6      Mikayla Rau DDS                        Stark-Stark   Oldfieldia mohriana

Run

Let’s provide the abstracts_df and sifters_df dataframes to the function, along with the number of times each abstract will need to be assessed.

n <- 2

assignments <- assign_abstracts(
  abstracts_df,
  sifters_df,
  assignment_cap = n
)

Here’s what the output looks like. It’s a named list with one element per sifter. The values are the index of that abstract in the vector provided to the abstracts_df argument.

assignments

$`Cannon Hegmann`
 [1]  1  4  6  8 10 12 16 19 20 26 28 29

$`Earley Monahan`
 [1]  1  4  5  7 10 12 13 17 22 25 27 28

$`Jon Blick-Erdman`
 [1]  7  8 11 14 15 16 17 18 21 22 23 24 30

$`Justen Powlowski`
 [1]  2  3  5  9 14 18 20 23 26 29

$`Nichelle Schoen`
 [1]  2  3  6  9 11 13 15 19 21 24 25 27 30

These indices can be matched back to the original dataset. Here’s an example for the first sifter.

assignment_df <- abstracts_df[assignments[[1]], ]
head(assignment_df)

                             name                        affiliation
1                Jon Blick-Erdman                    Ziemann-Ziemann
4                 Nichelle Schoen Daugherty, Daugherty and Daugherty
6                 Mikayla Rau DDS                        Stark-Stark
8                  Earley Monahan                        Walsh-Walsh
10               Heriberto Feeney                    Ziemann-Ziemann
12 Muhammad Stoltenberg-Hermiston                        Tillman LLC
                     title
1      Coniogramme euantha
4         Zelkova vigilans
6      Oldfieldia mohriana
8      Isoetella flaccidum
10     Tylophora serrulata
12 Microtropis turubalense

You could wrangle this into an anonymised dataframe with columns for the sifter to provide their assessment.

anon_df <- assignment_df[, "title", drop = FALSE]
anon_df$score <- NA_real_
anon_df$comments <- NA_character_
anon_df

                         title score comments
1          Coniogramme euantha    NA     <NA>
4             Zelkova vigilans    NA     <NA>
6          Oldfieldia mohriana    NA     <NA>
8          Isoetella flaccidum    NA     <NA>
10         Tylophora serrulata    NA     <NA>
12     Microtropis turubalense    NA     <NA>
16         Coussapoa anatuyana    NA     <NA>
19 Chiloglottis brideliifolius    NA     <NA>
20            Crocus candollei    NA     <NA>
26         Pistacia weinmannii    NA     <NA>
28         Baillaudea dodsonii    NA     <NA>
29         Syngonium tangutica    NA     <NA>

And then you can return this back to the sifter. The low-tech mechanism would be to put this into a spreadsheet output with {openxlsx}, for example. Much better would be to create a simple Shiny app hosted on Posit Connect or something, allowing each sifter to see their assigned abstracts and submit their assessments.

Check

Great, but the output actually meet the initial requirements for the system? Let’s take a look.

Was each abstract assigned the number of times specified by assignment_cap?

all(table(unlist(assignments)) == n)

[1] TRUE

Here you can see that sifters received a near-equal number of abstracts, apart from the sifter who had a specified maximum-assignment cap.

lengths(assignments)

  Cannon Hegmann   Earley Monahan Jon Blick-Erdman Justen Powlowski 
              12               12               13               10 
 Nichelle Schoen 
              13

Was each sifter assigned a unique set of abstracts?

all(lengths(lapply(assignments, unique)) == lengths(assignments))

[1] TRUE

Did anyone receive their own abstract?

sifter_names <- sifters_df[sifters_df$name %in% names(assignments), "name"]

has_own_abstract <- vector("list", length = length(sifter_names)) |> 
  setNames(sifter_names)

for (i in seq_along(sifter_names)) {
  sifter_name <- sifter_names[i]
  abstract_names <- abstracts_df$name[assignments[[i]]]
  has_own_abstract[[i]] <- all(sifter_name == abstract_names)
}

unlist(has_own_abstract)

Justen Powlowski Jon Blick-Erdman   Cannon Hegmann  Nichelle Schoen 
           FALSE            FALSE            FALSE            FALSE 
  Earley Monahan 
           FALSE

Did any of the sifters get assigned abstracts from their own affiliation?

affiliations <- 
  sifters_df[sifters_df$name %in% names(assignments), "affiliation"]

has_affiliate_abstract <- vector("list", length = length(assignments)) |> 
  setNames(affiliations)

for (i in seq_along(affiliations)) {
  sifter_affiliation <- affiliations[i]
  abstract_affiliations <- abstracts_df$affiliation[assignments[[i]]]
  has_affiliate_abstract[[i]] <- all(sifter_affiliation == abstract_affiliations)
}

unlist(has_affiliate_abstract)

                        Paucek Inc                    Ziemann-Ziemann 
                             FALSE                              FALSE 
                       Wyman-Wyman Daugherty, Daugherty and Daugherty 
                             FALSE                              FALSE 
                       Walsh-Walsh 
                             FALSE

Okey-doke.

What now?

This could definitely be better.

As mentioned, there’s a lot of refactoring that could be done, recognising that it was developed rapidly with changing requirements. I’m reflecting on it now that it’s solved the problem, but eventually it may be refactored or rewritten from scratch.

This would make sense if we (or you) want to use it in other scenarios or as part of a more generic package in future.

Or, as usual, this functionality probably exists in some package already and you can tell me all about it.

Environment

Session info

Last rendered: 2024-03-27 15:42:32 GMT

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.33     utf8_1.2.4        R6_2.5.1          fastmap_1.1.1    
 [5] xfun_0.41         magrittr_2.0.3    glue_1.7.0        tibble_3.2.1     
 [9] knitr_1.45        pkgconfig_2.0.3   htmltools_0.5.6.1 rmarkdown_2.25   
[13] lifecycle_1.0.4   cli_3.6.2         fansi_1.0.6       vctrs_0.6.5      
[17] compiler_4.3.1    rstudioapi_0.15.0 tools_4.3.1       whisker_0.4.1    
[21] pillar_1.9.0      evaluate_0.23     charlatan_0.5.1   yaml_2.3.8       
[25] rlang_1.1.3       jsonlite_1.8.7    htmlwidgets_1.6.2

Footnotes

The ‘single sample switch’ as Patrick Burns puts it in The R Inferno (section 8.2.33).↩︎

Reuse

CC BY-NC-SA 4.0