suppressPackageStartupMessages({
library(data.table) # a better data.frame
library(stringr) # simple string handling
library(tictoc) # timing
})
tl;dr
I used the R package {data.table} to find the highest- and lowest-scoring UK postcodes based on the sum of their numbers and letters (A = 1, B = 2, etc). You can jump to the results.
The premise
Yesterday I noticed that the hashtag #PostcodePandemonium was trending on Twitter.1 The premise was to sum the numbers and letters in your postcode, where the letters have been converted to their position in the alphabet (i.e. A = 1, B = 2, etc). Highest value ‘wins’.
Which existing postcode has the highest score? And the lowest?
Process
Attach packages
I’ve been using Matt Dowle and Arun Srinivasan’s lightning-fast {data.table} package recently and wanted to use it here to handle millions of UK postcodes. I’ve prioritised for readability in this post rather than efficiency, but let me know how to improve things.
I’m using Sergei Izrailev’s {tictoc} package to time the processes throughout.
Get the data
The latest postcode data (February 2020) is available on the Open Geography Portal by the Office for National Statistics. From there you can download a zipped folder that contains the file we want, NSPL_FEB_2020_UK.csv
.
Note
I re-rendered this post in July 2023 and the link to the February 2020 postcode file no longer works. Instead, we’ll use the latest at time of writing: February 2023.
First, you can download the .zip to a temporary location on your machine. The file is pretty large (about 180 MB), so I’m increasing the timeout value for download.file()
so that the download has time to complete. You might want to consider downloading the file to your local computer and reading it from there.
# URL to postcode zip file
<- paste0(
zip_url "https://www.arcgis.com/",
"sharing/rest/content/items/",
"c7debafcef564e7a9dfb8ca881be4253/data"
)
# Setup a temporary folder to download into
<- tempdir()
tmp <- file.path(tmp, "postcodes.zip")
zip_path
# Download the zip file to the tempporary location
options(timeout = max(1000, getOption("timeout")))
download.file(zip_url, zip_path)
You can then unzip()
the CSV file inside {data.table}’s fread()
for a rapid read.
tic("CSV read complete")
<- fread(unzip(zip_path, files = "Data/NSPL21_FEB_2023_UK.csv"))
pcodes_dt toc()
CSV read complete: 10.482 sec elapsed
unlink(tmp) # remove temp location
And we can check the dimensions of this object.
# Rows and columns in the data set
dim(pcodes_dt)
[1] 2687274 41
So there’s more than 2.5 million rows. Some postcodes have, however, been terminated over time. We’ll need to filter for the postcodes that are still active (thanks to Robert Kaleta for pointing this out).
We can also simplify to just the postcode column that we want using {data.table}‘s .()
notation. Data in the pcds
column has the consistent form of letter, letter, digit, space, digit, letter, letter (e.g. ’AB12 3CD’), which makes them relatively easy to deal with.
# Filter for empty date of termination (doterm)
# Retain only the postcode column
<- pcodes_dt[is.na(doterm), .(pcds)]
pcodes_dt
head(pcodes_dt)
pcds
1: AB10 1AB
2: AB10 1AF
3: AB10 1AG
4: AB10 1AH
5: AB10 1AL
6: AB10 1AN
nrow(pcodes_dt)
[1] 1793019
You can see that this removes a large number of terminated postcodes.
Extract
Now to extract the numbers and letters so that ‘AB12 3CD’ is broken into A, B, 12, 3, C and D, for example. Note that we want to extract multi-digit numbers if they exist within each half (the ‘outward’ and ‘inward’ parts) of the postcode, so 12 rather than 1 and 2, and 12 and 3 rather than 123.
The walrus operator (:=
) is used here as a function to create new columns and assign names to them. I’ve chose to use {stringr}’s str_extract_all()
function to match the strings we want. The regular expression contains values in the curly braces to indicate the desired character lengths to be matched.
This will produce two list-columns: one with the letters extracted into it and one with the numbers.
# Extract letters into one list column and numbers into another
`:=`(letter = str_extract_all(pcds, "[:alpha:]{1}"),
pcodes_dt[, number = str_extract_all(pcds, "[:digit:]{1,2}"))]
pcodes_dt
pcds letter number
1: AB10 1AB A,B,A,B 10,1
2: AB10 1AF A,B,A,F 10,1
3: AB10 1AG A,B,A,G 10,1
4: AB10 1AH A,B,A,H 10,1
5: AB10 1AL A,B,A,L 10,1
---
1793015: ZE3 9JU Z,E,J,U 3,9
1793016: ZE3 9JW Z,E,J,W 3,9
1793017: ZE3 9JX Z,E,J,X 3,9
1793018: ZE3 9JY Z,E,J,Y 3,9
1793019: ZE3 9JZ Z,E,J,Z 3,9
Remember that {data.table} edits in place, so the pcodes_dt
object will be updated and without the need to overwrite it (i.e. no need to do something like pcodes_dt <- pcodes_dt[<whatever>]
).
Numbers and letters
Now to work with the number
list-column. The values are currently character-class because they were extracted from the postcode strings; they need to be made numeric before they can be summed. lapply()
is used here to pass the function as.numeric()
to achieve this.
tic("Make numbers numeric class")
:= lapply(number, as.numeric)]
pcodes_dt[, number toc()
Make numbers numeric class: 4.502 sec elapsed
And now to work with the letter
list column. The custom function in lapply()
first turns the letters into the factor class, where the full set of possible levels is provided by the LETTERS
vector, and then uses as.numeric()
to convert each factor level to its corresponding numeric value.
This works on the principle that as.numeric(factor(c("A", "B", "C")))
becomes c(1, 2, 3)
. The first factor level, A
gets converted to 1, B
to 2 and so on.
tic("Convert letters to numbers, make numeric class")
:= lapply(
pcodes_dt[, letter_number function(x) as.numeric(factor(x, levels = LETTERS)))]
letter, toc()
Convert letters to numbers, make numeric class: 11.431 sec elapsed
Scores
Now to separately sum the number and letter values in each row of the list-columns and add them together for the final score.
# Generate summation columns for letters and numbers separately
`:=`(number_sum = lapply(number, sum),
pcodes_dt[, letter_sum = lapply(letter_number, sum))]
# Make the sum columns numeric- rather than list-class
$number_sum <- as.numeric(pcodes_dt$number_sum)
pcodes_dt$letter_sum <- as.numeric(pcodes_dt$letter_sum)
pcodes_dt
# Sum the number and letter values
:= number_sum + letter_sum]
pcodes_dt[, score
# The first few scores
head(pcodes_dt[, .(pcds, number_sum, letter_sum, score)])
pcds number_sum letter_sum score
1: AB10 1AB 11 6 17
2: AB10 1AF 11 10 21
3: AB10 1AG 11 11 22
4: AB10 1AH 11 12 23
5: AB10 1AL 11 16 27
6: AB10 1AN 11 18 29
So you can see, for example, that AB10 1AB has a number sum of 11 (10 + 1) and a letter sum of 6 (a couple of As and Bs, so 1 + 2 + 1 + 2), totalling 17.
Results
Now to order the results, focus on the postcodes and scores alone, and preview the top and bottom scores (provided by default in {data.table}’s print method).
# Select cols and reorder by score
order(-score), .(pcds, score)] pcodes_dt[
pcds score
1: WV99 1ZZ 197
2: WV98 1ZZ 196
3: WV99 1YZ 196
4: WV99 1ZY 196
5: SS99 9YX 195
---
1793015: B1 2AA 7
1793016: BA1 0BA 7
1793017: BA1 1AA 7
1793018: BA2 0AA 7
1793019: BA1 0AA 6
So the top-scoring postcode was WV99 1ZZ with 197 points. It’s on an industrial estate in Telford, north-east of Birmingham. You can view it on Google Maps.
The lowest scoring postcodes were in Birmingham (Holloway Circus at B1 1BA and Arena Birmingham at B1 2AA) and Bath (near Bath Spa train station at BA1 1AA and south of Farmborough at BA2 0AA). They scored only 7.
The distribution of scores looks like this:
hist(
$score,
pcodes_dtxlab = "Score",
main = "Histogram of postcode scores"
)
It’s slightly skewed, with nearly 350,000 instances of scores between 60 and 70 and very few scores over 150.
Let’s check out the summary statistics.
summary(pcodes_dt$score)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 54.00 67.00 68.86 81.00 197.00
So the mean score is just under 70.
How does your score compare?
Environment
Session info
Last rendered: 2023-07-20 20:36:47 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tictoc_1.2 stringr_1.5.0 data.table_1.14.8
loaded via a namespace (and not attached):
[1] digest_0.6.33 fastmap_1.1.1 xfun_0.39 fontawesome_0.5.1
[5] magrittr_2.0.3 glue_1.6.2 knitr_1.43.1 htmltools_0.5.5
[9] rmarkdown_2.23 lifecycle_1.0.3 cli_3.6.1 vctrs_0.6.3
[13] compiler_4.3.1 rstudioapi_0.15.0 tools_4.3.1 evaluate_0.21
[17] yaml_2.3.7 rlang_1.1.1 jsonlite_1.8.7 htmlwidgets_1.6.2
[21] stringi_1.7.12
Footnotes
It originated from the social media team at a company controlled by one of the largest corporations in the world, so I don’t think it’s cynical to say that the whole thing was a marketing ploy.↩︎