tl;dr
If you webscrape with R, you should use the {polite} package. It helps you respect website terms by seeking permission before you scrape.
Ahoy-hoy
Ah, salutations, and welcome to this blog post about polite web scraping. Please do come in. I’ll take your coat. How are you? Would you like a cup of tea? Oh, I insist!
Speaking of tea, perhaps you’d care to join me in genial conversation about it. Where to begin? Let’s draw inspiration from popular posts on the Tea subreddit of Reddit. I’ll fetch the post titles using the {rvest} package from Hadley Wickham and get the correct CSS selector using SelectorGadget by Andrew Cantino and Kyle Maxwell.
# Load some packages we need
library(rvest) # scrape a site
library(dplyr) # data manipulation

# CSS for post titles found using SelectorGadget
# (This is a bit of an odd one)
css_select <- "._3wqmjmv3tb_k-PROt7qFZe ._eYtD2XCVieq6emjKBH3m"

# Scrape a specific named page
tea_scrape <- read_html("https://www.reddit.com/r/tea") %>% # read the page
  html_nodes(css = css_select) %>% # read post titles
  html_text() # convert to text

print(tea_scrape)
[1] "What's in your cup? Daily discussion, questions and stories - September 08, 2019"
[2] "Marketing Monday! - September 02, 2019"
[3] "Uncle Iroh asking the big questions."
[4] "The officially licensed browser game of Game of Thrones has launched! Millions of fans have put themselves into the battlefield! What about you?"
[5] "They mocked me. They said that I was a fool for drinking leaf water."
[6] "100 years old tea bush on my estate in Uganda."
[7] "Cold brew colors"
[8] "Finally completed the interior of my tea house only needed a fire minor touches not now it’s perfect, so excited to have this as a daily tea spot"
That’ll provide us with some conversational fodder, wot wot.
It costs nothing to be polite
Mercy! I failed to doff my cap adequately before entering the website! They must take me for some sort of street urchin.
Forgive me. Perhaps you’ll allow me to show you a more respectful method via the {polite} package in development from the esteemed gentleman Dmytro Perepolkin? An excellent way ‘to promote responsible web etiquette’.
A reverential bow()
Perhaps the website owners don’t want people to keep barging in willy-nilly without so much as an ‘ahoy-hoy’.
We should identify ourselves and our intent with a humble bow(). We can expect a curt but informative response from the site, via its robots.txt file, that tells us where we can visit and how frequently.
# remotes::install_github("dmi3kno/polite") # to install
library(polite) # respectful webscraping
# Make our intentions known to the website
reddit_bow <- bow(
  url = "https://www.reddit.com/", # base URL
  user_agent = "M Dray <https://rostrum.blog>", # identify ourselves
  force = TRUE
)
print(reddit_bow)
## <polite session> https://www.reddit.com/
## User-agent: M Dray <https://rostrum.blog>
## robots.txt: 32 rules are defined for 4 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
Super-duper. The (literal) bottom line is that we’re allowed to scrape. The website does have 32 rules to stop unruly behaviour though, and even calls out four very naughty bots that are obviously not very polite. We’re invited to give a five-second delay between requests to allow for maximum respect.
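A brief aside that isn’t from the original post: we can watch that crawl delay being enforced. The sketch below times two polite requests to different paths (r/coffee is just an arbitrary second subreddit); since {polite} handles the waiting for us, the pair should take at least five seconds.

# An aside, not from the original post: time two polite requests.
# {polite} enforces the crawl delay itself; no Sys.sleep() required.
t0 <- Sys.time()
tea_page    <- scrape(nod(reddit_bow, path = "r/tea"))    # first request
coffee_page <- scrape(nod(reddit_bow, path = "r/coffee")) # arbitrary second path
Sys.time() - t0 # expect at least five seconds for the pair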
Give a nod()
Ahem, conversation appears to be wearing a little thin; perhaps I can interest you by widening the remit of our chitchat? Rather than merely iterating through subpages of the same subreddit, we can visit the front pages of a few different subreddits. Let’s celebrate the small failures and triumphs of being British, a classic topic of polite conversation in Britain.
We’ve already given a bow() and made our intentions clear; a knowing nod() will be sufficient for the next steps. Here’s a little function to nod() to the site each time we iterate over a vector of subreddit names. Our gentlemanly agreement remains intact from our earlier bow().
library(purrr) # functional programming tools
library(tidyr) # tidy-up data structure
get_posts <- function(subreddit_name, bow = reddit_bow, css_select){

  # 1. Agree modification of session path with host
  session <- nod(
    bow = bow,
    path = paste0("r/", subreddit_name)
  )

  # 2. Scrape the page from the new path
  scraped_page <- scrape(session)

  # 3. Extract the post titles that match the CSS selector
  node_result <- html_nodes(
    scraped_page,
    css = css_select
  )

  # 4. Render result as text
  text_result <- html_text(node_result)

  # 5. Return the text value
  return(text_result)

}
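Before we apply the function more widely, a quick check that wasn’t in the original post: call the helper for the r/tea subreddit we visited earlier, reusing the css_select string defined in the first code block.

# A smoke test, not in the original post: one subreddit, reusing css_select
tea_posts <- get_posts("tea", css_select = css_select)
head(tea_posts) # should resemble the r/tea titles printed earlier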
Smashing. Care to join me in applying this function over a vector of subreddit names? Tally ho.
# A vector of subreddits to iterate over
subreddits <- set_names(c("BritishProblems", "BritishSuccess"))
# Get top posts for named subreddits
top_posts <- map_df(
  subreddits,
  ~get_posts(.x, css_select = "._3wqmjmv3tb_k-PROt7qFZe ._eYtD2XCVieq6emjKBH3m")
) %>%
  gather(
    BritishProblems, BritishSuccess,
    key = subreddit, value = post_text
  )

knitr::kable(top_posts)
Bravo, what excellent manners we’ve demonstrated. You can also iterate over different query strings – for example if your target website displays information over several subpages – with the params argument of the scrape() function.
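As a hedged illustration (not from the original post), that might look something like the sketch below; the query strings are invented, and the exact format of params can differ between {polite} versions, with newer releases preferring a named-list query argument.

# A sketch, not from the original post: the query strings are invented,
# and the 'params' format may differ between {polite} versions.
session <- nod(reddit_bow, path = "r/tea")
subpages <- map(
  c("?count=25", "?count=50"),  # hypothetical query strings
  ~scrape(session, params = .x) # one respectful request per subpage
)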
Oh, you have to leave? No, no, you haven’t overstayed your welcome! It was truly marvellous to see you. Don’t forget your brolly, old chap, and don’t forget to print the session info for this post. Pip pip!
Environment
Session info
Last rendered: 2023-08-02 23:36:13 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.2 compiler_4.3.1 fastmap_1.1.1 cli_3.6.1
[5] tools_4.3.1 htmltools_0.5.5 rstudioapi_0.15.0 yaml_2.3.7
[9] rmarkdown_2.23 knitr_1.43.1 jsonlite_1.8.7 xfun_0.39
[13] digest_0.6.33 rlang_1.1.1 evaluate_0.21