10.1 Scrape HTML pages

Code
d1 = readLines(con = 'https://finance.yahoo.com/u/yahoo-finance/watchlists/most-active-small-cap-stocks')

## head(d1,2) ## huge mess of HTML code

Using rvest (https://rvest.tidyverse.org/):

Code
library(rvest)
d2 = read_html(
  'https://finance.yahoo.com/u/yahoo-finance/watchlists/most-active-small-cap-stocks'
  )

See https://rvest.tidyverse.org/articles/rvest.html for an intro to rvest.
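
As a minimal sketch of what can be done with a parsed document, the example below selects elements with CSS selectors and extracts their text. The inline HTML, the class name sym, and the symbols ABC and XYZ are made up for illustration, so the example runs without a network connection. The selectors needed for a real page will differ and have to be found by inspecting that page.

```r
library(rvest)

# Hypothetical inline page standing in for a downloaded one
page <- read_html('<html><body>
  <h1>Watchlist</h1>
  <ul><li class="sym">ABC</li><li class="sym">XYZ</li></ul>
</body></html>')

# html_element() returns the first match, html_elements() returns all matches
title <- page %>% html_element("h1") %>% html_text2()
symbols <- page %>% html_elements("li.sym") %>% html_text2()

title
symbols
```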

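To save the object created by read_html, you can use write_html from the xml2 package (installed as a dependency of rvest).

Code
## myfilename is a file path of your choice, e.g. ending in .html
xml2::write_html(d2, file = myfilename)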

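You can then use read_html again to load myfilename. Saving the object with saveRDS and restoring it with readRDS does not work: as soon as you use the restored object, for example with html_table, you will get an error like Error in xml_ns.xml_document(x) : external pointer is not valid. The object holds a pointer to a document in memory, and that pointer does not survive being written to disk and read back.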
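
The sketch below illustrates two ways to persist a parsed page that avoid the invalid pointer: writing an HTML file, or saving the HTML as plain text and parsing it again after loading. The inline table and the temporary file paths are assumptions made for illustration. The second approach is a suggested alternative, not something taken from the rvest documentation.

```r
library(rvest)

# Hypothetical inline page standing in for a downloaded one
page <- read_html("<html><body><table>
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ABC</td><td>1.5</td></tr>
</table></body></html>")

# Option 1: round trip through an HTML file
html_path <- tempfile(fileext = ".html")
xml2::write_html(page, file = html_path)
tbl_from_file <- html_table(read_html(html_path))[[1]]

# Option 2: save the HTML as a character string, then re-parse after loading
rds_path <- tempfile(fileext = ".rds")
saveRDS(as.character(page), rds_path)
tbl_from_rds <- html_table(read_html(readRDS(rds_path)))[[1]]

tbl_from_file
tbl_from_rds
```

Either way, what gets stored is the HTML text itself, and a fresh document object is built each time the data is loaded.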

The output of read_html doesn’t print as the full HTML code the way the result of readLines does. To get the full HTML code as a character vector:

Code
d3 = d2 %>% as.character() %>% strsplit(split = '\n') %>% unlist()

This won’t give exactly the same result as readLines. It seems to create more lines of code: some longer lines of HTML code are split into multiple entries in the vector.
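
The comparison can be tried offline with a small local file, as in the sketch below. The file content is made up for illustration, and the number of serialized lines depends on how the document is formatted when converted to text, so no particular count is claimed here. The last step checks that joining the lines and parsing them again still yields the same paragraph text.

```r
library(rvest)

# Hypothetical two-line HTML file written to a temporary path
raw_path <- tempfile(fileext = ".html")
writeLines(c("<html><head><title>Demo</title></head>",
             "<body><p>one</p><p>two</p></body></html>"), raw_path)

raw_lines <- readLines(raw_path)   # the file exactly as stored
parsed <- read_html(raw_path)      # parsed document
serialized_lines <- parsed %>% as.character() %>% strsplit(split = "\n") %>% unlist()

length(raw_lines)
length(serialized_lines)

# The content is unchanged: re-parsing the joined lines gives the same text
paras_parsed <- parsed %>% html_elements("p") %>% html_text2()
paras_reparsed <- read_html(paste(serialized_lines, collapse = "\n")) %>%
  html_elements("p") %>%
  html_text2()
```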

Code
length(d1)
## [1] 187
length(d3)
## [1] 1140
Code
d1[3]
## [1] "    <head>"
d2
## {html_document}
## <html lang="en-US" theme="light" data-color-scheme="light" class="desktop neo-green dock-upscale">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="atomic">\n        <div id="sda-E2E" class="sdaContainer tw-flex ...