10.1 Scrape HTML pages
HTML pages can be scraped using rvest (https://rvest.tidyverse.org/). See https://rvest.tidyverse.org/articles/rvest.html for an intro to rvest.
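As a minimal sketch of the basic workflow (not necessarily the code originally shown here), the example below parses a small inline HTML string so that it runs offline; in practice read_html would usually be given a URL or a file path. The HTML content and the variable names (example_html, page, tables, first_table, headings) are made up for illustration.

```r
library(rvest)

# A small inline HTML document stands in for a real page (illustrative only).
# read_html() also accepts a URL or a path to a local file.
example_html <- "<html><body>
  <h1>Example</h1>
  <table>
    <tr><th>name</th><th>value</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
</body></html>"

page <- read_html(example_html)

# html_table() on a whole document returns a list with one data frame per <table>
tables <- html_table(page)
first_table <- tables[[1]]

# html_elements() selects nodes with a CSS selector; html_text2() extracts their text
headings <- html_text2(html_elements(page, "h1"))
```

Here first_table holds the two data rows of the table and headings holds the text of the h1 element.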
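To save the object created using read_html, you can use write_html. You can then use read_html again to load the saved file (myfilename). If you instead use saveRDS followed by readRDS, and then try to do something with the reloaded object, such as calling html_table, you will get an error like "Error in xml_ns.xml_document(x) : external pointer is not valid". This happens because the object only holds a pointer to the parsed document in memory, and that pointer is not preserved when the object is serialized.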
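The round trip described above might look like the following sketch. It assumes write_html is taken from the xml2 package (loaded explicitly here), uses a temporary file as a stand-in for myfilename, and parses a made-up inline HTML string so that it runs offline. The saveRDS attempt is wrapped in tryCatch so that the error message is captured rather than stopping the script.

```r
library(rvest)
library(xml2)  # write_html() is provided by xml2

# Made-up inline HTML so the example does not need a network connection
page <- read_html(
  "<html><body><table><tr><th>x</th></tr><tr><td>1</td></tr></table></body></html>"
)

# Save the parsed page as an HTML file, then load it again with read_html()
myfilename <- tempfile(fileext = ".html")  # stand-in path for illustration
write_html(page, myfilename)
page_reloaded <- read_html(myfilename)
reloaded_table <- html_table(page_reloaded)[[1]]

# saveRDS()/readRDS() does not preserve the underlying pointer, so using the
# reloaded object is expected to fail; the message is captured instead of thrown
rds_file <- tempfile(fileext = ".rds")
saveRDS(page, rds_file)
page_from_rds <- readRDS(rds_file)
rds_result <- tryCatch(
  html_table(page_from_rds),
  error = function(e) conditionMessage(e)
)
```

The reloaded_table object should match the table in the original page, while rds_result should hold the error message rather than a list of tables.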
The output of read_html doesn't immediately look like the full HTML code that scan gives. To get the full HTML code, convert the parsed document back to text. This won't give exactly the same result as scan: it seems to create more lines of code, with some longer lines of HTML code split into multiple entries in the vector.
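One possible way to get the HTML text back out of the parsed object, offered as an assumption since the original code is not shown, is to serialize it with as.character and split the result on newlines. The inline HTML and the names html_string and html_lines are illustrative.

```r
library(rvest)

# Made-up inline HTML so the example runs offline
page <- read_html(
  "<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>"
)

# Serialize the parsed document back into a single string of HTML
html_string <- as.character(page)

# Split on newlines to get a character vector with one element per line
html_lines <- strsplit(html_string, "\n", fixed = TRUE)[[1]]

# Number of lines after serialization; this need not match what scan() returns
# for the raw file, because the serializer may reformat the markup
n_lines <- length(html_lines)
```

Because the document is re-serialized rather than read verbatim, line breaks can fall in different places than in the original file, which is consistent with the differing line counts noted above.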
[1] 187
[1] 1140
[1] " <head>"
{html_document}
<html lang="en-US" theme="light" data-color-scheme="light" class="desktop neo-green dock-upscale">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body id="atomic">\n <div id="sda-E2E" class="sdaContainer tw-flex ...