1.2 Other preparation

1.2.1 Bookmarks

Bookmark these pages:

- These notes: https://bmacgtpm.github.io/notes/
- The GitHub page for these notes: https://github.com/bmacGTPM/notes. If you want to work with the R Markdown versions of these notes, you can find them in that GitHub repo. You can also ask questions, and create pull requests to add content or fix typos. In the main folder, there are several Rmd files:
  - index.Rmd. This is the first Rmd file and corresponds to the first section of the Notes https://bmacgtpm.github.io/notes/
  - All other Rmd files are numbered and appear in the Notes in the same order as the Rmd files. For example, the Rmd files for the dplyr and ggplot sections in the Appendix are
    - 99-01-appendix-dplyr.Rmd
    - 99-02-appendix-ggplot.Rmd
- The pubtheme GitHub page: https://github.com/bmacGTPM/pubtheme
- These books:
  - Beyond Multiple Linear Regression https://bookdown.org/roback/bookdown-BeyondMLR/
  - Regression and Other Stories https://avehtari.github.io/ROS-Examples/
  - Introduction to Statistical Learning https://www.statlearning.com/
- These resources

1.2.2 Backing up work

Most of your coding should take place in a source file or R Markdown file. Very little coding should occur in the console. This will make it far easier to reproduce what you did and back up your work.

There are a few options for backing up your work:

- Use a cloud service like OneDrive, Google Drive, Dropbox, Box, or iCloud, or use Time Machine, to automatically back up your files.
- If you are using Git/GitHub as part of your typical workflow (recommended), you’ll have a copy of your code in the “cloud” every time you push to GitHub.
- Manually back up your files to an external hard drive on a regular basis.
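
For the manual option, a small script can make the habit easier to keep. The following is only a sketch in base R: the helper name backup_scripts and the folder names are made up for illustration, and the demo uses temporary folders rather than a real external drive.

```r
# Hypothetical helper: copy every .R and .Rmd file in `project_dir` into a
# dated subfolder of `backup_root` (for example, a folder on an external drive).
backup_scripts <- function(project_dir, backup_root) {
  files <- list.files(project_dir, pattern = "\\.(R|Rmd)$", full.names = TRUE)
  dest  <- file.path(backup_root, format(Sys.Date(), "%Y-%m-%d"))
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  ok <- file.copy(files, dest, overwrite = TRUE)
  setNames(ok, basename(files))  # TRUE for each file that was copied
}

# Demo with temporary folders, so nothing on your machine is touched.
project <- file.path(tempdir(), "project")
backup  <- file.path(tempdir(), "backup")
dir.create(project, showWarnings = FALSE)
writeLines("x <- 1", file.path(project, "analysis.R"))

result <- backup_scripts(project, backup)
result
```

In practice you would point project_dir at your project folder and backup_root at the drive. This only copies files in the top level of the folder, so a project with subfolders would need recursive = TRUE in list.files() and some extra handling of paths.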

1.2.3 Test code often

Test your code by running your script or knitting your R Markdown file often. This will help you catch errors early and it will likely make it easier to troubleshoot the errors.
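
One way to run a whole script top to bottom is source(). The sketch below writes a tiny stand-in script to a temporary file (a placeholder for your own analysis script) and runs it in a new environment, so the objects it creates stay out of your global workspace. This is not full isolation, since the script can still see objects in the global workspace; restarting R and rerunning is the stricter test.

```r
# Write a tiny stand-in script to a temporary file.
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(1, 2, 3)",
  "avg <- mean(x)",
  "stopifnot(avg == 2)  # cheap sanity check inside the script"
), script)

# Run the script from the first line to the last. Any error stops the run
# at the line that caused it.
env <- new.env()
source(script, local = env)

# Objects created by the script live in `env`, not in the global workspace.
ls(env)
```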

1.2.4 dplyr and ggplot2

If you are unfamiliar with the tidyverse, and specifically the packages dplyr and ggplot2, there are two sections in the Appendix that give quick introductions to dplyr and ggplot2.

The end of each of those sections contains links to more in-depth resources.

1.2.5 Minimal reproducible examples

When asking someone for help in an email, on Slack, on Github, on a discussion board (e.g. stackoverflow.com), etc., use a minimal reproducible example.

- Minimal: use as little code as necessary to still produce the error.
- Reproducible: provide all code and data necessary so that someone can copy/paste and reproduce the problem on their own machine.

A minimal reproducible example makes it easier for someone to help you, and makes it easier to troubleshoot your own code. It might help an LLM help you also (I haven’t tried this yet).

Here is the Stack Overflow page on the topic: https://stackoverflow.com/help/minimal-reproducible-example
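
When the problem depends on your own data, one common way to make the example reproducible without attaching a file is dput(), which prints R code that recreates an object. The sketch below uses a few rows of the built-in mtcars data as a stand-in for a small slice of your own data.

```r
# A small slice of data that is enough to show the problem.
small <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that recreates the object. Paste that text into your
# question so the reader can rebuild the data with copy/paste.
dput(small)

# Check: evaluating the printed code gives back an equivalent data frame.
rebuilt <- eval(parse(text = capture.output(dput(small))))
all.equal(small, rebuilt)
```

Keeping the slice to a handful of rows and only the relevant columns also serves the "minimal" half of the advice.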

Here is an example. Suppose you are working on using game results data to create a schedule matrix and have this code:

library(pubtheme)
library(scales)
library(tidyverse)

d = readRDS('data/games.rds')

dg = d %>% 
  filter(lg == 'nba', 
         season %in% 2022, 
         season.type == 'reg') %>%
  group_by(away, 
           home, 
           season) %>%
  summarise(games = n()) %>%
  ungroup() %>%
  complete(away, 
           home, 
           fill = list(games = 0))

head(dg)

title = "Number of Games Between Each Pair of Teams" 
g = ggplot(dg, 
           aes(x = home,
               y = away, 
               fill = games))+ 
  geom_tile(linewidth = 0.4, 
            show.legend = T, 
            color = pubdarkgray) + ## used char above so leg is discrete
  scale_fill_manual(values = c(pubbackgray, 
                               publightred, 
                               pubred)) +
  labs(title    = title,
       x = 'Home Team', 
       y = 'Away Team')

g %>%
  pub(type = 'grid') +
  theme(axis.text.x.top = element_text(angle = 90, 
                                       vjust = .5,
                                       hjust = 0))

This results in the error Error: Continuous value supplied to discrete scale. Suppose we want help with this error. If the person we are asking has the games.rds data, then we can strip down the ggplot code as much as possible while still getting the error. Most lines of code can be removed because they are unrelated to the error. You can start by commenting out one line at a time. This still gives the same error:

title = "Number of Games Between Each Pair of Teams" 
g = ggplot(dg, 
           aes(x = home, 
               y = away, 
               fill = games))+ 
  geom_tile(linewidth = 0.4, 
            show.legend = T, 
            color = pubdarkgray) + 
  scale_fill_manual(values = c(pubbackgray, 
                               publightred, 
                               pubred)) #+
  # labs(title    = title,
  #      x = 'Home Team', 
  #      y = 'Away Team')  
  
# g %>%
#   pub(type = 'grid') +
#   theme(axis.text.x.top = element_text(angle = 90, 
#                                        vjust = .5,
#                                        hjust = 0))

g

So we can delete those lines of code. If we remove scale_fill_manual the error goes away, so we have to keep that in.

g = ggplot(dg, 
           aes(x = home, 
               y = away, 
               fill = games))+ 
  geom_tile(linewidth = 0.4, 
            show.legend = T, 
            color = pubdarkgray) + 
  scale_fill_manual(values = c(pubbackgray, 
                               publightred, 
                               pubred)) 
g

We can also try getting rid of some of the arguments, like linewidth, show.legend, and color, and we still get the error. Also, we can change the colors pubbackgray, publightred, and pubred to 'gray', 'lightpink', and 'red', which come with R and don’t require the pubtheme package.

g = ggplot(dg, 
           aes(x = home, 
               y = away, 
               fill = games))+ 
  geom_tile() + 
  scale_fill_manual(values = c('gray', 
                               'lightpink', 
                               'red')) 
g

That is about all we can remove and still get the error. So if the person we are asking has this data, they can copy/paste this code to their own computer and reproduce the error.
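
While stripping code down, it can help to check programmatically whether a given version still triggers the error. The sketch below is one possible way to do that. The helper plot_errors and the toy data frame are made up for illustration and stand in for the games data. Note that assigning a plot to g does not raise the error by itself; the error appears only when the plot is built or printed, which is why the helper calls ggplot_build().

```r
library(ggplot2)

# Hypothetical helper: TRUE if building the plot raises an error.
plot_errors <- function(p) {
  inherits(try(ggplot_build(p), silent = TRUE), "try-error")
}

# Made-up stand-in for the games data.
toy <- data.frame(home  = c("A", "B"),
                  away  = c("B", "A"),
                  games = c(1, 2))

g_min <- ggplot(toy, aes(x = home, y = away, fill = games)) +
  geom_tile()

# Compare the version with scale_fill_manual to the version without it.
with_manual    <- plot_errors(g_min + scale_fill_manual(values = c("gray", "red")))
without_manual <- plot_errors(g_min)
c(with_manual = with_manual, without_manual = without_manual)
```

A check like this makes it quick to confirm that a line is unrelated to the error before deleting it.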

If the person we are asking doesn’t have the data, then we should use a widely available data set. The mtcars data is available to anyone with R and can be used here.

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

library(tidyverse) 
dg = mtcars %>%
  group_by(cyl, vs) %>%
  summarise(mpg = mean(mpg), 
            .groups = 'keep')
head(dg)

g = ggplot(dg, aes(x = cyl, 
                   y = vs,
                   fill = mpg))+ 
  geom_tile() + 
  scale_fill_manual(values = c('gray', 
                               'lightpink', 
                               'red'))
g

Anyone, regardless of whether they have your data set, can copy/paste this code to diagnose. They will hopefully notice that mpg, the variable chosen to fill by, is continuous, while scale_fill_manual applies to discrete color scales. You’d have to use a continuous color scale with the continuous variable mpg. Something like this works:

library(tidyverse) 
dg = mtcars %>%
  group_by(cyl, vs) %>%
  summarise(mpg = mean(mpg), 
            .groups = 'keep')
head(dg)

g = ggplot(dg, 
           aes(x = cyl, 
               y = vs, 
               fill = mpg))+ 
  geom_tile() + 
  scale_fill_gradient(low = 'gray', 
                      high = 'red')
g

# A tibble: 5 × 3
# Groups:   cyl, vs [5]
    cyl    vs   mpg
  <dbl> <dbl> <dbl>
1     4     0  26  
2     4     1  26.7
3     6     0  20.6
4     6     1  19.1
5     8     0  15.1
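
The diagnosis above hinges on the type of the fill variable, so a quick type check is often the fastest way to spot this kind of problem. The following sketch uses base R's aggregate() in place of the dplyr pipeline, only so that it runs without loading any packages.

```r
# Same summary as above, computed with base R.
dg <- aggregate(mpg ~ cyl + vs, data = mtcars, FUN = mean)

# Numeric columns are treated as continuous by ggplot2, while character and
# factor columns are treated as discrete.
sapply(dg, class)

is.numeric(dg$mpg)                # the fill variable as it is
is.numeric(as.character(dg$mpg))  # the same values after conversion
```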

Now that we know how we can fix the error, let’s go back to our original plot and replace scale_fill_manual with scale_fill_gradient.

d = readRDS('data/games.rds')

dg = d %>% 
  filter(lg == 'nba', 
         season %in% 2022, 
         season.type == 'reg') %>%
  group_by(away, 
           home, 
           season) %>%
  summarise(games = n()) %>%
  ungroup() %>%
  complete(away, 
           home, 
           fill = list(games = 0))
head(dg)

title = "Number of Games Between Each Pair of Teams" 
g = ggplot(dg, 
           aes(x = home, 
               y = away, 
               fill = games))+ 
  geom_tile(linewidth = 0.4, 
            show.legend = T, 
            color = pubdarkgray) + 
  scale_fill_gradient(low = pubbackgray, 
                      high = pubred) +
  labs(title    = title,
       x = 'Home Team', 
       y = 'Away Team')

g %>%
  pub(type = 'grid') +
  theme(axis.text.x.top = element_text(angle = 90, 
                                       vjust = .5, 
                                       hjust = 0))

# A tibble: 6 × 4
  away  home  season games
  <chr> <chr> <chr>  <int>
1 ATL   ATL   <NA>       0
2 ATL   BKN   2022       1
3 ATL   BOS   2022       2
4 ATL   CHA   2022       2
5 ATL   CHI   2022       2
6 ATL   CLE   2022       2

Another option would have been to convert games to a character or factor and keep scale_fill_manual.

dg = dg %>%
  mutate(games = as.character(games))

title = "Number of Games Between Each Pair of Teams" 
g = ggplot(dg, 
           aes(x = home, 
               y = away, 
               fill = games))+ 
  geom_tile(linewidth = 0.4, 
            show.legend = T, 
            color = pubdarkgray) + ## used char above so leg is discrete
  scale_fill_manual(values = c(pubbackgray, 
                               publightred, 
                               pubred)) +
  labs(title    = title,
       x = 'Home Team', 
       y = 'Away Team')

g %>% 
  pub(type = 'grid') +
  theme(axis.text.x.top = element_text(angle = 90, 
                                       vjust = .5, 
                                       hjust = 0))

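
The same discrete approach can be tried on the mtcars version of the example, which anyone can run. In this sketch the mean mpg is binned into three categories with cut(), so that three colors are enough. The cut points and the labels low, mid, and high are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)

dg <- mtcars %>%
  group_by(cyl, vs) %>%
  summarise(mpg = mean(mpg), .groups = "drop") %>%
  # Illustrative cut points, chosen only for this example.
  mutate(mpg_bin = cut(mpg,
                       breaks = c(0, 18, 24, Inf),
                       labels = c("low", "mid", "high")))

g <- ggplot(dg, aes(x = cyl, y = vs, fill = mpg_bin)) +
  geom_tile() +
  # Naming the values ties each color to a level explicitly.
  scale_fill_manual(values = c(low = "gray", mid = "lightpink", high = "red"))
g
```

With a discrete scale, the number of colors supplied has to be at least the number of distinct levels in the fill variable, which is why binning (or having only a few distinct values, as with games) matters here.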

1.2.6 Study/work habits

The main thing I want to say here is that you should do your best to find long uninterrupted periods of time to study or work on your data analysis. Some things that may help:

- Turn off notifications on your phone and computer, including texts, emails, and social media.
- Put your phone out of reach, and not on the table right next to your keyboard, so that it is less tempting to pick it up.
- Close your browser or email client so it is less tempting to check.
- Turn off any shows you are streaming on your computer or TV.
- Music in your earphones can be OK. Most people suggest that music without lyrics is best. For musicians, playing any kind of music can be a distraction because they are tempted to listen to it closely. If I play music, I usually choose some soft classical music station on my streaming service. Sometimes I’ll play something louder and with lyrics if I need to get energized. If you need to drown out noise, you can find a white noise track.
- You might find yourself more productive if you have the option to stand sometimes. This could be done at a high-top table or at an adjustable sit/stand desk.