---
title: "Introducing educationdata"
author: "Kyle Ueyama"
date: "2021-05-26"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introducing educationdata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `educationdata` package allows the user to retrieve data from the Urban 
Institute's [Education Data API](https://educationdata.urban.org/) as a 
`data.frame` for analysis. The package contains one major function, 
`get_education_data`, which will get data from a specified API endpoint and 
return a `data.frame` to the user.

**NOTE**: By downloading and using this programming package, you agree to abide
by the Data Policy and Terms of Use of the Education Data Portal. For more 
information, see [https://educationdata.urban.org/documentation/#terms](https://educationdata.urban.org/documentation/#terms).

## Usage

The `get_education_data` function will return a `data.frame` from a call to 
the Education Data API.  

```{r usage-01, eval=FALSE}
library(educationdata)
get_education_data(level, source, topic, by, filters, add_labels, csv)
```

where:

* `level` (required) - API data level to query.
* `source` (required) - API data source to query.
* `topic` (required) - API data topic to query.
* `by` (optional) - `list` of grouping parameters for an API call.
* `filters` (optional) - `list` of filters used to subset the results of an 
  API call.
* `add_labels` (optional) - Add variable labels as factors (when applicable)? 
  Defaults to `FALSE`.
* `csv` (optional) - Download the full csv file? Defaults to `FALSE`.

This simple example will obtain 'college-university' `level` data from the 
'ipeds' `source` for the 'student-faculty-ratio' `topic`:

```{r usage-02, message=FALSE}
library(educationdata)

df <- get_education_data(
  level = 'college-university',
  source = 'ipeds',
  topic = 'student-faculty-ratio'
)

head(df)
```

A somewhat more complex example will obtain 'schools' `level` data from the 
'ccd' `source` for the 'enrollment' `topic`, broken out `by` 'race' and 'sex'. 
The API query is subset with `filters` for the 'year' 2008, 'grade' 9 through 
12, and an 'ncessch' code of 340606000122. Finally, the `add_labels` flag will 
map integer codes to their factor labels ('race' and 'sex' in this instance).

```{r usage-03, message=FALSE}
library(educationdata)

df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment', 
                         by = list('race', 'sex'),
                         filters = list(year = 2008,
                                        grade = 9:12,
                                        ncessch = '340606000122'),
                         add_labels = TRUE)

head(df)
```

## Available Endpoints

```{r endpoints, echo=FALSE}
source('../R/get-endpoint-info.R')
df <- get_endpoint_info("https://educationdata.urban.org")

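# Tidy the years_available text: drop the word 'and', then repair the
# mojibake left by a mis-decoded en dash (it shows up as the characters
# U+00E2, U+20AC, U+201C) by turning it into a plain hyphen.
df$years_available <- gsub('and', '', df$years_available)
df$years_available <- gsub('\u20AC', '-', df$years_available)
df$years_available <- gsub('\u00E2', '', df$years_available)
df$years_available <- gsub('\u201C', '', df$years_available)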
df$optional_vars <- lapply(df$optional_vars, 
                           function(x) paste(x, collapse = ', '))
df$required_vars <- lapply(df$required_vars, 
                           function(x) paste(x, collapse = ', '))
df <- df[order(df$endpoint_url), ]

vars <- c('section', 
          'class_name', 
          'topic', 
          'optional_vars',
          'required_vars',
          'years_available')

knitr::kable(df[vars], 
             col.names = c('Level', 
                           'Source', 
                           'Topic', 
                           'By',
                           'Main Filters',
                           'Years Available'),
             row.names = FALSE)
```

## Main Filters

Due to the way the API is set up, the variables listed under 'Main Filters'
are often the fastest way to subset an API call.

In addition to `year`, the other main filters for certain endpoints 
accept the following values:

### Grade

| Filter Argument | Grade |
|-------------------|-------|
| `grade = 'grade-pk'` | Pre-K  |
| `grade = 'grade-k'`  | Kindergarten  |
| `grade = 'grade-1'` | Grade 1  |
| `grade = 'grade-2'` | Grade 2  |
| `grade = 'grade-3'` | Grade 3  |
| `grade = 'grade-4'` | Grade 4  |
| `grade = 'grade-5'` | Grade 5  |
| `grade = 'grade-6'` | Grade 6  |
| `grade = 'grade-7'` | Grade 7  |
| `grade = 'grade-8'` | Grade 8  |
| `grade = 'grade-9'` | Grade 9  |
| `grade = 'grade-10'` | Grade 10  |
| `grade = 'grade-11'` | Grade 11  |
| `grade = 'grade-12'` | Grade 12  |
| `grade = 'grade-13'` | Grade 13  |
| `grade = 'grade-14'` | Adult Education |
| `grade = 'grade-15'` | Ungraded  |
| `grade = 'grade-16'` | K-12  |
| `grade = 'grade-20'` | Grades 7 and 8  |
| `grade = 'grade-21'` | Grades 9 and 10  |
| `grade = 'grade-22'` | Grades 11 and 12  |
| `grade = 'grade-99'` | Total  |
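
The grade strings in the table follow a regular `grade-<code>` pattern, so 
they can be generated rather than typed by hand. A minimal sketch in base R; 
`to_grade_filter` is a hypothetical helper written for this illustration and 
is not a function exported by `educationdata`:

```{r sketch-grade-filter}
# Hypothetical helper (not part of educationdata): build `grade` filter
# strings of the form 'grade-<code>' from grade codes.
to_grade_filter <- function(grades) {
  paste0('grade-', tolower(as.character(grades)))
}

to_grade_filter(c('PK', 'K'))
to_grade_filter(9:12)
```

The result could then be passed as `filters = list(grade = to_grade_filter(9:12))`. 
The examples in this vignette pass integers such as `grade = 9:12` directly, 
so a helper like this is only useful if you prefer the explicit string form.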

### Level of Study

| Filter Argument | Level of Study |
|-------------------|----------------|
| `level_of_study = 'undergraduate'` | Undergraduate |
| `level_of_study = 'graduate'` | Graduate |
| `level_of_study = 'first-professional'` | First Professional |
| `level_of_study = 'post-baccalaureate'` | Post-baccalaureate |
| `level_of_study = '99'` | Total |


## Examples

Let's build up some examples from the following set of endpoints.

```{r example-endpoints, echo = FALSE}
df <- df[df$section == 'schools' & df$topic == 'enrollment', ]

knitr::kable(df[vars], 
             col.names = c('Level', 
                           'Source', 
                           'Topic', 
                           'By',
                           'Main Filters',
                           'Years Available'),
             row.names = FALSE)
```


The following will return a `data.frame` across all years and grades:

```{r example-01, eval=FALSE}
library(educationdata)
df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment')
```

Note that this endpoint is also callable `by` certain variables:

* race
* sex
* race, sex

These variables can be added to the `by` argument:

```{r example-02, eval=FALSE}
df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment', 
                         by = list('race', 'sex'))
```

You may also filter the results of an API call. In this case `year` and 
`grade` will provide the most time-efficient subsets, and both accept vectors 
of values:

```{r example-03, eval=FALSE}
df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment', 
                         by = list('race', 'sex'),
                         filters = list(year = 1988:1990,
                                        grade = 6:8))
```
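
The call above asks for every pairing of the three years with the three 
grades. A rough way to see how many year/grade combinations a set of 
vectorized filters covers, using only base R. This is an illustration of the 
request's scope, not a description of how the package works internally:

```{r sketch-filter-combinations}
# Illustration only: enumerate the year/grade pairings implied by
# filters = list(year = 1988:1990, grade = 6:8).
years <- 1988:1990
grades <- 6:8

combos <- expand.grid(year = years, grade = grades)
nrow(combos)
head(combos, 3)
```

If each pairing costs roughly the same to retrieve (an assumption, not 
something this vignette confirms), widening either vector multiplies the work, 
which is one reason to keep `year` and `grade` as narrow as the analysis allows.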

Additional variables can also be passed to `filters` to subset further:

```{r example-04, eval=FALSE}
df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment', 
                         by = list('race', 'sex'),
                         filters = list(year = 1988:1990,
                                        grade = 6:8,
                                        ncessch = '010000200277'))
```

The `add_labels` flag converts coded variables to `factor`s, using their 
labels from the API:

```{r example-05, eval=FALSE}
df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment', 
                         by = list('race', 'sex'),
                         filters = list(year = 1988:1990,
                                        grade = 6:8,
                                        ncessch = '010000200277'),
                         add_labels = TRUE)
```
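
To make the effect of labelling concrete, here is a small base R sketch of 
turning integer codes into a `factor`. The codes and label names below are 
invented for illustration; the real mappings come from the API and are applied 
by the package when `add_labels = TRUE`:

```{r sketch-add-labels}
# Toy codes and labels, made up for this example. They do not reflect the
# actual code books used by the Education Data API.
codes <- c(1, 2, 2, 1)
labelled <- factor(codes, levels = c(1, 2), labels = c('Label A', 'Label B'))

labelled
as.integer(labelled)
```

A labelled column prints its text labels, which is convenient for tables and 
plots, while the underlying integer positions remain available through 
`as.integer()`.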

Finally, the `csv` flag can be set to download the full `.csv` file. In 
general, the `csv` functionality is much faster when retrieving the full data 
set (or a large subset) and much slower when retrieving a small subset 
(especially one with many `filters`). In this example, the full `csv` for each 
requested year (1988 through 1990) must be downloaded and then subset to the 
few rows that match the `filters`.

```{r example-06, eval=FALSE}
df <- get_education_data(level = 'schools', 
                         source = 'ccd', 
                         topic = 'enrollment', 
                         by = list('race', 'sex'),
                         filters = list(year = 1988:1990,
                                        grade = 6:8,
                                        ncessch = '010000200277'),
                         add_labels = TRUE,
                         csv = TRUE)
```
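
The "download everything, then subset" step can be pictured with a toy 
`data.frame`. This sketch uses made-up rows and only three columns named after 
the filters above; it is meant to show the kind of local filtering involved, 
not the package's actual code:

```{r sketch-local-subset}
# Toy stand-in for a downloaded csv. The rows are invented, and the second
# ncessch value is a placeholder, not a real school ID.
toy <- data.frame(
  year = rep(1988:1990, each = 4),
  grade = rep(c(5, 6, 7, 8), times = 3),
  ncessch = rep(c('010000200277', '999999999999'), times = 6),
  stringsAsFactors = FALSE
)

# Keep only the rows that satisfy every filter.
keep <- toy$year %in% 1988:1990 &
  toy$grade %in% 6:8 &
  toy$ncessch == '010000200277'

toy_subset <- toy[keep, ]
nrow(toy_subset)
```

Most of the rows are discarded after the download, which is why `csv = TRUE` 
tends to pay off only when you want most of the file anyway.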