09-cs02-intro

Professor Shannon Ellis

UC San Diego
COGS 137 - Fall 2024

2024-11-05

Q&A

Q: I’m not entirely sure how to balance conveying all the useful information from the study to general audiences without over or under-explaining details. I feel like I’m going to get this done via practice.
A: This is absolutely why we do two case studies! We don’t expect you to get it perfect this first go around…but we also grade with that in mind and give feedback! We do expect students to improve by CS02 and incorporate our feedback going forward.

Q: I think it’s just confusing thinking about how to naturally intertwine the extension into meaningful analysis.
A: This is the harder thing and is very group specific. We’re happy to help as your group comes up with ideas.

Q: Not sure what to focus on for answering our main question
A: General advice: Use EDA to eliminate as many compounds as possible - which ones are definitely NOT good biomarkers of recent use. Then, once you have fewer matrix-compound combos to look at, do calculations on those remaining.

Q: How you would present some of your research to the public, what would you personally prioritize.
A: The most critical information while ALSO discussing the necessary nuance. For example, my fear with this work is that more people would be sent to jail/see consequences that should NOT. So, I wanted to always be sure to communicate findings as well as how people should NOT misinterpret/misuse these findings.

Course Announcements

Due Dates:

💻 CS01 due Wednesday
🔬 Lab 05 due Thursday
🔘 CS01 Team Eval survey due Thursday
📄 Final Project Proposal due Monday
📋 Lecture Participation survey “due” after class

Notes:

hw02 and lab04 feedback and scores posted
mid-course survey open until tonight

Mid-course survey (N=55; 57%)

Generally, things are going well, pacing is ok, and people are getting their questions answered.

Some great suggestions:

Give us time in class to meet groups right after assignment
Keep instructions in same document for completion
More text in notes
More cs01 guidance/time
include some details about R in the wild

Planning Ahead

Group Formation (CS02 + Final Project)

CS02 and Final Project are completed in groups (4-5) that you choose
- CS02 - we choose the topic
- Final project - you choose the topic
You specify that group by submitting your final project proposal (due Monday)
I will create repos for CS02 and final project after proposals are submitted
You will still submit feedback about your experience working with your group

The final project proposal link

Final Project

Carry out a data analysis on a topic and dataset of your choosing

not something that has already been done
pick a topic you care about
asking and answering a small-ish, specific question OR building a model to complete a useful task

Components

Proposal | week 7
Rough Draft of report | week 9
Peer Feedback | week 10
Final Deliverable | Tues of finals week
- report (.Rmd -> html)
- 3-5 min presentation*
- general audience communication

Notes:

Groups will have the option to record their presentation or present in class during week 10 (EC; all group members must present)
CS02 is also due week 9; these will be happening concurrently; that’s why you’re in the same group

Topic Brainstorming

Let’s brainstorm some topics together…

CS02: Predicting Air Pollution (Data)

Agenda

Background (pollutants, health impacts)
Question
Data Intro

Background

Why?

Why do we discuss background on case studies?

OpenCaseStudies

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com//opencasestudies/ocs-bp-air-pollution. Predicting Annual Air Pollution (Version v1.0.0).

Air Pollutants

Some sources are natural while others are anthropogenic (human-derived):

[source]

Major types of air pollutants

Gaseous - Carbon Monoxide (CO), Ozone (O₃), Nitrogen Oxides(NO, NO₂), Sulfur Dioxide (SO₂)
Particulate - small liquids and solids suspended in the air (includes lead- can include certain types of dust)
Dust - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
Biological - pollen, bacteria, viruses, mold spores

See here for more detail on the types of pollutants in the air.

Particulate Pollution

Air pollution particulates are generally described by their size:

Large Coarse Particulate Matter - has diameter of >10 micrometers (10 µm)
Coarse Particulate Matter (called PM_10-2.5) - has diameter of between 2.5 µm and 10 µm
Fine Particulate Matter (called PM_2.5) - has diameter of < 2.5 µm

PM₁₀ includes any particulate matter <10 µm (both coarse and fine particulate matter)

In relation to a piece of human hair:

[source]

Which do you think is most detrimental to our health? (and why?)

Common Pollutants and their size

[source]

Penetration into the human body

[source]

Negative Health Impacts

Exposure to air pollution is:

associated with higher rates of mortality in older adults
known to be a risk factor for many diseases and conditions including (but not limited to):

Asthma - fine particle exposure (PM_2.5) was found to be associated with higher rates of asthma in children
Inflammation in type 1 diabetes - fine particle exposure (PM_2.5) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with Type 1 diabetes
Lung function and emphysema - higher concentrations of ozone (O₃), nitrogen oxides (NO_x), black carbon, and fine particle exposure PM_2.5 , at study baseline were significantly associated with greater increases in percent emphysema per 10 years
Low birthweight - fine particle exposure(PM_2.5) was associated with lower birth weight in full-term live births
Viral Infection - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (PM_2.5)

See this review article for more information about sources of air pollution and the influence of air pollution on health.

Why to Care: The State of Global Air

The State of Global Air is a report released every year to communicate the impact of air pollution on public health.

The State of Global Air 2019 report (which uses data from 2017) stated that:

Air pollution is the fifth leading risk factor for mortality worldwide. It is responsible for more deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity. Each year, more people die from air pollution–related disease than from road traffic injuries or malaria.

[source]

In 2017, air pollution is estimated to have contributed to close to 5 million deaths globally — nearly 1 in every 10 deaths.

[source]

The State of Global Air 2018 report (using data from 2016) separated different types of air pollution & found that particulate pollution was particularly associated with mortality.

[source]

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

More than 90% of people worldwide live in areas exceeding the World Health Organization (WHO) Guideline for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.

[source]

Overall Improvement

Looking at the US specifically, air pollution levels are generally improving, with declining national air pollutant concentration averages as shown from the 2019 Our Nation’s Air report from the US Environmental Protection Agency (EPA):

[source]

Still An Issue

air pollution continues to contribute to health risk for Americans, in particular in regions with higher than national average rates of pollution that, at times, exceed the WHO’s recommended level.
important to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.

You can see that current air quality conditions at this website, and you will notice variation across different cities.

reports particulate values using what is called the Air Quality Index (AQI).
This calculator indicates that 138 AQI is equivalent to 50.5 ug/m³ and is considered unhealthy for sensitive individuals.
Thus, some areas exceed the WHO annual exposure guideline (10 ug/m³), and this may adversely affect the health of people living in these locations.

Adverse health effects

Adverse health effects have been associated with populations experiencing higher pollution exposure despite the levels being below suggested guidelines.
it appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. (For example, see this article for more details.)

Sparse monitoring PH issue

Historically, epidemiological studies would assess the influence of air pollution on health outcomes by relying on a number of monitors located around the country.
However, these monitors are relatively sparse in certain regions of the country and are not necessarily located near pollution sources.
dramatic differences in pollution rates can be seen even within the same city. (In fact, the term micro-environments describes environments within cities or counties which may vary greatly from one block to another.)

[source]

Lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.

Machine Learning offers a solution

An article published in the Environmental Health journal dealt with this issue by using data, including population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning (ML) methods.

[source]

The authors of this article state that:

“Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations.”

The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.

So…we’re going to do the same

Question

With what accuracy can we predict US annual average air pollution concentrations?

This will specifically be at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as, satellite pollution data and chemical modeling data.

Data

Monitor Data

Monitor data in this case study come from a system of monitors in which roughly 90% are located within cities.
There is an equity issue in terms of capturing the air pollution levels of more rural areas.
To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate air pollution levels in areas with little to no monitoring.
Specifically, these methods can be used to estimate air pollution in these low monitoring areas so that we can make a map like this where we have annual estimates for all of the contiguous US:

[source]

This is what we aim to achieve in this case study.

Limitations

The data do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.
Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. Researchers are now developing personal monitoring systems to track air pollution levels on the personal level.
Our analysis will use annual mean estimates of pollution levels, but these can vary greatly by season, day and even hour. There are data sources that have finer levels of temporal data; however, we are interested in long term exposures, as these appear to be the most influential for health outcomes.
These data are US-focused.

Supervised ML

Here, we’ll need:

A continuous outcome variable that we want to predict
A set of feature(s) (or predictor variables) that we use to predict the outcome variable

To build (or train) our model, we use both the outcome and features.

The goal is to identify informative features that can explain a large amount of variation in our outcome variable.

Using this model, we can then predict the outcome from new observations with the same features where have not observed the outcome.

(More details here)

Outcome

The monitor data that we will be using comes from gravimetric monitors (see picture below) operated by the US Environmental Protection Agency (EPA).

[image courtesy of Kirsten Koehler]

These monitors use a filtration system to specifically capture fine particulate matter.

[source]

The weight of this particulate matter is manually measured daily or weekly.

For the EPA standard operating procedure for PM gravimetric analysis in 2008, we refer the reader to here.

In our data set, the value column indicates the PM_2.5 monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors.

The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m³).

Recall the WHO exposure guideline is < 10 ug/m³ on average annually for PM_2.5.

Data Import

All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change. (Roger now works at UT Austin)

We have one CSV file that contains both our single outcome variable and all of our features (or predictor variables). You can download this file using the OCSdata package:

# install.packages("OCSdata")
OCSdata::raw_data("ocs-bp-air-pollution", outpath = getwd())

pm <- read_csv("OCS_data/data/raw/pm25_data.csv")

Get The Data

Read the data in so you can follow along now!

PM 2.5 Data

876 monitors
40 columns
- value | outcome variable

pm |>
  glimpse()

Rows: 876
Columns: 50
$ id                          <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips                        <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta                        <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…

There are 48 features with values for each of the 876 monitors (observations).

The data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

Features

Variable	Details
id	Monitor number – the county number is indicated before the decimal – the monitor number is indicated after the decimal Example: 1073.0023 is Jefferson county (1073) and .0023 one of 8 monitors
fips	Federal information processing standard number for the county where the monitor is located – 5 digit id code for counties (zero is often the first value and sometimes is not shown) – the first 2 numbers indicate the state – the last three numbers indicate the county Example: Alabama’s state code is 01 because it is first alphabetically (note: Alaska and Hawaii are not included because they are not part of the contiguous US)
Lat	Latitude of the monitor in degrees
Lon	Longitude of the monitor in degrees
state	State where the monitor is located
county	County where the monitor is located
city	City where the monitor is located
CMAQ	Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ) – A monitoring system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution – *Does not use any of the PM_2.5 gravimetric monitoring data.* (There is a version that does use the gravimetric monitoring data, but not this one!) – Data from the EPA
zcta	Zip Code Tabulation Area where the monitor is located – Postal Zip codes are converted into “generalized areal representations” that are non-overlapping – Data from the 2010 Census
zcta_area	Land area of the zip code area in meters squared – Data from the 2010 Census
zcta_pop	Population in the zip code area – Data from the 2010 Census
imp_a500	Impervious surface measure – Within a circle with a radius of 500 meters around the monitor – Impervious surface are roads, concrete, parking lots, buildings – This is a measure of development
imp_a1000	Impervious surface measure – Within a circle with a radius of 1000 meters around the monitor
imp_a5000	Impervious surface measure – Within a circle with a radius of 5000 meters around the monitor
imp_a10000	Impervious surface measure – Within a circle with a radius of 10000 meters around the monitor
imp_a15000	Impervious surface measure – Within a circle with a radius of 15000 meters around the monitor
county_area	Land area of the county of the monitor in meters squared
county_pop	Population of the county of the monitor
Log_dist_to_prisec	Log (Natural log) distance to a primary or secondary road from the monitor – Highway or major road
log_pri_length_5000	Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) – Highways only
log_pri_length_10000	Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) – Highways only
log_pri_length_15000	Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) – Highways only
log_pri_length_25000	Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) – Highways only
log_prisec_length_500	Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_1000	Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_5000	Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_10000	Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_15000	Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) – Highway and secondary roads
log_prisec_length_25000	Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) – Highway and secondary roads
log_nei_2008_pm25_sum_10000	Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
log_nei_2008_pm25_sum_15000	Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
log_nei_2008_pm25_sum_25000	Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_10000	Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_15000	Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_25000	Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
popdens_county	Population density (number of people per kilometer squared area of the county)
popdens_zcta	Population density (number of people per kilometer squared area of zcta)
nohs	Percentage of people in zcta area where the monitor is that do not have a high school degree – Data from the Census
somehs	Percentage of people in zcta area where the monitor whose highest formal educational attainment was some high school education – Data from the Census
hs	Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing a high school degree – Data from the Census
somecollege	Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing some college education – Data from the Census
associate	Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing an associate degree – Data from the Census
bachelor	Percentage of people in zcta area where the monitor whose highest formal educational attainment was a bachelor’s degree – Data from the Census
grad	Percentage of people in zcta area where the monitor whose highest formal educational attainment was a graduate degree – Data from the Census
pov	Percentage of people in zcta area where the monitor is that lived in poverty in 2008 - or would it have been 2007 guidelines??https://aspe.hhs.gov/2007-hhs-poverty-guidelines – Data from the Census
hs_orless	Percentage of people in zcta area where the monitor whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
urc2013	2013 Urban-rural classification of the county where the monitor is located – 6 category variable - 1 is totally urban 6 is completely rural – Data from the National Center for Health Statistics
urc2006	2006 Urban-rural classification of the county where the monitor is located – 6 category variable - 1 is totally urban 6 is completely rural – Data from the National Center for Health Statistics
aod	Aerosol Optical Depth measurement from a NASA satellite – based on the diffraction of a laser – used as a proxy of particulate pollution – unit-less - higher value indicates more pollution – Data from NASA

Many of these features have to do with the circular area around the monitor called the “buffer”. These are illustrated in the following figure:

[source]

Your Turn

Scroll through the data, consider 1) if these data are in tidy format and 2) which variables you think may be most helpful in building a model to predict annual air pollution.