UC San Diego
COGS 137 - Fall 2024
2024-11-05
Q: I’m not entirely sure how to balance conveying all the useful information from the study to general audiences without over or under-explaining details. I feel like I’m going to get this done via practice.
A: This is absolutely why we do two case studies! We don’t expect you to get it perfect this first go around…but we also grade with that in mind and give feedback! We do expect students to improve by CS02 and incorporate our feedback going forward.
Q: I think it’s just confusing thinking about how to naturally intertwine the extension into meaningful analysis.
A: This is the harder thing and is very group specific. We’re happy to help as your group comes up with ideas.
Q: Not sure what to focus on for answering our main question
A: General advice: Use EDA to eliminate as many compounds as possible - which ones are definitely NOT good biomarkers of recent use. Then, once you have fewer matrix-compound combos to look at, do calculations on those remaining.
Q: How you would present some of your research to the public, what would you personally prioritize.
A: The most critical information while ALSO discussing the necessary nuance. For example, my fear with this work is that more people would be sent to jail/see consequences that should NOT. So, I wanted to always be sure to communicate findings as well as how people should NOT misinterpret/misuse these findings.
Due Dates:
Notes:
Generally, things are going well, pacing is ok, and people are getting their questions answered.
Some great suggestions:
The final project proposal link
Carry out a data analysis on a topic and dataset of your choosing
Components
Notes:
Let’s brainstorm some topics together…
Why do we discuss background on case studies?
Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com//opencasestudies/ocs-bp-air-pollution. Predicting Annual Air Pollution (Version v1.0.0).
Some sources are natural while others are anthropogenic (human-derived):
See here for more detail on the types of pollutants in the air.
Air pollution particulates are generally described by their size:
Large Coarse Particulate Matter - has diameter of >10 micrometers (10 µm)
Coarse Particulate Matter (called PM10-2.5) - has diameter of between 2.5 µm and 10 µm
Fine Particulate Matter (called PM2.5) - has diameter of < 2.5 µm
PM10 includes any particulate matter <10 µm (both coarse and fine particulate matter)
In relation to a piece of human hair:
Which do you think is most detrimental to our health? (and why?)
Exposure to air pollution is:
See this review article for more information about sources of air pollution and the influence of air pollution on health.
The State of Global Air is a report released every year to communicate the impact of air pollution on public health.
The State of Global Air 2019 report (which uses data from 2017) stated that:
Air pollution is the fifth leading risk factor for mortality worldwide. It is responsible for more deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity. Each year, more people die from air pollution–related disease than from road traffic injuries or malaria.
In 2017, air pollution is estimated to have contributed to close to 5 million deaths globally — nearly 1 in every 10 deaths.
The State of Global Air 2018 report (using data from 2016) separated different types of air pollution & found that particulate pollution was particularly associated with mortality.
The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:
More than 90% of people worldwide live in areas exceeding the World Health Organization (WHO) Guideline for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.
Looking at the US specifically, air pollution levels are generally improving, with declining national air pollutant concentration averages as shown from the 2019 Our Nation’s Air report from the US Environmental Protection Agency (EPA):
You can see that current air quality conditions at this website, and you will notice variation across different cities.
Lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.
An article published in the Environmental Health journal dealt with this issue by using data, including population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning (ML) methods.
The authors of this article state that:
“Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations.”
The article above demonstrates that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.
So…we’re going to do the same
With what accuracy can we predict US annual average air pollution concentrations?
This will specifically be at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as, satellite pollution data and chemical modeling data.
Here, we’ll need:
To build (or train) our model, we use both the outcome and features.
The goal is to identify informative features that can explain a large amount of variation in our outcome variable.
Using this model, we can then predict the outcome from new observations with the same features where have not observed the outcome.
(More details here)
The monitor data that we will be using comes from gravimetric monitors (see picture below) operated by the US Environmental Protection Agency (EPA).
[image courtesy of Kirsten Koehler]
These monitors use a filtration system to specifically capture fine particulate matter.
The weight of this particulate matter is manually measured daily or weekly.
For the EPA standard operating procedure for PM gravimetric analysis in 2008, we refer the reader to here.
In our data set, the value
column indicates the PM2.5 monitor average for 2008 in mass of fine particles/volume of air for 876 gravimetric monitors.
The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m3).
Recall the WHO exposure guideline is < 10 ug/m3 on average annually for PM2.5.
All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change. (Roger now works at UT Austin)
We have one CSV file that contains both our single outcome variable and all of our features (or predictor variables). You can download this file using the OCSdata
package:
Read the data in so you can follow along now!
value
| outcome variableRows: 876
Columns: 50
$ id <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
$ value <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ CMAQ <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000 <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000 <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000 <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000 <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000 <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000 <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000 <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013 <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006 <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…
There are 48 features with values for each of the 876 monitors (observations).
The data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).
Variable | Details |
---|---|
id | Monitor number – the county number is indicated before the decimal – the monitor number is indicated after the decimal Example: 1073.0023 is Jefferson county (1073) and .0023 one of 8 monitors |
fips | Federal information processing standard number for the county where the monitor is located – 5 digit id code for counties (zero is often the first value and sometimes is not shown) – the first 2 numbers indicate the state – the last three numbers indicate the county Example: Alabama’s state code is 01 because it is first alphabetically (note: Alaska and Hawaii are not included because they are not part of the contiguous US) |
Lat | Latitude of the monitor in degrees |
Lon | Longitude of the monitor in degrees |
state | State where the monitor is located |
county | County where the monitor is located |
city | City where the monitor is located |
CMAQ | Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ) – A monitoring system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution – Does not use any of the PM2.5 gravimetric monitoring data. (There is a version that does use the gravimetric monitoring data, but not this one!) – Data from the EPA |
zcta | Zip Code Tabulation Area where the monitor is located – Postal Zip codes are converted into “generalized areal representations” that are non-overlapping – Data from the 2010 Census |
zcta_area | Land area of the zip code area in meters squared – Data from the 2010 Census |
zcta_pop | Population in the zip code area – Data from the 2010 Census |
imp_a500 | Impervious surface measure – Within a circle with a radius of 500 meters around the monitor – Impervious surface are roads, concrete, parking lots, buildings – This is a measure of development |
imp_a1000 | Impervious surface measure – Within a circle with a radius of 1000 meters around the monitor |
imp_a5000 | Impervious surface measure – Within a circle with a radius of 5000 meters around the monitor |
imp_a10000 | Impervious surface measure – Within a circle with a radius of 10000 meters around the monitor |
imp_a15000 | Impervious surface measure – Within a circle with a radius of 15000 meters around the monitor |
county_area | Land area of the county of the monitor in meters squared |
county_pop | Population of the county of the monitor |
Log_dist_to_prisec | Log (Natural log) distance to a primary or secondary road from the monitor – Highway or major road |
log_pri_length_5000 | Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) – Highways only |
log_pri_length_10000 | Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) – Highways only |
log_pri_length_15000 | Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) – Highways only |
log_pri_length_25000 | Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) – Highways only |
log_prisec_length_500 | Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log) – Highway and secondary roads |
log_prisec_length_1000 | Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log) – Highway and secondary roads |
log_prisec_length_5000 | Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) – Highway and secondary roads |
log_prisec_length_10000 | Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) – Highway and secondary roads |
log_prisec_length_15000 | Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) – Highway and secondary roads |
log_prisec_length_25000 | Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) – Highway and secondary roads |
log_nei_2008_pm25_sum_10000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log) |
log_nei_2008_pm25_sum_15000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log) |
log_nei_2008_pm25_sum_25000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log) |
log_nei_2008_pm10_sum_10000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log) |
log_nei_2008_pm10_sum_15000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log) |
log_nei_2008_pm10_sum_25000 | Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log) |
popdens_county | Population density (number of people per kilometer squared area of the county) |
popdens_zcta | Population density (number of people per kilometer squared area of zcta) |
nohs | Percentage of people in zcta area where the monitor is that do not have a high school degree – Data from the Census |
somehs | Percentage of people in zcta area where the monitor whose highest formal educational attainment was some high school education – Data from the Census |
hs | Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing a high school degree – Data from the Census |
somecollege | Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing some college education – Data from the Census |
associate | Percentage of people in zcta area where the monitor whose highest formal educational attainment was completing an associate degree – Data from the Census |
bachelor | Percentage of people in zcta area where the monitor whose highest formal educational attainment was a bachelor’s degree – Data from the Census |
grad | Percentage of people in zcta area where the monitor whose highest formal educational attainment was a graduate degree – Data from the Census |
pov | Percentage of people in zcta area where the monitor is that lived in poverty in 2008 - or would it have been 2007 guidelines??https://aspe.hhs.gov/2007-hhs-poverty-guidelines – Data from the Census |
hs_orless | Percentage of people in zcta area where the monitor whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs) |
urc2013 | 2013 Urban-rural classification of the county where the monitor is located – 6 category variable - 1 is totally urban 6 is completely rural – Data from the National Center for Health Statistics |
urc2006 | 2006 Urban-rural classification of the county where the monitor is located – 6 category variable - 1 is totally urban 6 is completely rural – Data from the National Center for Health Statistics |
aod | Aerosol Optical Depth measurement from a NASA satellite – based on the diffraction of a laser – used as a proxy of particulate pollution – unit-less - higher value indicates more pollution – Data from NASA |
Many of these features have to do with the circular area around the monitor called the “buffer”. These are illustrated in the following figure:
Scroll through the data, consider 1) if these data are in tidy format and 2) which variables you think may be most helpful in building a model to predict annual air pollution.