There's a lot of trash (or so I've been told). Unless you are Oscar the Grouch and literally live in a trash can, you probably don't like trash. On that note, does Oscar the Grouch even like trash? Maybe living in a trash can is why he's so grouchy (ha, see what I did there?). Maybe if he just tried to live a little more like Cookie Monster then things would be different.
But wait - you're never going to see this plot twist coming. What if all of the trash from Cookie Monster's cookies is what Oscar the Grouch lives in? Are there feuds between them on Sesame Street? I have no idea. But what I do know is that no one likes trash, and that's the important point.
New York City is a large city. After all, it's called the Big Apple, not the Medium Apple and certainly not the Little Apple. That would be outrageous. As such, it produces a Big Apple amount of trash. Naturally, such large amounts of trash also produce large amounts of data, just like these large amounts of words will hopefully get me an A on this project.
Analyzing this data, then, is exceedingly important. Firstly, it can help manage ongoing space issues regarding waste management. In places like New York City that simply do not have the space for landfills, knowing exactly what is being thrown away and being able to predict with relative accuracy what will be thrown away is extremely important. Furthermore, understanding what is being thrown away can help in determining how effective various environmental efforts are, and what aspects of those efforts need to be worked on in order to create a greener future.
The data in this project was provided by the New York Department of Sanitation (DSNY), and I kindly thank them for collecting and curating it.
The data can be downloaded at the following link:
https://catalog.data.gov/dataset/dsny-waste-characterization-comparative-results
For further details on the specifics of the data, please visit the following link:
https://data.cityofnewyork.us/dataset/DSNY-Waste-Characterization-Comparative-Results/bibp-6ff7/about_data
HTML generated using the following site:
https://htmtopdf.herokuapp.com/ipynbviewer/
Further reading:
https://medium.com/@rajeshwari2310/data-science-applications-in-waste-management-94d951f6f22a
https://www.analyticssteps.com/blogs/4-applications-big-data-waste-management
https://climate.cityofnewyork.us/subtopics/waste/#:~:text=New%20Yorkers%20produce%20nearly%20four,also%20contributes%20to%20climate%20change
https://news.climate.columbia.edu/2021/04/27/new-york-city-trash-dilemmas-opportunities/
Firstly, there isn't really much that can be done without the use of libraries. Or, I guess you could do everything without libraries, but why? People have already made these libraries for a reason, so let's use them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smfa
# suppress warnings to avoid unnecessary output clutter
import warnings
warnings.filterwarnings("ignore")
Now comes the time to load in the data. I pre-downloaded the data for convenience's sake. It is available at the following link:
https://catalog.data.gov/dataset/dsny-waste-characterization-comparative-results
Immediately displaying the table will help to identify certain issues with the data that will need to be addressed later.
For the sake of saving space, I will only use df.head()
here. However, larger ranges can and
should be used instead or entirely different row selections can be used in order to get a better grasp on
what needs to be fixed in the data. Things to look for are duplicate data, missing values, inconsistent
labels, and unnecessary columns.
df = pd.read_csv("./DSNY_Waste_Characterization_-_Comparative_Results.csv")
df.head()
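As a concrete example of the "different row selections" idea mentioned above, here are a few optional spot checks I sometimes run. This is just a minimal sketch; nothing later in the project depends on it.
# optional spot checks (sketch): other ways to look for duplicates, missing values, and odd types
print(df.sample(5, random_state=0))   # a random handful of rows instead of just the first few
print(df.dtypes)                      # numeric-looking columns stored as strings hint at values like '-'
print(df.isna().sum())                # missing values per column
print(df.duplicated().sum())          # fully duplicated rows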
Upon reviewing different parts of the table with different techniques as discussed above, there is some messiness in the data. Unknown values appear to be denoted as '-' in the table. This will be... unfortunate... if it isn't fixed, because '-' cannot be cast to an integer, float, or other number-like type. Mixing numbers and strings would make things exceedingly difficult going forward, so at this point '-' needs to be replaced with Numpy's NaN value.
As a way of showing this, I have included summing the "Aggregate Percent" column in the CSV both before and after the '-' was changed to NaN. It uses error handling to show that, without changing '-' to NaN, it does not work.
# error-handling allows me to properly show that it doesn't work at first
try:
# remove percent sign, convert to float, and sum
df["Aggregate Percent"] = df["Aggregate Percent"].str.rstrip("%").astype("float")
total = df["Aggregate Percent"].sum()
except ValueError:
print("It doesn't work before changing '-' to NaN")
# replace "-" with NaN
df.replace("-", np.nan, inplace=True)
# remove percent sign, convert to float, and sum
df["Aggregate Percent"] = df["Aggregate Percent"].str.rstrip("%").astype("float")
total = df["Aggregate Percent"].sum()
print("It does work after changing '-' to NaN")
df.head()
At this point, it might also be helpful to remove the percent signs from all of the other columns that have them. This can be done in more or less the same way as the example above.
Additionally, the "Refuse Percent" column appears to be formatted as a decimal value rather than in percent format. E.g., instead of 20%, it would say 0.20. This needs to be standardized as well, which is just a matter of multiplying it by 100 to match its scale with the rest of the percent columns.
Lastly, the combined values for each year in "Aggregate Percent" do not always add up to 100, so they need to be normalized to 100. The same has to be done for the other percent columns. Doing it here will also allow these changes to easily propagate to the new dataframe that will be created shortly.
# remove percent signs from other columns and convert to floats
df["MGP (Metal, Glass, Plastic) Percent"] = df["MGP (Metal, Glass, Plastic) Percent"]. \
str.rstrip("%").astype("float")
df["Paper Percent"] = df["Paper Percent"].str.rstrip("%").astype("float")
df["Organics Percent"] = df["Organics Percent"].str.rstrip("%").astype("float")
# multiply refuse percent by 100 to scale properly with other columns
df["Refuse Percent"] = df["Refuse Percent"] * 100
# normalize each percent column to 100
for each in ["Aggregate Percent", "Refuse Percent", "MGP (Metal, Glass, Plastic) Percent", \
"Paper Percent", "Organics Percent"]:
total2005 = df.loc[df["Year"] == 2005, each].sum()
total2013 = df.loc[df["Year"] == 2013, each].sum()
total2017 = df.loc[df["Year"] == 2017, each].sum()
total2023 = df.loc[df["Year"] == 2023, each].sum()
df.loc[df["Year"] == 2005, each] = df.loc[df["Year"] == 2005, each] / total2005 * 100
df.loc[df["Year"] == 2013, each] = df.loc[df["Year"] == 2013, each] / total2013 * 100
df.loc[df["Year"] == 2017, each] = df.loc[df["Year"] == 2017, each] / total2017 * 100
df.loc[df["Year"] == 2023, each] = df.loc[df["Year"] == 2023, each] / total2023 * 100
df.head()
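As an aside, the same normalization can be written more compactly with a groupby and transform. This is just an equivalent sketch of the loop above (the percentCols list is only introduced here for convenience); the rest of the project does not depend on it.
# equivalent normalization (sketch): scale each year's values so each percent column sums to 100
percentCols = ["Aggregate Percent", "Refuse Percent", "MGP (Metal, Glass, Plastic) Percent",
               "Paper Percent", "Organics Percent"]
df[percentCols] = df.groupby("Year")[percentCols].transform(lambda col: col / col.sum() * 100)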
Another thing that needs to be changed in the "Material Group" column is "C&D". Interestingly, DSNY had some inconsistencies, labeling some of their data as "Construction & Demolition" and some as "C&D". To fix this, I will be changing the "C&D" values to "Construction & Demolition". While these rows are joined later with the rows that already had "Construction & Demolition" in their "Material Group", those changes are only applied to simpleDf. I do it here as well so that this change is also applied to df.
# convert "C&D" to "Construction & Demolition" for consistency
# group by year and material group, summing the rest of the columns
# this is done to properly combine rows with duplicate year and material group combinations
df.loc[df["Material Group"] == "C&D", "Material Group"] = "Construction & Demolition"
df = df.groupby(["Year", "Material Group"]).sum().reset_index()
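As a quick optional sanity check (a sketch, nothing later relies on it), there should no longer be any repeated combinations of year and material group after the groupby above:
# sanity check (sketch): expect 0 duplicate (Year, Material Group) combinations after grouping
print(df.duplicated(subset=["Year", "Material Group"]).sum())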
There is also a decent amount of data that I simply will not need. The primary culprit of this is waste categorization - there are currently three columns for categorizing types of waste. While each has a different level of specificity and a different purpose, the only one that is likely to be needed in this project is the primary category, or the "Material Group" column. It is, however, important to keep a copy of the original dataframe just in case that data is needed at some point. So for these purposes, df will remain as the original dataframe and simpleDf will be the dataframe with only one waste category. As part of putting it into one category, the values will need to be summed as well. The "Generator (Residential, Schools, NYCHA)" column will also have to be dropped, because there is no way to accurately combine its different classes.
An unintended side effect of this is that all of the columns fit on my screen. While this may change depending on what a given person's screen size is, it is an important consideration to make. Even if you can scroll horizontally, it can be very helpful to have a table that is small enough to fit all of the columns on one screen without scrolling.
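Relatedly, if you ever do want to see every column regardless of screen width, pandas can be told not to truncate the display. This is purely optional and is just a sketch of that setting:
# optional (sketch): show every column instead of truncating wide tables
pd.set_option("display.max_columns", None)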
# create a new DF before dropping columns to keep the original DF intact
# drop all columns that are not needed for the analysis
# once again group by year and material group, summing the rest of the columns
simpleDf = df.copy()
simpleDf.drop(columns=["Comparative Category", "DSNY Diversion Summary Category", \
"Generator (Residential, Schools, NYCHA)"], inplace=True)
simpleDf = simpleDf.groupby(["Year", "Material Group"]).sum().reset_index()
simpleDf.head()
The first part of analyzing this data is visualizing it. By visualizing it, trends will be more noticeable. There are practically limitless ways that this data can be visualized; this tutorial will look at the main ways that will be useful for analyzing the data, and will explore the data based on a number of factors such as trash type and time.
This section will look specifically at the data from 2005. By using simpleDf from before, I will be using data that I have already slightly processed, allowing me to better display the data in a way that will not be visually overwhelming. Specifically, this graph will put percents against waste categories, separating types of percents (aggregate, refuse, paper, etc.) as different bars.
Doing this will create a good balance of having lots of information while at the same time still being easily understandable.
# take the 2005 data from simpleDf
df2005 = simpleDf[simpleDf["Year"] == 2005]
# only copy the columns that are needed for the plot
df2005Plot = df2005[["Aggregate Percent", "Refuse Percent", \
"MGP (Metal, Glass, Plastic) Percent", "Paper Percent", \
"Organics Percent"]]
# create the plot
df2005Plot.plot(kind="bar", figsize=(15, 5))
# create labels, title, etc.
plt.xlabel("Material Group")
plt.ylabel("Percent")
plt.title("Types of Waste in 2005")
plt.xticks(rotation=45)
# assigning to a variable to suppress unnecessary output
_ = plt.legend()
Now that I have established how to display this data, it's time to display it for the rest of the table. To do this, I can just wrap the code from the 2005 section in a for loop and iterate on the remaining years.
# wrap existing code in a for loop, iterate on the remaining years
for each in [2013, 2017, 2023]:
# take the data for the current iteration's year
tempDf = simpleDf[simpleDf["Year"] == each]
# only copy the columns that are needed for the plot
tempDfPlot = tempDf[["Aggregate Percent", "Refuse Percent", \
"MGP (Metal, Glass, Plastic) Percent", "Paper Percent", \
"Organics Percent"]]
# create the plot
tempDfPlot.plot(kind="bar", figsize=(15, 5))
# create labels, title, etc.
plt.xlabel("Material Group")
plt.ylabel("Percent")
plt.title("Types of Waste in " + str(each))
plt.xticks(rotation=45)
plt.legend()
There is already some information that can be extracted. The first thing that I noticed while looking at these graphs was that there was no data listed under "Organics" before 2017. While this is something that can certainly be seen by just looking at the table (or even at the CSV file itself), displaying it like this is a great example of the strengths of data visualization. To me, this shows the shifting priorities seen in society surrounding waste management. While there had already been pushes to increase recycling and paper use (as opposed to plastic bags, etc.) in the past, a more recent trend has been pushing toward a compostable future. Notably, this has been present at the University of Maryland, with virtually every restroom having a compost bin due to the high volume of paper towels that are used daily.
By including organics in their 2017 study, DSNY further recognized the growing importance of organic waste and compost in waste management and, especially in dense areas such as New York City, in dealing with such large amounts of waste.
Another important aspect to look at is how the overall proportions of waste have changed over time. Fortunately, there's really only one plot that has to be made this time, as opposed to four with the previous graphs.
This graph basically shows the proportions of waste in each year. In this case, I think using a line graph is a better idea, and will better show how each type of waste compares to other types of waste. A scatter plot may be another good choice (and I will make one later with different data), but it wouldn't show the trends as well because it does not have lines that grab the reader's attention.
# copy simpleDf to allow manipulation without altering simpleDf
# pivot to make rows years and columns material groups
# this puts it in the right format for the line plot function
# create the line plot
trashYearDf = simpleDf.copy()
trashYearDf = trashYearDf.pivot(index="Year", columns="Material Group", values="Aggregate Percent")
trashYearDf.plot(kind="line", figsize=(15, 5))
# create labels, title, etc.
# explicitly stating [2005, 2013, 2017, 2023] as the x-axis values in xticks() avoids issues with
# displaying things like 2005.5 that would be present if I didn't
plt.xlabel("Year")
plt.ylabel("Aggregate Percent")
plt.xticks([2005, 2013, 2017, 2023], rotation=45)
# assigning to a variable to suppress unnecessary output
_ = plt.title("Proportions of Waste Over Time")
This graph also shows some interesting data. Firstly, since I'm still sort of thinking about the organics aspect of the previous graphs and how those weren't available before 2017, I thought it was interesting to see that organics as a category of waste were being tracked before 2017. The difference is that this graph displays the waste categories from the "Material Group" column, rather than the separate percent columns shown in the previous graphs, which DSNY tracked differently. So while organics were tracked to some extent as a category before 2017, it was only starting in 2017 that they were deemed important enough to be broken out as their own percent statistic.
Overall, things seem pretty stable. I'm not entirely sure what I expected, but I guess with a city as big as New York, things sort of tend to average out. What I did notice, however, is that paper and organic waste seemed to change inversely from 2013 to 2017 and from 2017 to 2023. From 2013 to 2017, the proportion of paper waste increased by roughly 10%, while the proportion of organic waste decreased by a little less than 10%. From 2017 to 2023 the trend partially reversed: paper waste went back down, though not by the full 10%, so it "corrected" some but not all of the previous period's change. Similarly, organic waste increased by a couple of percent from 2017 to 2023; as with paper, it didn't fully make up for the earlier change, but it did start to move back toward the previous values.
The last graph that I will display here is a violin plot of years and aggregate percentages. As with the other graphs, this doesn't display much that can't be determined from just looking at the existing data, or even looking at the previous graphs. However, it shows distributions and, as such, may allow people to make different insights that may be valuable.
# start by melting the dataframe
# transposing and melting puts it into a format that can be used by the violin plot function
# it effectively gets rid of the column names, and gives each value its own row
# create the violin plot
meltedDf = trashYearDf.T.melt(var_name="Year", value_name="Aggregate Percent")
sns.violinplot(x="Year", y="Aggregate Percent", data=meltedDf)
# assigning to a variable to suppress unnecessary output
_ = plt.xticks(rotation=45)
As stated above, much of what this shows can already be determined from previous data. However, the unique shape of a violin plot allows me to make certain insights into the data. For example, each of the four violins (one for each year in the table) is unimodal and skewed toward lower values. This gives me some important information - while there are waste types like paper and organics that comprise larger portions of the overall waste, they are a relative minority out of the total number of waste types. The majority of waste types make up smaller portions of the overall amount of waste - roughly 0% to 15%.
This does, however, change in 2017 and 2023, where it actually gets even more concentrated in the general area from 0% to 15%. This implies more diversity in the types of materials that are being produced and thus disposed of. While this change is relatively subtle, it would be interesting to see if it is a trend that will continue 5, 10, or 15 years from now, or perhaps even longer.
Let's start with some good old linear regression. Yay! A good thing to start with would be finding a line of best fit for each of the five most common categories of waste. To do this, I used an external Python library that handles the actual modeling for me.
# insert a new column and initialize to 0
# this allows me to call update() on the dataframe later
simpleDf.insert(7, "Regression Line", 0)
for each in ["Organics", "Paper", "Plastic", "Metal", "Glass"]:
# iterate over each material group
# create a linear regression model for each one
# get the predicted value, and update simpleDf with it
temp = simpleDf[simpleDf["Material Group"] == each]
aggrRegrLine = LinearRegression().fit(temp[["Year"]], temp["Aggregate Percent"])
temp["Regression Line"] = aggrRegrLine.predict(temp[["Year"]])
simpleDf.update(temp)
# create the scatter plot for each material group
# add the regression line
temp.plot.scatter(x="Year", y="Aggregate Percent", title=each, figsize=(15, 5))
temp.plot.line(x="Year", y="Regression Line", ax=plt.gca(), legend=False)
plt.xticks([2005, 2013, 2017, 2023], rotation=45)
So, now that I am able to generate and display the regression lines, how can I make this useful? On their own, each graph doesn't really add much value in terms of data analysis, so I need to combine them.
This step can be a little tricky, since matplotlib (which pandas uses when calling plot()) is most intuitive for very simple graphs that don't display a lot of data. Once you start trying to combine graphs while still keeping them readable, things get more complicated.
To accomplish this, there are multiple things that need to be done. Firstly, I need to use matplotlib directly by calling the plt module, which was imported at the very beginning of this project. Because I will be combining a bunch of data, it is simply easiest to work more directly with matplotlib, as opposed to trying to render the data correctly through the dataframe.
Secondly, I need to set colors for each type of waste, which is necessary in order to be able to tell which scatter points are referring to which type of waste.
Lastly, I must follow the time-tested adage, "if it ain't broke, don't fix it." The code for finding lines of best fit that uses LinearRegression() from the previous part of this project worked; it was not broken, therefore I should not alter it. I know I should not alter it because I tried. When I tried to change it, it broke, so I just left the code how it already was, albeit with some minor modifications since I am directly calling the plt module.
# create a plot directly through matplotlib
# this makes it much easier to combine all of the data from the previous graphs into one
fig, ax = plt.subplots(figsize=(15, 5))
# create a dictionary to assign colors to each material group
c = {"Organics": "blue", "Paper": "green",
"Plastic": "red", "Metal": "cyan", "Glass": "magenta"}
# reset regression line column to 0
simpleDf.drop(columns="Regression Line", inplace=True)
simpleDf.insert(7, "Regression Line", 0)
# iterate over each material group
for each in ["Organics", "Paper", "Plastic", "Metal", "Glass"]:
# create a linear regression model for each one and update simpleDf
temp = simpleDf[simpleDf["Material Group"] == each]
aggrRegrLine = LinearRegression().fit(temp[["Year"]], temp["Aggregate Percent"])
temp["Regression Line"] = aggrRegrLine.predict(temp[["Year"]])
simpleDf.update(temp)
# because I already called matplotlib manually, each iteration of the for loop
# will just go to the same plot
# create the scatter plot for each material group
temp.plot.scatter(x="Year", y="Aggregate Percent",
ax=ax, label=each, color=c[each])
# add the regression line
temp.plot.line(x="Year", y="Regression Line",
ax=ax, legend=False, color=c[each])
# create labels, title, etc.
plt.xticks([2005, 2013, 2017, 2023], rotation=45)
plt.title("Aggregate Percent by Material Group")
plt.xlabel("Year")
plt.ylabel("Aggregate Percent")
# assigning to a variable to suppress unnecessary output
_ = plt.legend()
Just like the graphs from the previous section (Exploratory Data Analysis and Data Visualization), this is further helping to analyze the data. By using linear regression to create lines of best fit and, further, by combining each of those lines of best fit into a single graph, I am better able to visualize trends. Even as compared to the line graph in the "Exploratory Data Analysis and Data Visualization" section, this seems like it presents a lot of new information.
Firstly, this graph cuts down on the "visual noise" presented through the ups and downs of the line graph in the "Exploratory Data Analysis and Data Visualization" section. Instead, this clearly shows me overall trends over the eighteen year time period of the four studies, with the slope of each line of best fit being fairly clear. For example, organics waste overall had a proportional increase from 2005 to 2023, however not as much of a proportional increase as paper waste did.
Secondly, this graph still displays the scatter points, allowing me to see the nuances of the data and shorter-period up and down trends that don't necessarily reflect the entire trend of a given type of waste from 2005 to 2023. By doing this, I am effectively able to include the data from the line graph in the "Exploratory Data Analysis and Data Visualization" section, while still cutting down on the distractions that it might produce when trying to determine overall trends.
Now that I have made these models and displayed them, it is important to determine whether or not they are actually useful. To do this formally, I need to find the p-values. Fortunately, as with the regression models, there is already a module that I can use to do this. Annoyingly, however, this module requires me to rename the "Aggregate Percent" column so that it doesn't have a space in it, in order to avoid syntax issues with how the module handles input strings. I don't like this, so I ultimately change the space to an underscore and then change it back to a space immediately after using the module. As before, I will only do this for the five most common waste types.
# rename "Aggregate Percent" to have an underscore
# required by ols() function
simpleDf.rename(columns={"Aggregate Percent": "Aggregate_Percent"}, inplace=True)
# find stats for a linear regression model for each material group
organicsStats = smfa.ols("Aggregate_Percent ~ Year", data=simpleDf[simpleDf["Material Group"] == "Organics"]).fit()
paperStats = smfa.ols("Aggregate_Percent ~ Year", data=simpleDf[simpleDf["Material Group"] == "Paper"]).fit()
plasticStats = smfa.ols("Aggregate_Percent ~ Year", data=simpleDf[simpleDf["Material Group"] == "Plastic"]).fit()
metalStats = smfa.ols("Aggregate_Percent ~ Year", data=simpleDf[simpleDf["Material Group"] == "Metal"]).fit()
glassStats = smfa.ols("Aggregate_Percent ~ Year", data=simpleDf[simpleDf["Material Group"] == "Glass"]).fit()
# rename it back because I like this one better
simpleDf.rename(columns={"Aggregate_Percent": "Aggregate Percent"}, inplace=True)
# print p-values for each material group
print("Organics Waste Model p-value:\n" + str(organicsStats.pvalues) + "\n")
print("Paper Waste Model p-value:\n" + str(paperStats.pvalues) + "\n")
print("Plastic Waste Model p-value:\n" + str(plasticStats.pvalues) + "\n")
print("Metal Waste Model p-value:\n" + str(metalStats.pvalues) + "\n")
print("Glass Waste Model p-value:\n" + str(glassStats.pvalues) + "\n")
So now that I have the p-values, there are some things I can tell from them. P-values are generally indicative of statistical significance, with lower values indicating stronger significance and, if they are low enough, effectively rejecting the null hypothesis. Typically, this would require a p-value of less than either 0.05 or 0.01, depending on preference. As seen above, however, none of the values are quite there - they all range roughly from 0.15 to 0.9.
So what does this mean? Does this indicate that the models used are just as effective as chance? I would say no. Firstly, there is a very small sample size for each regression line, with just one piece of data for each waste type in each year, or four pieces of data for each waste type across all four years. Researchers have indicated that p-values have limits to their usefulness and need to be interpreted flexibly. A 2013 article in Information Systems Research, for instance, indicates that extremely large numbers of samples will almost always lead to a low p-value. (1) Inversely, it can then be expected that extremely small sample sizes (such as these, where there are only four per model) will generally lead to higher p-values. Therefore, I think in this specific case it is reasonable to disregard the p-values entirely. There is simply not enough data in the table to make models that are strong enough. That being said, I am comfortable using these models going forward - the only thing that needs to be kept in mind when using them is that they may not be very accurate. If there were many more samples being used and the p-values were still this high, then these models should certainly be rejected, but since there are so few samples, I believe this is a reasonable course of action.
(1) Mingfeng Lin, Henry C. Lucas Jr, Galit Shmueli (2013) Research Commentary—Too Big to Fail: Large Samples and the p-Value Problem. Information Systems Research 24(4):906-917. https://dx.doi.org/10.1287/isre.2013.0480
One of the great parts of linear regression models is that they don't require data from the year they are predicting for, which allows us to predict what these proportions will look like in the future. Again, it should be kept in mind that these predictions may not be very accurate because of the low sample size and high p-values. Regardless, it can still give me some good insights.
# create a new dataframe to hold the future data predictions
futureDf = simpleDf.copy()
# call matplotlib manually to create the plot
# once again puts everything in one plot, even over multiple iterations
fig, ax = plt.subplots(figsize=(15, 5))
# create a dictionary to assign colors to each material group
c = {"Organics": "blue", "Paper": "green",
"Plastic": "red", "Metal": "cyan", "Glass": "magenta"}
# reset regression line column to 0
futureDf.drop(columns="Regression Line", inplace=True)
futureDf.insert(7, "Regression Line", 0)
# create a new row for each material group in 2028
# add new rows to futureDf
future_rows = pd.DataFrame({'Year': [2028]*5,
'Material Group': ['Organics', 'Paper', 'Plastic', 'Metal', 'Glass']})
futureDf = pd.concat([futureDf, future_rows], ignore_index=True)
# iterate over each material group
for each in ["Organics", "Paper", "Plastic", "Metal", "Glass"]:
# create a linear regression model for each one and update futureDf
# skip 2028 data when creating the models
# BUT include 2028 data when predicting the regression line
# update futureDf with new values
temp = futureDf[futureDf["Material Group"] == each]
aggrRegrLine = LinearRegression().fit(temp[temp["Year"] != 2028][["Year"]], \
temp[temp["Year"] != 2028]["Aggregate Percent"])
temp["Regression Line"] = aggrRegrLine.predict(temp[["Year"]])
futureDf.update(temp)
# create the scatter plot for each material group
temp.plot.scatter(x="Year", y="Aggregate Percent",
ax=ax, label=each, color=c[each])
# add the regression line
temp.plot.line(x="Year", y="Regression Line",
ax=ax, legend=False, color=c[each])
# create labels, title, etc.
plt.xticks([2005, 2013, 2017, 2023, 2028], rotation=45)
plt.title("Aggregate Percent by Material Group")
plt.xlabel("Year")
plt.ylabel("Aggregate Percent")
plt.legend()
# get and print the estimates for each material group in 2028
organicsEstimate = futureDf.loc[(futureDf["Material Group"] == "Organics") & (futureDf["Year"] == 2028), "Regression Line"].values[0]
paperEstimate = futureDf.loc[(futureDf["Material Group"] == "Paper") & (futureDf["Year"] == 2028), "Regression Line"].values[0]
plasticEstimate = futureDf.loc[(futureDf["Material Group"] == "Plastic") & (futureDf["Year"] == 2028), "Regression Line"].values[0]
metalEstimate = futureDf.loc[(futureDf["Material Group"] == "Metal") & (futureDf["Year"] == 2028), "Regression Line"].values[0]
glassEstimate = futureDf.loc[(futureDf["Material Group"] == "Glass") & (futureDf["Year"] == 2028), "Regression Line"].values[0]
print("Organics Aggregate Percent Estimate for 2028: " + str(organicsEstimate) + "%")
print("Paper Aggregate Percent Estimate for 2028: " + str(paperEstimate) + "%")
print("Plastic Aggregate Percent Estimate for 2028: " + str(plasticEstimate) + "%")
print("Metal Aggregate Percent Estimate for 2028: " + str(metalEstimate) + "%")
print("Glass Aggregate Percent Estimate for 2028: " + str(glassEstimate) + "%")
Yay! Now I have future predictions. A limitation of using linear regression to predict future values, however, is that the predictions are effectively just extensions of a straight line. They more or less assume that the conditions that were in place when the model was created will continue beyond the last year of data. That being said, I think this is a decent prediction for this situation. In general, each value has been fairly stable from 2005 to 2023, with organics and paper being the biggest exceptions. As compost initiatives continue, I believe that it is reasonable to assume that paper waste will continue to rise relative to other types of waste as people compost paper towels, etc. more. While there is no way to perfectly predict the future, techniques like this are able to make reasonable estimates based on existing trends. Accuracy will always be up for debate (even with much more complex models), so only time will tell for certain how accurate these predictions will be.
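To make the "extension of a straight line" point concrete, here is a minimal sketch that recomputes one of the 2028 estimates by hand from the fitted slope and intercept. It reuses aggrRegrLine, which at this point still holds the last model fit in the loop above (the Glass model), and should match the glass estimate printed above.
# a LinearRegression prediction is just slope * year + intercept (sketch)
slope = aggrRegrLine.coef_[0]
intercept = aggrRegrLine.intercept_
print("Glass estimate for 2028, computed by hand: " + str(slope * 2028 + intercept) + "%")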
Now, it is time to make a more complex model based on gradient descent. This uses the gradient descent algorithm that was shown in lecture this semester, which makes this whole process much easier.
# gradient descent algorithm from lecture
def grad_descent(X, y, T, alpha):
    m, n = X.shape
    # start with all weights at zero; f records the loss at every iteration
    theta = np.zeros(n)
    f = np.zeros(T)
    for i in range(T):
        # squared-error loss at the current weights
        f[i] = 0.5*np.linalg.norm(X.dot(theta)-y)**2
        # gradient of that loss with respect to theta
        g = np.transpose(X).dot(X.dot(theta)-y)
        # step downhill, scaled by the learning rate alpha
        theta = theta - alpha*g
    return (theta, f)
# create numeric classifications for each type of waste
# this will make it easier to make numerical predictions, which is how gradient descent works
numerics = {"Organics": 0, "Paper": 1, "Plastic": 2, "Metal": 3, "Glass": 4}
revNumerics = {0: "Organics", 1: "Paper", 2: "Plastic", 3: "Metal", 4: "Glass"}
# create new dataframes for training and testing the model
# add a bias column for gradient descent
# convert material group to a numeric value and put in a new column
trainingDf = simpleDf.copy()
trainingDf["bias"] = 1
trainingDf["numericWasteType"] = trainingDf["Material Group"].apply(lambda x: numerics[x] if x in numerics else -1)
# remove rows with -1 in the numericWasteType column
# -1 was assigned to anything not in the five most common types of waste
trainingDf = trainingDf[trainingDf["numericWasteType"] != -1]
testingDf = trainingDf[trainingDf["Year"] == 2005]
trainingDf = trainingDf[trainingDf["Year"] != 2005]
# call grad_descent to create the model
# "Aggregate Percent," "Refuse Percent," "MGP (Metal, Glass, Plastic) Percent," "Paper Percent,"
# "Organics Percent," and "bias" are the independent variables
# "numericWasteType" is the dependent variable
# I chose 10,000 iterations and 0.00001 learning rate, which produced good accuracy
# HOWEVER this requires trial and error to get to this point
# e.g., I started out with 500 iterations and 0.0001 learning rate and had to adjust from there
trainingModel = grad_descent(trainingDf[["Aggregate Percent", "Refuse Percent", \
"MGP (Metal, Glass, Plastic) Percent", "Paper Percent", \
"Organics Percent", "bias"]], \
trainingDf["numericWasteType"], 10000, 0.00001)
# get the weights and bias
aggrWeight = trainingModel[0][0]
refuseWeight = trainingModel[0][1]
mgpWeight = trainingModel[0][2]
paperWeight = trainingModel[0][3]
organicsWeight = trainingModel[0][4]
bias = trainingModel[0][5]
# print weights and bias
print("Aggregate Weight: " + str(aggrWeight))
print("Refuse Weight: " + str(refuseWeight))
print("MGP Weight: " + str(mgpWeight))
print("Paper Weight: " + str(paperWeight))
print("Organics Weight: " + str(organicsWeight))
print("Bias: " + str(bias))
So now that I have the weights, I can take it a step further and try to see how good this model is by comparing predicted values with the actual values from the 2005 data, which the model was not trained on.
# test the model by predicting the waste type for 2005
# effectively: (feature1 * weight1) + (feature2 * weight2) + ... + bias
testingDf["predictedWasteType"] = testingDf[["Aggregate Percent", "Refuse Percent", \
"MGP (Metal, Glass, Plastic) Percent", "Paper Percent", \
"Organics Percent"]].dot(trainingModel[0][0:-1]) \
+ trainingModel[0][-1]
# round to integers and adjust out-of-bounds values to be the nearest min or max
testingDf["predictedWasteType"] = testingDf["predictedWasteType"].apply(lambda x: 0 if x < 0 else 4 if x > 4 else round(x))
# get the string value of the numeric waste type and assign it
testingDf["predictedWasteType"] = testingDf["predictedWasteType"].apply(lambda x: revNumerics[round(x)])
# determine correctness of predictions
correct = 0
total = 0
for _, row in testingDf.iterrows():
    if row["Material Group"] == row["predictedWasteType"]:
        correct += 1
    total += 1
# print the correctness
print("Correct Predictions: " + str(correct))
print("Total Predictions: " + str(total))
print("Accuracy: " + str(correct / total))
So with all of that, it looks like I managed to get an accuracy of 0.8 with this model, which is pretty good! With five categories, random guessing would only be right about 20% of the time, so my model is much better at predicting waste categories than chance alone.
Having a relatively accurate model like this can be extremely useful for predicting types of waste. While this relies on certain data already being known (e.g., aggregate percent), it still has useful applications, such as properly classifying some unknown material based on how it is commonly processed. As technology continues to develop, new materials continue to be made, and waste management continues to be a big issue, this remains very important.
Now that I have finished all of the actual development aspects of the project, I need to move onto the purpose of data science. What is any of this information even good for? What kind of insights can be made from this data? How was this data collected and processed?
I'll start with data collection and processing. This data was all collected and initially processed by DSNY in accordance with the laws of New York and the United States. Its purpose is strongly ethical, and the data is necessary in order to effectively maintain waste management systems in New York City. The data is free to use and download (access links are at the beginning of this project) and, for the purposes of this project, was processed heavily by me in order to establish trends in waste management in New York City from 2005 to 2023 and make predictions for what those trends will look like in the future. None of the data collection or processing that has been conducted has been done with the purpose of personally identifying people or harming anyone. Lastly regarding ethics, I did not seek to alter or otherwise misrepresent my results by "p-hacking" or any other techniques; to the best of my ability, I have ensured that everything shown in this project is fully and completely representative of the initial data. While my specific project has not undergone any formal ethics review beforehand, I believe it is in accordance with accepted ethical standards.
This data is also reproducible, which is an ever-growing concern in real-world research. This has been one of my main focuses throughout the project. Not only is this a sign of good faith by me or anyone else who makes their own reproducible projects, but it is also a way of holding my and countless other people's work accountable. By making work reproducible, it gives other people the opportunity to verify it - so for example, if it turns out I have accidentally made a huge mistake in this project and someone else notices it because they follow the exact same steps and don't get the same results, then that person can let me know, serving as both a learning experience and as a way to preemptively fix problems that may have been caused by those initial bad results.
So what about actual insights? What about the data and the graphs in this project? What do those tell us? Ultimately, it boils down to what the needs of DSNY are (or really the needs of any other waste management organization as well). DSNY faces unique challenges due to the lack of space in New York City for landfills. Being able to properly analyze data like this and make models based off of it is crucial in such logistically complex organizations; whether determining better recycling/reusing methods, determining which landfill external to the city to route waste to, allocating funds to the construction of new landfills, or hiring specialists who can properly determine how to handle potentially hazardous materials, these kinds of insights are absolutely necessary. Ultimately, DSNY and other waste management organizations are an underrecognized backbone of society that were crucial to the development of cities, such as with the urban migrations of the late nineteenth and early twentieth centuries in the United States, and remain extremely important to the continued existence of cities. Having the ability to properly handle data from the massive volumes of waste they process is critical to keeping cities across the world functioning properly.