Austin Animal Center Shelter Outcomes
Dataset, Model Target
The dataset used for this blog post is the Austin Animal Center Shelter Outcomes dataset, which can be found on Kaggle: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and
Code for the ML model: https://github.com/urenajose/DS-Unit-1-Build/blob/master/Unit%202%20build/DS_Unit_2_Build_ACC_Shelter_Dataset_ipyn.ipynb
The dataset shape is (78244, 22). All the columns are of string (object) type; most of the information is categorical, and a few columns carry datetime information. The raw columns are summarized below (a short loading and inspection sketch follows the summary).
0 age_upon_outcome 78248 non-null object
1 animal_id 78256 non-null object
2 animal_type 78256 non-null object
3 breed 78256 non-null object
4 color 78256 non-null object
5 date_of_birth 78256 non-null object
6 datetime 78256 non-null object
7 monthyear 78256 non-null object
8 name 54370 non-null object
9 outcome_subtype 35963 non-null object
10 outcome_type 78244 non-null object
11 sex_upon_outcome 78254 non-null object
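A minimal sketch of the loading and inspection step, assuming the CSV file name from the Kaggle dataset:

import pandas as pd

# Load the raw shelter outcomes CSV (file name assumed from the Kaggle dataset)
df_shelter = pd.read_csv('aac_shelter_outcomes.csv')
print(df_shelter.shape)
df_shelter.info()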
I selected “outcome_type” as the target for the model. Although it is a multi-class categorical problem, I decided to turn it into a binary problem: did the animal live or die? The outcome counts are below (a sketch of the binary mapping follows them).
Adoption 33112
Transfer 23499
Return to Owner 14354
Euthanasia 6080
Died 680
Disposal 307
Rto-Adopt 150
Missing 46
Relocate 16
NaN 12
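A reconstruction of the binary mapping; grouping the labels this way reproduces the 0.90968 / 0.09032 split shown in the next section, though the exact code lives in the linked notebook:

import numpy as np

# 1 = lived, 0 = died; the 12 rows without an outcome stay unlabeled
died = ['Euthanasia', 'Died', 'Disposal']
df_shelter['Target'] = np.where(df_shelter['outcome_type'].isin(died), 0.0, 1.0)
df_shelter.loc[df_shelter['outcome_type'].isna(), 'Target'] = np.nan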
Baseline
With Lived = 1 and Died = 0, once the categories were unified I ended up with the following result: 90.9% of the animals lived and 9.1% died. Since the majority class occurs at roughly 91% frequency, the classes are heavily imbalanced.
df_shelter['Target'].value_counts(normalize=True)
1.0 0.90968
0.0 0.09032
Random Forest
Without wrangling the data any further, I decided to use a Random Forest Classifier with an Ordinal Encoder to check feature importance and get an early score on the dataset.
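A minimal sketch of that first pass, assuming a category_encoders Ordinal Encoder and an existing train split (variable names are assumptions):

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Encode every categorical column as integer codes, then fit the forest
rf_pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
)
rf_pipeline.fit(X_train, y_train)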
1st Permutation importance
As per the permutation importance of this early model (a sketch of the computation follows the list):
0.0649 ± 0.0008 animal_type
0.0622 ± 0.0010 sex_upon_outcome
0.0370 ± 0.0009 breed
0.0356 ± 0.0005 age_upon_outcome
0.0303 ± 0.0005 animal_id
0.0301 ± 0.0004 color
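The "weight ± std" format above is what eli5's PermutationImportance reports, so here is a sketch assuming that library was used (the encoded validation set and variable names are assumptions):

import eli5
from eli5.sklearn import PermutationImportance

# Apply the fitted encoder to the validation features
X_val_encoded = rf_pipeline.named_steps['ordinalencoder'].transform(X_val)

# Permute one column at a time and measure the drop in score
permuter = PermutationImportance(
    rf_pipeline.named_steps['randomforestclassifier'],
    scoring='accuracy',
    n_iter=5,
    random_state=42,
)
permuter.fit(X_val_encoded, y_val)
eli5.show_weights(permuter, feature_names=X_val_encoded.columns.tolist())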
It also gave me an unexpectedly high cross-validation score of 0.9499; I had imagined it would be much lower.
As far as distinguishing between the classes goes, the model had a ROC AUC of 0.862.
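Those two numbers can be computed along these lines (the accuracy scoring and the validation split are assumptions):

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Cross-validated accuracy (~0.95 quoted above)
print(cross_val_score(rf_pipeline, X_train, y_train, cv=5, scoring='accuracy').mean())

# ROC AUC on held-out data (0.862 quoted above)
y_pred_proba = rf_pipeline.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, y_pred_proba))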
precision recall f1-score support
0.0 0.90 0.48 0.62 1421
1.0 0.95 0.99 0.97 14227
accuracy 0.95 15648
macro avg 0.92 0.74 0.80 15648
weighted avg 0.95 0.95 0.94 15648
XGB Classifier
I proceeded to wrangle the data and create some extra features, both to see if I could improve the model and to give it more flexibility in the types of predictions I could attempt. I used an XGB Classifier with early stopping at 50 rounds, again with an Ordinal Encoder. My cross-validation score was not greatly improved compared to the Random Forest, coming in at 0.952; on the other hand, the ROC AUC score improved drastically to 0.913, a difference of about 0.05. I also saw changes in the classification report, where the recall for animals that died improved compared to the Random Forest score.
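A sketch of that setup (feature names and splits assumed; depending on the xgboost version, early_stopping_rounds is passed to the constructor or to fit):

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,          # high cap; early stopping picks the best round
    eval_metric='auc',
    early_stopping_rounds=50,   # older xgboost versions take this in .fit() instead
    n_jobs=-1,
    random_state=42,
)
model.fit(
    X_train_encoded, y_train,
    eval_set=[(X_val_encoded, y_val)],
    verbose=False,
)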
precision recall f1-score support
0.0 0.89 0.51 0.64 1405
1.0 0.95 0.99 0.97 14244
accuracy 0.95 15649
macro avg 0.92 0.75 0.81 15649
weighted avg 0.95 0.95 0.94 15649
I proceeded to run another permutation importance to see whether some of the features I had created contributed to the model. Some of them did, such as “Name_known”, which returns 1 if the name of the animal was known and 0 if the name was null; also, the year the animal was born had a higher importance than the complete date of birth.
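My reading of a few of those engineered features (the exact construction is in the linked notebook):

import pandas as pd

df = df_shelter.copy()
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
df['datetime'] = pd.to_datetime(df['datetime'])

df['Name_known'] = df['name'].notnull().astype(int)        # 1 if the animal had a name
df['year_birth'] = df['date_of_birth'].dt.year
df['month_birth'] = df['date_of_birth'].dt.month
df['year_outcome'] = df['datetime'].dt.year
df['month_outcome'] = df['datetime'].dt.month
df['day_outcome'] = df['datetime'].dt.day
df['age_upon_outcome_d'] = (df['datetime'] - df['date_of_birth']).dt.days   # age in days (assumed)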
2nd Permutation Importance
0.0450 ± 0.0011 animal_type
0.0367 ± 0.0005 Name_known
0.0134 ± 0.0004 breed
0.0093 ± 0.0006 age_upon_outcome_d
0.0088 ± 0.0004 sex_upon_outcome
0.0068 ± 0.0006 year_birth
0.0038 ± 0.0006 datetime
0.0028 ± 0.0002 year_outcome
0.0026 ± 0.0003 date_of_birth
0.0021 ± 0.0002 day_outcome
0.0019 ± 0.0002 color
0.0018 ± 0.0003 month_outcome
0.0017 ± 0.0004 month_birth
0.0013 ± 0.0002 selected_breed
0.0002 ± 0.0001 sex
As a general rule, the effect of each feature is diluted the more features the model has. But I expect some of these extra features to give more flexibility in policy creation, more insight from the model, and more ways to look at the data.
Logistic Regression
Finally, I used the dataset with a Logistic Regression model, which performed similarly to the Random Forest model with a cross-validation score of 0.948 and a better ROC AUC score of 0.895, just slightly worse than the XGB Classifier. The Logistic Regression model was much quicker, with a wall time of 282 µs (microseconds), compared to 0.495 ms (milliseconds) for the XGB Classifier. It was also much, much faster than the 23.6 s (seconds) of the Random Forest.
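A minimal Logistic Regression sketch, assuming the same ordinal-encoded features (scaling added here because it usually helps the solver; the notebook may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
log_reg.fit(X_train_encoded, y_train)

# In a notebook, the wall times quoted above come from something like:
# %time log_reg.predict(X_val_encoded)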
precision recall f1-score support
0.0 0.84 0.48 0.61 1405
1.0 0.95 0.99 0.97 14244
accuracy 0.95 15649
macro avg 0.90 0.74 0.79 15649
weighted avg 0.94 0.95 0.94 15649
Conclusion about the model selection: the XGBoost model produces the best results. If we had a larger dataset, the Logistic Regression model might be desirable because of its superior speed. The Random Forest is the least desirable model of the three.
Understanding our XGB Classifier Model
The partial dependence plot (PDP)
Animals that were either spayed or neutered (fixed) had improved outcomes. Females that were not spayed did better than males that were not neutered. An unknown status decreased the predicted outcome significantly.
If the animal had a name at the time of outcome, it improved the animal's predicted outcome by as much as 10%.
The younger the animal, the less negative impact not having a name had on the outcome.
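The post does not show the plotting code; one way to produce partial dependence plots like these is scikit-learn's PartialDependenceDisplay (scikit-learn >= 1.0), assuming an ordinal-encoded validation DataFrame with the column names below:

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# One PDP per feature of interest, on the ordinal-encoded validation data
PartialDependenceDisplay.from_estimator(
    model,
    X_val_encoded,
    features=['sex_upon_outcome', 'Name_known', 'age_upon_outcome_d'],
)
plt.show()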
Drilling down to specific observation
If we look at the average impact on model output magnitude over the transformed training data, we can see which features have the greatest impact overall.
For an observation representing a bat, the features for the year of birth and the date and time of the outcome have the most significant impact, and the fact that the bat is a bat has a huge impact in lowering the score.
animal_type Other
breed Bat
color Brown
date_of_birth 2014-09-09T00:00:00
datetime 2015-09-10T09:08:00
sex_upon_outcome unknown
Name_known 0
age_upon_outcome_d 366
sex unknown
year_birth 2014
month_birth 9
year_outcome 2015
month_outcome 9
day_outcome 10
selected_breed Non_dog_cat
For an observation representing a cat, the feature sex_upon_outcome = neutered male has the biggest positive magnitude; being only ~3 months of age and having a name like “Yoda” does not hurt either!
animal_type Cat
breed Domestic Longhair
color Brown Tabby
date_of_birth 2014-03-16T00:00:00
datetime 2014-06-20T18:22:00
sex_upon_outcome neutered male
Name_known 1
age_upon_outcome_d 96
sex male
year_birth 2014
month_birth 16
year_outcome 2014
month_outcome 6
day_outcome 20
selected_breed Cat
Name: 21087, dtype: object
“Yoda” the cat: average feature impact on prediction output magnitude. Here we can see more clearly which features have the greatest magnitude on Yoda's outcome, and being a Jedi is not one of them. We can also see that the per-observation feature magnitudes do not align 1:1 with the feature magnitudes of the overall model.
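These impact plots read like SHAP value plots ("average impact on model output magnitude" is the label on a SHAP summary bar plot); a minimal sketch with the shap library, assuming the fitted XGB model, an encoded validation DataFrame, and Yoda's row index from above:

import shap

# Explain the fitted gradient-boosted trees
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val_encoded)

# Overall bar plot: mean |SHAP value| per feature
shap.summary_plot(shap_values, X_val_encoded, plot_type='bar')

# Drill down to one observation, e.g. "Yoda" the cat (row 21087 above)
sample = X_val_encoded.loc[[21087]]
sample_shap = explainer.shap_values(sample)[0]
shap.force_plot(explainer.expected_value, sample_shap, sample.iloc[0], matplotlib=True)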