Austin Animal Center Shelter Outcomes
Dataset, Model Target
The dataset used for this blog post is the Austin Animal Center Shelter Outcomes dataset, which can be found on Kaggle: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and
Code for the ML model: https://github.com/urenajose/DS-Unit-1-Build/blob/master/Unit%202%20build/DS_Unit_2_Build_ACC_Shelter_Dataset_ipyn.ipynb
The dataset shape is (78244, 22). All the columns are of string (object) type; most of the information is categorical, and a few columns carry datetime information. The raw columns are summarized below (a short loading and inspection sketch follows the summary).
0 age_upon_outcome 78248 non-null object
1 animal_id 78256 non-null object
2 animal_type 78256 non-null object
3 breed 78256 non-null object
4 color 78256 non-null object
5 date_of_birth 78256 non-null object
6 datetime 78256 non-null object
7 monthyear 78256 non-null object
8 name 54370 non-null object
9 outcome_subtype 35963 non-null object
10 outcome_type 78244 non-null object
11 sex_upon_outcome 78254 non-null object
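A minimal sketch of the loading and inspection step, assuming the CSV file name from the Kaggle dataset:

import pandas as pd

# Load the raw shelter outcomes CSV (file name assumed from the Kaggle dataset)
df_shelter = pd.read_csv('aac_shelter_outcomes.csv')
print(df_shelter.shape)
df_shelter.info()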
I selected “outcome_type” as the target for the model. Although it is a multi-class categorical problem, I decided to turn it into a binary problem: did the animal live or die? The outcome counts are below (a sketch of the binary mapping follows them).
Adoption 33112
Transfer 23499
Return to Owner 14354
Euthanasia 6080
Died 680
Disposal 307
Rto-Adopt 150
Missing 46
Relocate 16
NaN 12
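A reconstruction of the binary mapping; grouping the labels this way reproduces the 0.90968 / 0.09032 split shown in the next section, though the exact code lives in the linked notebook:

import numpy as np

# 1 = lived, 0 = died; the 12 rows without an outcome stay unlabeled
died = ['Euthanasia', 'Died', 'Disposal']
df_shelter['Target'] = np.where(df_shelter['outcome_type'].isin(died), 0.0, 1.0)
df_shelter.loc[df_shelter['outcome_type'].isna(), 'Target'] = np.nan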
Baseline
With Lived = 1 and Died = 0, once the categories were unified I ended up with the following result: 90.9% of the animals lived and 9.1% died. Since the majority class occurs at roughly 91% frequency, the classes are heavily imbalanced.
df_shelter['Target'].value_counts(normalize=True)
1.0 0.90968
0.0 0.09032
Random Forest
Without wrangling the data any further, I decided to use a Random Forest Classifier with an Ordinal Encoder to check feature importance and get an early score on the dataset.
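A minimal sketch of that first pass, assuming a category_encoders Ordinal Encoder and an existing train split (variable names are assumptions):

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Encode every categorical column as integer codes, then fit the forest
rf_pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
)
rf_pipeline.fit(X_train, y_train)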
1st Permutation importance
As per the permutation importance of this early model (a sketch of the computation follows the list):
0.0649 ± 0.0008 animal_type
0.0622 ± 0.0010 sex_upon_outcome
0.0370 ± 0.0009 breed
0.0356 ± 0.0005 age_upon_outcome
0.0303 ± 0.0005 animal_id
0.0301 ± 0.0004 color
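The "weight ± std" format above is what eli5's PermutationImportance reports, so here is a sketch assuming that library was used (the encoded validation set and variable names are assumptions):

import eli5
from eli5.sklearn import PermutationImportance

# Apply the fitted encoder to the validation features
X_val_encoded = rf_pipeline.named_steps['ordinalencoder'].transform(X_val)

# Permute one column at a time and measure the drop in score
permuter = PermutationImportance(
    rf_pipeline.named_steps['randomforestclassifier'],
    scoring='accuracy',
    n_iter=5,
    random_state=42,
)
permuter.fit(X_val_encoded, y_val)
eli5.show_weights(permuter, feature_names=X_val_encoded.columns.tolist())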
It also gave me an unexpectedly high cross-validation score of 0.9499; I had imagined it would be much lower.
As far as distinguishing between the classes goes, the model had a ROC AUC of 0.862.
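Those two numbers can be computed along these lines (the accuracy scoring and the validation split are assumptions):

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Cross-validated accuracy (~0.95 quoted above)
print(cross_val_score(rf_pipeline, X_train, y_train, cv=5, scoring='accuracy').mean())

# ROC AUC on held-out data (0.862 quoted above)
y_pred_proba = rf_pipeline.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, y_pred_proba))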
precision recall f1-score support
0.0 0.90 0.48 0.62 1421
1.0 0.95 0.99 0.97 14227
accuracy 0.95 15648
macro avg 0.92 0.74 0.80 15648
weighted avg 0.95 0.95 0.94 15648
XGB Classifier
I proceeded to wrangle the data and create some extra features, both to see if I could improve the model and to give it more flexibility in the types of predictions I could attempt. I used an XGB Classifier with early stopping at 50 rounds, again with an Ordinal Encoder. My cross-validation score was not greatly improved compared to the Random Forest, coming in at 0.952; on the other hand, the ROC AUC score improved drastically to 0.913, a difference of about 0.05. I also saw changes in the classification report, where the recall for animals that died improved compared to the Random Forest score.
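A sketch of that setup (feature names and splits assumed; depending on the xgboost version, early_stopping_rounds is passed to the constructor or to fit):

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,          # high cap; early stopping picks the best round
    eval_metric='auc',
    early_stopping_rounds=50,   # older xgboost versions take this in .fit() instead
    n_jobs=-1,
    random_state=42,
)
model.fit(
    X_train_encoded, y_train,
    eval_set=[(X_val_encoded, y_val)],
    verbose=False,
)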
precision recall f1-score support
0.0 0.89 0.51 0.64 1405
1.0 0.95 0.99 0.97 14244
accuracy 0.95 15649
macro avg 0.92 0.75 0.81 15649
weighted avg 0.95 0.95 0.94 15649
I proceeded to run another permutation importance to see whether some of the features I had created contributed to the model. Some of them did, such as “Name_known”, which returns 1 if the name of the animal was known and 0 if the name was null; also, the year the animal was born had a higher importance than the complete date of birth.
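My reading of a few of those engineered features (the exact construction is in the linked notebook):

import pandas as pd

df = df_shelter.copy()
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
df['datetime'] = pd.to_datetime(df['datetime'])

df['Name_known'] = df['name'].notnull().astype(int)        # 1 if the animal had a name
df['year_birth'] = df['date_of_birth'].dt.year
df['month_birth'] = df['date_of_birth'].dt.month
df['year_outcome'] = df['datetime'].dt.year
df['month_outcome'] = df['datetime'].dt.month
df['day_outcome'] = df['datetime'].dt.day
df['age_upon_outcome_d'] = (df['datetime'] - df['date_of_birth']).dt.days   # age in days (assumed)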
2nd Permutation Importance
0.0450 ± 0.0011 animal_type
0.0367 ± 0.0005 Name_known
0.0134 ± 0.0004 breed
0.0093 ± 0.0006 age_upon_outcome_d
0.0088 ± 0.0004 sex_upon_outcome
0.0068 ± 0.0006 year_birth
0.0038 ± 0.0006 datetime
0.0028 ± 0.0002 year_outcome
0.0026 ± 0.0003 date_of_birth
0.0021 ± 0.0002 day_outcome
0.0019 ± 0.0002 color
0.0018 ± 0.0003 month_outcome
0.0017 ± 0.0004 month_birth
0.0013 ± 0.0002 selected_breed
0.0002 ± 0.0001 sex
As a general rule, the effect of each feature is diluted the more features the model has. But I expect some of these extra features to give more flexibility in policy creation, more insight from the model, and more ways to look at the data.
Logistic Regression
Finally, I used the dataset with a Logistic Regression model, which performed similarly to the Random Forest model with a cross-validation score of 0.948 and a better ROC AUC score of 0.895, just slightly worse than the XGB Classifier. The Logistic Regression model was much quicker, with a wall time of 282 µs (microseconds), compared to 0.495 ms (milliseconds) for the XGB Classifier. It was also much, much faster than the 23.6 s (seconds) of the Random Forest.
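A minimal Logistic Regression sketch, assuming the same ordinal-encoded features (scaling added here because it usually helps the solver; the notebook may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
log_reg.fit(X_train_encoded, y_train)

# In a notebook, the wall times quoted above come from something like:
# %time log_reg.predict(X_val_encoded)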
precision recall f1-score support
0.0 0.84 0.48 0.61 1405
1.0 0.95 0.99 0.97 14244
accuracy 0.95 15649
macro avg 0.90 0.74 0.79 15649
weighted avg 0.94 0.95 0.94 15649
Conclusion about the model selection: the XGBoost model produces the best results. If we had a larger dataset, the Logistic Regression model might be desirable because of its superior speed. The Random Forest is the least desirable model of the three.
Understanding our XGB Classifier Model
The partial dependence plot (PDP)
Animals that were either spayed or neutered (fixed) had improved outcomes. Females that were not spayed did better than males that were not neutered. An unknown status decreased the predicted outcome significantly.
If the animal had a name at the time of outcome, it improved the animal's predicted outcome by as much as 10%.
The younger the animal, the less negative impact not having a name had on the outcome.
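The post does not show the plotting code; one way to produce partial dependence plots like these is scikit-learn's PartialDependenceDisplay (scikit-learn >= 1.0), assuming an ordinal-encoded validation DataFrame with the column names below:

import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# One PDP per feature of interest, on the ordinal-encoded validation data
PartialDependenceDisplay.from_estimator(
    model,
    X_val_encoded,
    features=['sex_upon_outcome', 'Name_known', 'age_upon_outcome_d'],
)
plt.show()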
Drilling down to specific observation
If we look at the average impact on model output magnitude over the transformed training data, we can see which features have the greatest impact overall.
For an observation representing a bat, the features for the year of birth and the date and time of the outcome have the most significant impact, and the fact that the bat is a bat has a huge impact in lowering the score.
animal_type Other
breed Bat
color Brown
date_of_birth 2014-09-09T00:00:00
datetime 2015-09-10T09:08:00
sex_upon_outcome unknown
Name_known 0
age_upon_outcome_d 366
sex unknown
year_birth 2014
month_birth 9
year_outcome 2015
month_outcome 9
day_outcome 10
selected_breed Non_dog_cat
For an observation representing a cat, the feature sex_upon_outcome = neutered male has the biggest positive magnitude; being only ~3 months of age and having a name like “Yoda” does not hurt either!
animal_type Cat
breed Domestic Longhair
color Brown Tabby
date_of_birth 2014-03-16T00:00:00
datetime 2014-06-20T18:22:00
sex_upon_outcome neutered male
Name_known 1
age_upon_outcome_d 96
sex male
year_birth 2014
month_birth 16
year_outcome 2014
month_outcome 6
day_outcome 20
selected_breed Cat
Name: 21087, dtype: object
“Yoda” the cat: average feature impact on prediction output magnitude. Here we can see more clearly which features have the greatest magnitude on Yoda's outcome, and being a Jedi is not one of them. We can also see that the per-observation feature magnitudes do not align 1:1 with the feature magnitudes of the overall model.
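These impact plots read like SHAP value plots ("average impact on model output magnitude" is the label on a SHAP summary bar plot); a minimal sketch with the shap library, assuming the fitted XGB model, an encoded validation DataFrame, and Yoda's row index from above:

import shap

# Explain the fitted gradient-boosted trees
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val_encoded)

# Overall bar plot: mean |SHAP value| per feature
shap.summary_plot(shap_values, X_val_encoded, plot_type='bar')

# Drill down to one observation, e.g. "Yoda" the cat (row 21087 above)
sample = X_val_encoded.loc[[21087]]
sample_shap = explainer.shap_values(sample)[0]
shap.force_plot(explainer.expected_value, sample_shap, sample.iloc[0], matplotlib=True)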