Austin Animal Center Shelter Outcomes

José Manuel Ureña
6 min read · Oct 25, 2020


Dataset, Model Target

The dataset used for this blog post is Austin Animal Center Shelter Outcomes; it can be found on Kaggle: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and

Code for the ML model: https://github.com/urenajose/DS-Unit-1-Build/blob/master/Unit%202%20build/DS_Unit_2_Build_ACC_Shelter_Dataset_ipyn.ipynb

The dataset shape is (78244, 22). All the columns are of string (object) type; most of the information is categorical, with some columns holding datetime information.

 #   Column            Non-Null Count  Dtype
 0   age_upon_outcome  78248 non-null  object
 1   animal_id         78256 non-null  object
 2   animal_type       78256 non-null  object
 3   breed             78256 non-null  object
 4   color             78256 non-null  object
 5   date_of_birth     78256 non-null  object
 6   datetime          78256 non-null  object
 7   monthyear         78256 non-null  object
 8   name              54370 non-null  object
 9   outcome_subtype   35963 non-null  object
10   outcome_type      78244 non-null  object
11   sex_upon_outcome  78254 non-null  object

I selected “outcome_type” as the target for the model. Although it is a multi-class categorical problem, I decided to turn it into a binary one: did the animal live or die?

Adoption           33112
Transfer           23499
Return to Owner    14354
Euthanasia          6080
Died                 680
Disposal             307
Rto-Adopt            150
Missing               46
Relocate              16
NaN                   12

Baseline

Lived = 1 and Died = 0. Once the categories were unified, I ended up with the following results: 90.9% of the animals lived, while 9.1% died. Since the majority class occurs with roughly 90% frequency, the classes are heavily imbalanced.

df_shelter['Target'].value_counts(normalize=True)

1.0    0.90968
0.0    0.09032
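The exact grouping is not shown in the post, but here is a minimal sketch of one mapping that reproduces the 0.90968 / 0.09032 split above, assuming Euthanasia, Died, and Disposal count as died, everything else (including Missing) counts as lived, and the 12 NaN rows are dropped:

import pandas as pd

# Hypothetical grouping; the notebook's actual wrangling may differ.
# Outcomes in which the animal did not survive:
died = {'Euthanasia', 'Died', 'Disposal'}

# Drop the 12 rows with no recorded outcome, then binarize.
df_shelter = df_shelter.dropna(subset=['outcome_type'])
df_shelter['Target'] = (~df_shelter['outcome_type'].isin(died)).astype(float)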

Random Forest

Without wrangling the data any further, I decided to use a Random Forest Classifier with an Ordinal Encoder to check feature importance and get an early score on the dataset.
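Here is a sketch of that first pass, assuming the category_encoders implementation of the ordinal encoder and an 80/20 train/validation split (the split ratio and hyperparameters are my assumptions, not the notebook's):

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

X = df_shelter.drop(columns='Target')
y = df_shelter['Target']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),  # integer-codes every categorical column
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
)
pipeline.fit(X_train, y_train)
print(cross_val_score(pipeline, X_train, y_train, cv=5).mean())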

ROC curve for the Random Forest model

1st Permutation Importance

As per the permutation importance of this early model (a sketch of how these numbers can be computed follows the table):

0.0649 ± 0.0008    animal_type
0.0622 ± 0.0010    sex_upon_outcome
0.0370 ± 0.0009    breed
0.0356 ± 0.0005    age_upon_outcome
0.0303 ± 0.0005    animal_id
0.0301 ± 0.0004    color
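The weight ± std format above matches eli5's output, so a plausible sketch uses eli5's PermutationImportance wrapped around the fitted forest, with the encoder applied first so the importances are computed on the transformed features:

import eli5
from eli5.sklearn import PermutationImportance

encoder = pipeline.named_steps['ordinalencoder']
forest = pipeline.named_steps['randomforestclassifier']
X_val_enc = encoder.transform(X_val)

permuter = PermutationImportance(forest, scoring='accuracy',
                                 n_iter=5, random_state=42)
permuter.fit(X_val_enc, y_val)
eli5.show_weights(permuter, feature_names=X_val_enc.columns.tolist())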

It also gave me an unexpectedly high cross-validation score of 0.9499; I imagined it would be much lower.

As for this model's ability to distinguish between the classes, it had a ROC AUC of 0.862:

              precision    recall  f1-score   support

         0.0       0.90      0.48      0.62      1421
         1.0       0.95      0.99      0.97     14227

    accuracy                           0.95     15648
   macro avg       0.92      0.74      0.80     15648
weighted avg       0.95      0.95      0.94     15648

XGB Classifier

I proceeded to wrangle the data and create some extra features, both to see if I could improve the model and to give it some flexibility in the types of predictions I could attempt. I used an XGB Classifier with early stopping at 50 rounds, again with an Ordinal Encoder. My cross-validation score was not greatly improved compared to Random Forest, coming in at 0.952; on the other hand, the model's ROC AUC score improved markedly to 0.913, a difference of about 0.05. I also saw changes in my classification report, where the recall for dead-animal outcomes improved compared to the Random Forest score.
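Here is a sketch of that setup, assuming the scikit-learn wrapper around XGBoost as it worked in 2020 (newer versions move early_stopping_rounds to the constructor); the n_estimators cap is my assumption:

import category_encoders as ce
from xgboost import XGBClassifier

encoder = ce.OrdinalEncoder()
X_train_enc = encoder.fit_transform(X_train)
X_val_enc = encoder.transform(X_val)

model = XGBClassifier(n_estimators=1000, n_jobs=-1, random_state=42)
model.fit(
    X_train_enc, y_train,
    eval_set=[(X_val_enc, y_val)],
    eval_metric='auc',         # track ROC AUC on the validation set
    early_stopping_rounds=50,  # stop after 50 rounds without improvement
)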

ROC curve for the XGB Classifier model

              precision    recall  f1-score   support

         0.0       0.89      0.51      0.64      1405
         1.0       0.95      0.99      0.97     14244

    accuracy                           0.95     15649
   macro avg       0.92      0.75      0.81     15649
weighted avg       0.95      0.95      0.94     15649

I then ran another permutation importance to see whether some of the features I had created contributed to the model. Some of them did: such is the case for “Name_known”, which returns 1 if the name of the animal was known and 0 if it was null. The year the animal was born also had higher importance than the complete date of birth.
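Here is a sketch of how features like these could be derived, using the column names that appear in the table below; the notebook's actual wrangle function may differ:

import pandas as pd

def wrangle(df):
    df = df.copy()
    # 1 if the animal's name was recorded, 0 if it was null.
    df['Name_known'] = df['name'].notna().astype(int)

    # Split the two datetime columns into coarser parts.
    dob = pd.to_datetime(df['date_of_birth'])
    out = pd.to_datetime(df['datetime'])
    df['year_birth'], df['month_birth'] = dob.dt.year, dob.dt.month
    df['year_outcome'] = out.dt.year
    df['month_outcome'] = out.dt.month
    df['day_outcome'] = out.dt.day

    # Age at outcome, in days.
    df['age_upon_outcome_d'] = (out - dob).dt.days
    return df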

2nd Permutation Importance

0.0450 ± 0.0011    animal_type
0.0367 ± 0.0005    Name_known
0.0134 ± 0.0004    breed
0.0093 ± 0.0006    age_upon_outcome_d
0.0088 ± 0.0004    sex_upon_outcome
0.0068 ± 0.0006    year_birth
0.0038 ± 0.0006    datetime
0.0028 ± 0.0002    year_outcome
0.0026 ± 0.0003    date_of_birth
0.0021 ± 0.0002    day_outcome
0.0019 ± 0.0002    color
0.0018 ± 0.0003    month_outcome
0.0017 ± 0.0004    month_birth
0.0013 ± 0.0002    selected_breed
0.0002 ± 0.0001    sex

As a general rule, the effect of each individual feature is diluted the more features the model has. But I expect some of these extra features to give more flexibility for policy creation, more insight from the model, and more ways to look at the data.

Logistic Regression

Finally, I used the dataset with a Logistic Regression model, which performed similarly to the Random Forest model, with a cross-validation score of 0.948 and a better ROC AUC score of 0.895, only slightly worse than the XGB Classifier. The Logistic Regression model was much quicker, with a wall time of 282 µs (microseconds), compared to 0.495 ms (milliseconds) for the XGB Classifier, and much, much faster than the 23.6 s (seconds) of the Random Forest.
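A sketch of the logistic regression setup; the scaler is my addition (it helps the solver converge on integer-encoded features), and the wall times above presumably come from notebook %time magics on predict calls:

import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

log_reg = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),  # scaling the encoded features speeds up convergence
    LogisticRegression(max_iter=1000),
)
log_reg.fit(X_train, y_train)
# In a notebook: %time log_reg.predict(X_val)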

              precision    recall  f1-score   support

         0.0       0.84      0.48      0.61      1405
         1.0       0.95      0.99      0.97     14244

    accuracy                           0.95     15649
   macro avg       0.90      0.74      0.79     15649
weighted avg       0.94      0.95      0.94     15649
Comparison of ROC curves for all 3 models
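A sketch of how such an overlay can be drawn with scikit-learn and matplotlib, reusing the fitted models from the sketches above (note the XGB model was fit on the encoded frame):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

models = {
    'Random Forest': (pipeline, X_val),    # pipeline encodes internally
    'XGB Classifier': (model, X_val_enc),  # fit on the encoded frame
    'Logistic Regression': (log_reg, X_val),
}

for name, (m, X_eval) in models.items():
    y_score = m.predict_proba(X_eval)[:, 1]  # probability of class 1 (lived)
    fpr, tpr, _ = roc_curve(y_val, y_score)
    plt.plot(fpr, tpr,
             label=f'{name} (AUC = {roc_auc_score(y_val, y_score):.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()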

Conclusion about model selection: the XGBoost model produces the best results. If we had a larger dataset, the Logistic Regression model might be desirable because of its superior speed. The Random Forest is the least desirable model of the three.

Understanding our XGB Classifier Model

The partial dependence plot (PDP)

Animals that were spayed or neutered (fixed) had improved outcomes. Females that were not spayed did better than males that were not neutered. An unknown status decreased the predicted outcome significantly.

PDP for feature “Sex upon outcome”
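A sketch of producing this kind of plot, assuming the pdpbox library and the encoded validation frame; pdp_interact and pdp_interact_plot produce the two-feature interaction plots shown next:

from pdpbox.pdp import pdp_isolate, pdp_plot

feature = 'sex_upon_outcome'
isolated = pdp_isolate(
    model=model,  # the fitted XGB Classifier
    dataset=X_val_enc,
    model_features=X_val_enc.columns.tolist(),
    feature=feature,
)
pdp_plot(isolated, feature_name=feature)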

If the animal had a name at the time of outcome, it improved the animal's predicted outcome, by as much as 10%.

PDP interaction plot: does the animal have a name? × top breed

The younger the animal, the smaller the negative impact of not having a name on the predicted outcome.

PDP interaction plot: does the animal have a name? × age of animal (days)

Drilling down to specific observations

If we look at the average impact on model output magnitude across the whole transformed training data, we can see which features have the greatest overall impact (the sketch at the end of this section shows how these values can be computed).

For an observation representing a bat, the features for the year of birth and the date and time of the outcome have the most significant impact. The fact that the bat is a bat has a huge impact in lowering the score.

animal_type                         Other
breed                                 Bat
color                               Brown
date_of_birth         2014-09-09T00:00:00
datetime              2015-09-10T09:08:00
sex_upon_outcome                  unknown
Name_known                              0
age_upon_outcome_d                    366
sex                               unknown
year_birth                           2014
month_birth                             9
year_outcome                         2015
month_outcome                           9
day_outcome                            10
selected_breed                Non_dog_cat

For an observation representing a cat, the feature sex_upon_outcome = neutered male has the biggest positive magnitude; being only ~3 months of age and having a name like “Yoda” does not hurt either!

animal_type                           Cat
breed                   Domestic Longhair
color                         Brown Tabby
date_of_birth         2014-03-16T00:00:00
datetime              2014-06-20T18:22:00
sex_upon_outcome            neutered male
Name_known                              1
age_upon_outcome_d                     96
sex                                  male
year_birth                           2014
month_birth                            16
year_outcome                         2014
month_outcome                           6
day_outcome                            20
selected_breed                        Cat
Name: 21087, dtype: object

“Yoda” the cat: the features' average impact on prediction output magnitude. Here we can see more clearly which features have the greatest magnitude on Yoda's outcome, and being a Jedi is not one of them. We can also see that the per-observation feature magnitudes do not align 1:1 with the overall model's feature magnitudes.
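These impact plots read like SHAP summary and force plots; here is a sketch assuming the shap library's TreeExplainer on the fitted XGB model (the row selection is a placeholder, not the actual indices of the bat and Yoda):

import shap

shap.initjs()  # needed once per notebook for the interactive force plot
explainer = shap.TreeExplainer(model)

# Global view: mean |SHAP value| per feature over the encoded data.
shap_values = explainer.shap_values(X_val_enc)
shap.summary_plot(shap_values, X_val_enc, plot_type='bar')

# Single-observation view, e.g. Yoda's row (index is a placeholder).
row = X_val_enc.iloc[[0]]
shap.force_plot(explainer.expected_value, explainer.shap_values(row), row)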
