Provides an anomaly score for categorical and date time data
The Categorical Outlier package was designed specifically to detect outliers in categorical data. The project was built because there are no ready-to-use packages for detecting unusual patterns in categorical data; almost everything focuses on numerical features.
categoricaloutlier was built to provide a score for the outlier-ness of categorical features. It supports the following key features:
It learns from historical data to quantify how anomalous a new observation is. For an unseen value, a feature with high variance gets a lower score than a feature with low or zero variance.
This is the first package that targets outliers amongst categorical features as opposed to innumerable libraries for numerical features.
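The effect of variance on the score of an unseen value can be illustrated with a toy calculation. This is only a sketch: the function `unseen_value_score` and its formula are assumptions made for illustration and are not taken from the categoricaloutlier source.

```python
from collections import Counter


def unseen_value_score(history):
    """Toy score in [0, 1) for a value that never appears in `history`.

    Hypothetical formula, for illustration only:
    1 - (distinct values / observations).
    A column that rarely changes gives a score near 1; a highly varied
    column gives a score near 0.
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    distinct = len(Counter(history))
    return 1.0 - distinct / len(history)


# Low-variance feature: the same country appears in every record.
country_history = ["US"] * 10
# High-variance feature: almost every record has a different user agent.
agent_history = [f"agent-{i}" for i in range(9)] + ["agent-0"]

low_variance_score = unseen_value_score(country_history)
high_variance_score = unseen_value_score(agent_history)
```

A new country in a column that has only ever contained one value is far more surprising than a new user agent in a column where new values appear all the time, which is the ordering the two scores reflect.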
In the following paragraphs, I am going to describe how you can get and use categoricaloutlier for your own projects.
To download categoricaloutlier, either fork this GitHub repo or install it from PyPI with pip.
$ pip install categoricaloutlier
Categorical Outlier needs only a few simple commands to produce a score for outlier-ness. Start by importing the two classes:
from categoricaloutlier import TrainOutlier, PredictOutlier
And you are ready to go! At this point, I want to distinguish clearly between the AnomalyTrainer (imported above as TrainOutlier) and the AnomalyScorer (imported above as PredictOutlier).
The AnomalyTrainer class is used to train on the categorical and date time features of the historical data. It builds a baseline profile of the categorical features from that data.
It expects 4 parameters to train a model -
For date time columns, the current version uses the day of the week and the time of day to determine anomalies. Future versions may extend this support to other temporal features.
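As a rough sketch of what these two temporal features look like, the snippet below derives them from a timestamp with the standard library. The helper `temporal_features` is hypothetical and is not part of the package's API.

```python
from datetime import datetime

# Index 0 is Monday, matching datetime.weekday().
DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def temporal_features(timestamp):
    """Return (day of the week, hour of the day) for a datetime."""
    return DAY_NAMES[timestamp.weekday()], timestamp.hour


day_of_week, hour_of_day = temporal_features(datetime(2021, 3, 15, 14, 30))
```

Both derived values are categorical, so they can be profiled in the same way as any other categorical column.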
Categorical columns can also be 2-dimensional features. A 2-dimensional feature is derived by combining two categorical columns into one. This matters because in certain cases the combination of values is unusual even though each value looks normal on its own.
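A minimal sketch of deriving a 2-dimensional feature follows. The column names, the separator, and the helper `combine_columns` are assumptions for illustration; the package may build its combined features differently.

```python
from collections import Counter


def combine_columns(rows, first, second, sep="_"):
    """Return one combined categorical value per row."""
    return [f"{row[first]}{sep}{row[second]}" for row in rows]


rows = [
    {"country": "US", "currency": "USD"},
    {"country": "US", "currency": "USD"},
    {"country": "DE", "currency": "EUR"},
    {"country": "US", "currency": "EUR"},
]

combined = combine_columns(rows, "country", "currency")
# Count how often each pairing occurs in the history.
pair_counts = Counter(combined)
```

Here "US" and "EUR" each occur more than once in their own columns, yet the pairing "US_EUR" occurs only once, so only the combined feature reveals that the last row is unusual.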
Training a new model takes just two lines of code.
Let's create a new TrainOutlier object and initialize the required parameters. Ensure 'cat_cols' and 'datetime_cols' are lists, not arrays or any other sequence type; otherwise an exception is thrown.
at = TrainOutlier(95,cat_cols,datetime_cols)
Call the train function to train the model on the data:
at.train(df_train)
This trains the model on the data and gives an object of the AnomalyTrainer class. This object needs to be passed to the scorer to generate scores for a new observation.
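Because the trainer expects plain lists for its column arguments, it can help to normalise whatever sequence you have before constructing it. The helper `as_column_list` and the column names below are hypothetical and only illustrate the idea.

```python
def as_column_list(columns):
    """Return `columns` as a plain list of column names.

    A single name given as a string becomes a one-element list;
    tuples and other iterables are copied into a new list.
    """
    if isinstance(columns, str):
        return [columns]
    return list(columns)


# Example column names, assumed for illustration.
cat_cols = as_column_list(("country", "device"))
datetime_cols = as_column_list("created_at")
```

The resulting `cat_cols` and `datetime_cols` are ordinary Python lists, which is the form the training step described above asks for.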
The AnomalyScorer class is built to obtain the outlier-ness score for a new observation of the same data. It uses a sigmoid function to provide a score between 1 and 100, where 1 represents an observation most similar to the existing data and 100 represents one most dissimilar to it.
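To make the role of the sigmoid concrete, here is a sketch of how a raw anomaly value could be squashed and rescaled to a bounded range. The function names and the linear rescaling to 1..100 are assumptions; the exact mapping used by the package is not described here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with values in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scaled_score(raw, low=1.0, high=100.0):
    """Map any real-valued raw score into the range [low, high]."""
    return low + (high - low) * sigmoid(raw)


# Larger raw values mean the observation is more unusual.
scores = [scaled_score(raw) for raw in (-6.0, 0.0, 6.0)]
```

The sigmoid keeps the output bounded no matter how extreme the raw value is, and it preserves ordering, so a more unusual observation always receives a higher score.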
It expects 2 parameters to provide a score -
To predict a score, the AnomalyScorer uses the same categorical and date time columns that were used at training time. The test data must contain at least one observation.
Predicting a score takes one line of code:
outliers = PredictOutlier(at,test_data)
outliers.scores
The result is an object of class PredictOutlier, whose scores attribute holds one or more scores between 0 and 1 indicating the outlier-ness of the data.
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.