Statistical Methods for Computer Intrusion Detection

Masquerading User Data

We have collected a data set with seeded masquerading users to compare various intrusion detection methods. The data set is available for download below.

The data consist of 50 files, one per user. Each file contains 15,000 commands (audit data generated with acct). The first 5,000 commands for each user do not contain any masqueraders and are intended as training data. The next 10,000 commands can be thought of as 100 blocks of 100 commands each. They are seeded with masquerading users, i.e. with data from another user who is not among the 50 users.
At any given block after the initial 5000 commands a masquerade starts with a probability of 1%. If the previous block was a masquerade, the next block will also be a masquerade with a probability of 80%. About 5% of the test data contain masquerades.
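
The seeding rule above can be read as a two-state chain over blocks. The following is a minimal sketch of that reading, not the procedure actually used to build the data set. It assumes the 1% start probability applies whenever the previous block was clean, and the names simulate_blocks and long_run_fraction are illustrative only.

```python
import random

P_START = 0.01     # probability that a masquerade starts after a clean block
P_CONTINUE = 0.80  # probability that a masquerade continues into the next block


def simulate_blocks(n_blocks=100, rng=None):
    """Return one 0/1 flag per block (1 = masquerade) for a single user."""
    rng = rng or random.Random()
    flags = []
    previous = 0  # the training data is clean, so the chain starts clean
    for _ in range(n_blocks):
        p = P_CONTINUE if previous else P_START
        previous = 1 if rng.random() < p else 0
        flags.append(previous)
    return flags


def long_run_fraction(p_start=P_START, p_continue=P_CONTINUE):
    """Long-run share of masquerade blocks for the two-state chain."""
    return p_start / (p_start + (1.0 - p_continue))
```

Under this reading the long-run share of masquerade blocks is 0.01 / (0.01 + 0.20), roughly 4.8%, which is in line with the figure of about 5% quoted above.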

This data set is used in an article in Statistical Science (see publications on the left). For further information please consult this article or contact me.

Masquerade Data (ZIP file): uncompress using WinZip.
Masquerade Data (Unix): uncompress (gunzip) and de-tar (tar -x) the data set.
This contains 50 files, one for each of the 50 users. Each file contains 15,000 lines. Each line has one command.
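
As an illustration of the layout just described, one user's file can be split into training data and test blocks along these lines. This is a sketch only: the function names are hypothetical, and the file path is left to the caller because the file names inside the archive are not stated here.

```python
TRAIN_SIZE = 5000  # first 5,000 commands are masquerader-free training data
BLOCK_SIZE = 100   # the remaining commands are viewed as blocks of 100


def read_commands(path):
    """Read one command per line from a user file (path supplied by the caller)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def split_commands(commands):
    """Split one user's command list into training data and test blocks."""
    training = commands[:TRAIN_SIZE]
    test = commands[TRAIN_SIZE:]
    blocks = [test[i:i + BLOCK_SIZE] for i in range(0, len(test), BLOCK_SIZE)]
    return training, blocks
```

For a complete 15,000-line file, split_commands returns 5,000 training commands and 100 blocks of 100 commands each.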
Location of masquerades (Windows ASCII file)
This file contains 100 rows and 50 columns. Each column corresponds to one of the 50 users. Each row corresponds to a block of 100 commands, starting with command 5001 and ending with command 15000. The entries in the file are 0 or 1. 0 means that the corresponding 100 commands are not contaminated by a masquerader. 1 means they are contaminated.
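
The 0/1 matrix can be used to label each test block and to separate a method's block scores into clean and contaminated groups. The sketch below assumes the entries are separated by whitespace and that one score per block is available for the user; both are assumptions, and the function names are illustrative.

```python
def parse_locations(text):
    """Parse whitespace-separated 0/1 rows into a list of integer rows."""
    return [[int(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]


def labels_for_user(rows, user_index):
    """Return one user's column (0-based index): one 0/1 flag per test block."""
    return [row[user_index] for row in rows]


def split_scores(scores, labels):
    """Separate block scores into clean (x) and masquerade (y) lists."""
    x = [s for s, flag in zip(scores, labels) if flag == 0]
    y = [s for s, flag in zip(scores, labels) if flag == 1]
    return x, y
```

Because splitlines accepts both Unix and Windows line endings, the file can be read as ordinary text on either platform.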
Scores and Thresholds for intrusion methods (ZIP file)
This contains the scores and the thresholds of the individual detection algorithms used in the Statistical Science paper. An alarm is sounded when the score exceeds the corresponding threshold. Filenames that contain ".up." refer to algorithms that updated the thresholds and/or the algorithms based on earlier data. Filenames that contain ".noup." base their thresholds and scores only on the training data. The Statistical Science paper contains more details.
For example, the data can be used to recreate ROC curves to enable comparison with other methods.
Splus Function to create ROC curves
Two vectors called x and y are needed as input: x holds a method's scores when no intruder is present, and y holds a method's scores when an intruder is present. (The scores can be separated into x and y according to "Location of masquerades".)
The Splus function then considers all possible threshold values z (basically all unique values from x and y). For each threshold a missing alarm/false alarm tradeoff is obtained. Each tradeoff represents one point on the ROC curve. The actual thresholds generated by the individual methods are not needed to construct ROC curves.
This function is only useful when every user has the same threshold. When users have different thresholds these need to be subtracted from the scores so that all users have the same threshold (zero).
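
For readers without Splus, the construction described above can be approximated as follows. This is a Python sketch of the description, not a translation of the Splus function; the names roc_points and center_scores are hypothetical, and the alarm rule (score strictly greater than the threshold) follows the wording used for the scores and thresholds files.

```python
def roc_points(x, y):
    """Return (threshold, false_alarm_rate, missing_alarm_rate) tuples.

    x: scores for blocks without an intruder
    y: scores for blocks with an intruder
    An alarm is raised when a score exceeds the threshold.
    """
    points = []
    for z in sorted(set(x) | set(y)):
        false_alarm = sum(1 for s in x if s > z) / len(x)
        missing_alarm = sum(1 for s in y if s <= z) / len(y)
        points.append((z, false_alarm, missing_alarm))
    return points


def center_scores(scores, threshold):
    """Subtract a user's threshold so that the common threshold becomes zero."""
    return [s - threshold for s in scores]
```

When users have different thresholds, center_scores can be applied to each user's scores first, and the centered scores of all users pooled into x and y before calling roc_points. Since only observed scores are tried as thresholds, the endpoint with no missing alarms is not always among the returned points.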
Individual Figures from the Statistical Science Paper (ZIP file)
Here are individual graphs from the Statistical Science paper.

Here are some early theses and papers based on this data set:

Kwong Yung, "Update Algorithms for Masquerade Detection", Ph.D. thesis, Department of Statistics, Stanford University, 2003.
Ke Wang and Salvatore J. Stolfo, "One Class Training for Masquerade Detection", 3rd IEEE Conference Data Mining Workshop on Data Mining for Computer Security, Florida, November 19, 2003.
Roy A. Maxion and Tahlia N. Townsend, "Masquerade Detection Using Truncated Command Lines", International Conference on Dependable Systems & Networks, Washington, DC, 23-26 June 2002. https://ieeexplore.ieee.org/document/1028903
