Abstract
As machine-learning-assisted vulnerability detection research matures, it is critical to understand the datasets used by existing papers. In this paper, we explore 7 C/C++ datasets and evaluate their suitability for machine-learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files, a sample large enough to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform our analysis at the token level. We make three primary contributions. First, while all of the datasets differ from our Wild C dataset, some do so to a greater degree; the divergence appears in file lengths and in token usage frequencies. Additionally, none of the datasets covers the full C/C++ vocabulary, and the missing tokens account for up to 11% of all token usage. Second, we find that all of the datasets contain duplication, some to a significant degree. In the Juliet dataset, we describe augmentations of test cases that make the dataset susceptible to data leakage; this augmentation occurs so frequently that a random 80/20 split yields roughly 58% overlap between the test and training data. Finally, we collect and process Wild C, a large dataset of C/C++ code designed to serve as a representative sample of all C/C++ code; it is the basis for our analyses.
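The leakage mechanism described above can be illustrated with a small sketch (not the paper's actual pipeline; the corpus, function name, and duplication rate below are hypothetical): when duplicated samples are present, a naive random 80/20 split lets copies of the same content land on both sides of the split, so the measured test performance is partly memorization.

```python
import random

def split_overlap(samples, train_frac=0.8, seed=0):
    """Fraction of test samples whose content also occurs in the training set
    after a naive random split. Illustrative only; thresholds are assumptions."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train = set(shuffled[:cut])          # training contents, deduplicated
    test = shuffled[cut:]
    hits = sum(1 for s in test if s in train)
    return hits / len(test) if test else 0.0

# Toy corpus: 100 unique samples, each duplicated 3 times.
corpus = [f"sample-{i}" for i in range(100)] * 3

# Overlap is high because duplicates cross the train/test boundary.
print(split_overlap(corpus))
```

Deduplicating by content (or splitting by duplicate group) before the 80/20 split drives this overlap to zero, which is the remedy the duplication finding motivates.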
| Original language | English |
| --- | --- |
| Pages (from-to) | 36-53 |
| Number of pages | 18 |
| Journal | CEUR Workshop Proceedings |
| Volume | 3095 |
| State | Published - 2021 |
| Event | 2021 Conference on Applied Machine Learning in Information Security, CAMLIS 2021, Arlington, United States, Nov 4-5, 2021 |
ASJC Scopus Subject Areas
- General Computer Science
Disciplines
- Computer Sciences
- Engineering