The current dataset contains samples generated from 6 open-source projects, namely OpenSSL, FFmpeg, HTTPD, NGINX, Libtiff, and Libav.
For each project, there are 3 pickle.gz files, such as nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz, and nginx_labeler_0.pickle.gz, which are generated by two slightly different extractors (see the label_source field in the sample format description).
Each pickle.gz file contains compressed samples in JSON (e.g. auto_labeler_0.json).
The function print_sample() in read_pickled_samples.py reads the JSON objects from a pickle file and decodes the compressed static analysis output.
Note that by default, print_sample() only reads the first issue in each file, so it will display "1 issues loaded" even though there may be more issues in the file. You can comment out lines 36-37 (if cnt == 1: break) to load all issues in the file.
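As a rough illustration of the "first issue only" versus "all issues" behaviour described above, here is a minimal sketch. It is not the code in read_pickled_samples.py: the function name load_samples and its limit parameter are hypothetical, and it assumes each pickle.gz file is a gzip stream of consecutively pickled objects. As with any pickle data, only load files from a source you trust.

```python
import gzip
import pickle


def load_samples(path, limit=None):
    """Read pickled objects from a gzip file until EOF or until `limit` is reached.

    Assumption: the file holds one pickled object after another.
    """
    samples = []
    with gzip.open(path, "rb") as fp:
        while limit is None or len(samples) < limit:
            try:
                samples.append(pickle.load(fp))
            except EOFError:
                # No more pickled objects in the stream.
                break
    return samples
```

Calling load_samples(path, limit=1) would mirror the default behaviour of stopping after the first issue, while load_samples(path) would read everything in the file.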
We provide a global split file, splits.csv, which specifies the train, dev, and test sets:
id,split,project
httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1,dev,httpd
...
httpd_1bd8218a89d7b01a14f6172cacfe0e61bee86689_1,test,httpd
...
httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1,train,httpd
...
For example, the sample identified by httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1 belongs to the dev set.
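The lookup from sample id to split could be done along these lines. This is only a sketch that relies on the id,split,project header shown above; the helper name read_splits is hypothetical, and the in-memory CSV stands in for the real splits.csv.

```python
import csv
import io


def read_splits(fp):
    """Map each sample id to its split name (train, dev, or test)."""
    return {row["id"]: row["split"] for row in csv.DictReader(fp)}


# Two rows copied from the excerpt above, used in place of the real file.
demo = io.StringIO(
    "id,split,project\n"
    "httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1,dev,httpd\n"
    "httpd_1bd8218a89d7b01a14f6172cacfe0e61bee86689_1,test,httpd\n"
)
splits = read_splits(demo)
print(splits["httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1"])  # dev
```

For the real file, pass an open file handle instead, for example open("splits.csv", newline="").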
Note:
- The sample ids are unique.
- If you see sample ids like openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0 and openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1, it doesn't mean they are the same sample with conflicting labels. Instead, they are different samples: openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1 is an auto_labeler sample (see Sample Types) from openssl_labeler_1.pickle.gz, and openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0 is an after_fix_extractor sample from openssl_after_fix_extractor_0.pickle.gz. Basically, we took the positive auto-labeler sample openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1, extracted the corresponding functions from the after-fix version, produced the corresponding after-fix-extractor sample, and assigned 0 as its label.
- The splits.csv file doesn't have the sample types. The data preparation script will add the sample types in the output.
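Since splits.csv carries no sample type, one way to recover it is from the name of the file a sample was loaded from. The sketch below only illustrates that idea and is not necessarily how the data preparation script does it; the helper name sample_type_from_filename is hypothetical, and the matching relies solely on the file naming pattern shown in this document.

```python
import os


def sample_type_from_filename(path):
    """Infer the sample type from a pickle.gz file name.

    Assumption: names follow the <project>_labeler_<n>.pickle.gz and
    <project>_after_fix_extractor_<n>.pickle.gz patterns described above.
    """
    name = os.path.basename(path)
    if "_after_fix_extractor_" in name:
        return "after_fix_extractor"
    if "_labeler_" in name:
        return "auto_labeler"
    raise ValueError("unrecognized sample file name: " + name)
```

For example, sample_type_from_filename("openssl_labeler_1.pickle.gz") gives "auto_labeler", which matches the type of the _1 sample discussed in the note above.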
The example script split_data.py takes splits.csv (line 9) and the folder with the pickle.gz files (line 13) as input. It creates the inputs for BERT model training and testing in a folder specified at line 14.
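The core of such a preparation step is bucketing loaded samples by their assigned split. The following is a simplified sketch of that idea, not the logic of split_data.py itself: the function name group_by_split is hypothetical, and it assumes each sample is a dict whose "id" key matches the id column of splits.csv.

```python
from collections import defaultdict


def group_by_split(samples, id_to_split):
    """Bucket samples into train/dev/test using an id-to-split mapping.

    Assumption: every sample is a dict with an "id" key. Samples whose id
    is missing from the mapping are skipped.
    """
    groups = defaultdict(list)
    for sample in samples:
        split = id_to_split.get(sample["id"])
        if split is not None:
            groups[split].append(sample)
    return dict(groups)
```

Skipping unknown ids instead of raising an error is a design choice for this sketch; a stricter script might prefer to fail loudly so that mismatches between the pickle files and splits.csv are noticed early.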