Large Open-Source Data Set Released to Help Train Algorithms to Spot Malware

For the first time, a security firm has released a large data set to support AI research and the training of machine learning models that statically detect malware. The data set, released by cybersecurity firm Endgame and called EMBER, is a collection of more than a million representations of benign and malicious Windows portable executable (PE) files. Hyrum Anderson, Endgame’s technical director of data science, who worked on EMBER, says: “This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open and general enough to cover several interesting use cases. ... [We] hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.”

Researchers involved with EMBER say they have thought through the liability that comes with making such a data set openly available, and they hope the benefits of openness will outweigh the risks.

