Large Open-Source Data Set Released to Help Train Algorithms Spot Malware

By CircleID Reporter
April 19, 2018, 7:20 pm PDT

For the first time, a security firm has released a large dataset to support AI research and the training of machine learning models that statically detect malware. The dataset, released by cybersecurity firm Endgame and called EMBER, is a collection of more than a million representations of benign and malicious Windows portable executable (PE) files. Hyrum Anderson, Endgame’s technical director of data science, who worked on EMBER, says: “This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open and general enough to cover several interesting use cases. ... [We] hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.”

Researchers involved with EMBER say they have thought through the liability that comes with making such open datasets available, and they hope the benefits of openness will outweigh the risks.
