[HCOfficial] MalDet: An Anomaly-Statistics Based PE Malware Detector

lady_godiva · 08-14-2014, 03:42 PM

I have a question. How did you assign probabilities? I mean, did you follow some Machine Learning approach, analyzing different samples and then using a classifier? Or did you use some other approach?

I'm sorry, but I haven't had time to look at the code yet; reading it would probably give me an answer.

Deque · 08-15-2014, 08:54 PM

(08-14-2014, 03:42 PM)lady_godiva Wrote: I have a question. How did you assign probabilities? I mean, did you follow some Machine Learning approach, analyzing different samples and then using a classifier? Or did you use some other approach?

I'm sorry, but I haven't had time to look at the code yet; reading it would probably give me an answer.

No classifier, no machine learning, just plain statistics and probability theory. I collected statistical information and, based on that, used Bayes' Theorem to calculate the conditional probability of a file being malicious. I can explain the details to you; however, I am not sure if my approach is scientifically correct. I had a discussion with a professor in this field, who told me he would try to help me, but I haven't gotten any info so far. It seems that this isn't as easy as I thought it would be. I have made assumptions, e.g. the independence of the probabilities that certain anomalies occur, which are probably not correct.

So, basically, I created something that is good enough to work in practice, but the scientific explanation is not yet sufficient.
It is part of my master's thesis, and I would like to ask you to wait until December for more details. My master's thesis will be public after my graduation.
Thank you for your interest in my work.

Edit: Btw, the code that does the main calculation is just a few lines. I added comments to explain the details for you.

Code (Scala):
  /**
   * Calculates the probability for a file to be malicious based on the
   * anomalies found in the file.
   *
   * @return probability P(BAD|Anomalies)
   */
  def malwareProbability(): Double = {
    val subtypes = anomalies.map(a => a.subtype).distinct

    // fetch the probabilities for every anomaly subtype
    // such a probability contains two values bad and good
    // bad == probability of a malicious file to have that single anomaly
    // good == probability of a harmless file to have that single anomaly
    val probs = subtypes.map(subtype => probabilities.get(subtype)).flatten

    // allBad == probability for a malicious file to have all anomalies that were found
    // this is the probability P(Anomalies | BAD)
    val allBad = probs.foldRight(1.0) { (p, bad) => p.bad * bad }

    // allGood == probability for a harmless file to have all anomalies that were found
    // this is the probability P(Anomalies | GOOD)
    val allGood = probs.foldRight(1.0) { (p, good) => p.good * good }

    // calculates the probability for the file to be malicious with Bayes' theorem
    // this is the probability P(BAD | Anomalies)
    val bayes = allBad * 0.5 / (allGood * 0.5 + allBad * 0.5)
    bayes
  }

The real legwork was detection and collection of file anomalies.
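
The snippet above is an excerpt and relies on `anomalies` and `probabilities` defined elsewhere. For anyone who wants to run the calculation in isolation, here is a minimal self-contained sketch. The `AnomalyProb` type, the subtype names, all numbers, and the `prior` parameter are assumptions made up for illustration; none of them are taken from MalDet.

```scala
object BayesSketch {
  // Hypothetical stand-in for the per-anomaly statistics:
  // bad  == P(anomaly | BAD), good == P(anomaly | GOOD)
  final case class AnomalyProb(bad: Double, good: Double)

  // Invented subtype names and values, for illustration only.
  val probabilities: Map[String, AnomalyProb] = Map(
    "SECTION_NAME_UNUSUAL"        -> AnomalyProb(bad = 0.6, good = 0.1),
    "ENTRY_POINT_IN_LAST_SECTION" -> AnomalyProb(bad = 0.3, good = 0.02)
  )

  /** P(BAD | anomalies), treating the anomalies as independent within each class. */
  def malwareProbability(subtypes: Seq[String], prior: Double = 0.5): Double = {
    // Count each subtype once and skip subtypes without statistics.
    val probs = subtypes.distinct.flatMap(s => probabilities.get(s))
    val allBad = probs.map(_.bad).product   // P(Anomalies | BAD)
    val allGood = probs.map(_.good).product // P(Anomalies | GOOD)
    allBad * prior / (allBad * prior + allGood * (1 - prior))
  }

  def main(args: Array[String]): Unit = {
    val found = Seq("SECTION_NAME_UNUSUAL", "ENTRY_POINT_IN_LAST_SECTION")
    println(malwareProbability(found))
  }
}
```

With no known anomalies both products are 1.0, so the result falls back to the prior of 0.5, which matches the behaviour of the original excerpt.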

lady_godiva · 08-16-2014, 12:18 PM

(08-15-2014, 08:54 PM)Deque Wrote: I can explain the details to you; however, I am not sure if my approach is scientifically correct. I had a discussion with a professor in this field, who told me he would try to help me, but I haven't gotten any info so far. It seems that this isn't as easy as I thought it would be. I have made assumptions, e.g. the independence of the probabilities that certain anomalies occur, which are probably not correct.

So, basically, I created something that is good enough to work in practice, but the scientific explanation is not yet sufficient.
It is part of my master's thesis, and I would like to ask you to wait until December for more details.

Thanks for the answer, I'm looking forward to reading your thesis once you're done.

If I can give you my 2 cents: you said you made an assumption regarding the independence of the probabilities that certain anomalies occur. This assumption is correct for the Naive Bayes classifier, which is not a Bayesian method but is still very similar to Bayes' Theorem. Considering the anomalies independent is correct in my opinion.
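
To spell out where the independence assumption enters, here is a short derivation. The notation is an assumption of this sketch: $A_1,\dots,A_n$ are the anomalies found in a file, and the equal priors $P(\mathrm{BAD}) = P(\mathrm{GOOD}) = 0.5$ are the ones hard-coded in the posted code.

```latex
\begin{align*}
P(\mathrm{BAD} \mid A_1,\dots,A_n)
  &= \frac{P(A_1,\dots,A_n \mid \mathrm{BAD})\,P(\mathrm{BAD})}
          {P(A_1,\dots,A_n \mid \mathrm{BAD})\,P(\mathrm{BAD})
           + P(A_1,\dots,A_n \mid \mathrm{GOOD})\,P(\mathrm{GOOD})} \\[4pt]
  &= \frac{\prod_{i=1}^{n} P(A_i \mid \mathrm{BAD})}
          {\prod_{i=1}^{n} P(A_i \mid \mathrm{BAD})
           + \prod_{i=1}^{n} P(A_i \mid \mathrm{GOOD})}
\end{align*}
```

The first line is Bayes' Theorem as is. The second line needs two things: the anomalies must be independent within each class, so that the joint probabilities factor into products (this is exactly the Naive Bayes assumption), and the priors must be equal, so that the factor 0.5 cancels.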

So far, the scientific approach is correct; everything you did makes perfect sense. Of course, it must have been a pain to assign probabilities to each anomaly (this is why I was asking about machine learning, as it would have made things easier), but once you have those probabilities and can recognize the anomalies in PE files, then it's all about calculations.
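
As an illustration of what "assigning probabilities" could look like without a classifier, here is a minimal sketch that estimates them as relative frequencies over two labeled sample sets. This is an assumption about one plausible approach, not a description of how MalDet collects its statistics; `FrequencySketch`, `AnomalyProb`, and the sample representation (one set of anomaly subtypes per file) are all hypothetical.

```scala
object FrequencySketch {
  // bad == P(anomaly | BAD), good == P(anomaly | GOOD)
  final case class AnomalyProb(bad: Double, good: Double)

  /** Fraction of the given files that show the anomaly subtype. */
  def frequency(samples: Seq[Set[String]], subtype: String): Double = {
    require(samples.nonEmpty, "need at least one sample")
    samples.count(_.contains(subtype)).toDouble / samples.size
  }

  /** Estimates both conditional probabilities for every subtype seen in either set. */
  def estimate(badSamples: Seq[Set[String]],
               goodSamples: Seq[Set[String]]): Map[String, AnomalyProb] = {
    val subtypes = (badSamples ++ goodSamples).flatten.distinct
    subtypes.map { s =>
      s -> AnomalyProb(frequency(badSamples, s), frequency(goodSamples, s))
    }.toMap
  }
}
```

Note that a subtype never observed in one of the two sets gets a probability of 0 there, which forces the product for that class to 0; a real implementation would need to decide how to handle that case.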

I'm confident that you can make it on your own, I have no doubt about that. Anyway, I recently did similar work on Machine Learning techniques for malware detection on Android devices, so if I can be of any help, don't hesitate to write to me!