NDSS'21 Summer Paper #121 Reviews and Comments
===========================================================================
Paper #121 Peeler: Profiling Kernel-Level Events to Detect Ransomware


Review #121A
===========================================================================

Overall Recommendation
----------------------
2. Leaning towards reject

Writing Quality
---------------
4. Well-written

Reviewer Confidence
-------------------
3. Sufficient confidence

Paper Summary
-------------
This paper presents a ransomware detection system called Peeler based on monitoring kernel-level events.  Peeler combines rule-based methods for crypto ransomware and machine learning models for screen-lock ransomware to do the detection.  The evaluation shows that Peeler can achieve 99% detection rate with low system overhead.

Strengths
---------
+ Extensive study across different families/types of ransomware
+ The paper is well-written and easy to follow
+ Promising evaluation results

Weaknesses
----------
- Unclear if ransomware can adapt to the detection approaches here to bypass Peeler
- Unclear key insights that make Peeler significantly better than existing ransomware detection tools

Detailed Comments for Authors
-----------------------------
My biggest concern with Peeler is that ransomware may be able to adapt to the detection approaches presented in this paper and bypass the detection.  I can think of at least three possible ways to achieve this:

First, Peeler detects screen-locker ransomware based on the excessive number of spawned child processes, and new screen-locker ransomware could spawn fewer processes by "squeezing" multiple workloads into one process.  This may impact the performance of the ransomware, but it can bypass Peeler's detection.

Second, Peeler ignored the file I/O operations on temporary files based on the file extensions (e.g., bak, tmp), and new ransomware can simply rename the files to those ignored file extensions, and then perform the encryption on them.
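
To make this second concern concrete, here is a minimal sketch of an extension-based filter. The names `IGNORED_EXTENSIONS` and `is_monitored` are hypothetical and not taken from the paper; only the `bak` and `tmp` extensions come from the text above. It illustrates the concern and does not reproduce Peeler's actual filter.

```python
from pathlib import PurePath

# Hypothetical ignore list; only "bak" and "tmp" are named in the review.
IGNORED_EXTENSIONS = {".bak", ".tmp"}

def is_monitored(path: str) -> bool:
    """Return True if file I/O on `path` would be inspected by the filter."""
    return PurePath(path).suffix.lower() not in IGNORED_EXTENSIONS

original = "report.docx"
renamed = "report.docx.tmp"   # same file after a rename to an ignored extension

print(is_monitored(original))  # True
print(is_monitored(renamed))   # False
```

If the filter keys only on the final extension, the verdict flips as soon as the file is renamed, regardless of what is done to its contents afterwards.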

Third, Peeler observed that benign crypto/compression software (e.g., ZipExtractor and BreeZip) doesn't restrict access to files via encryption while ransomware does, and used it as a signal to distinguish benign apps from ransomware.  However, new ransomware can simply delay the deletion of the original files to bypass the detection.

Peeler uses the number of spawned child processes to detect screen-locker ransomware.  But benign applications are likely to exhibit similar behavior: a web server can spawn many child processes depending on the workload.  I am not convinced that the number of child processes is a good characteristic for distinguishing screen-locker ransomware from benign applications.  Besides, I am not sure I understand the purpose of screen-locker ransomware: wouldn't a user be able to recover all her files by booting from a different operating system and mounting the disk?

For the evaluation of detection time, what is the system workload like in the experiments?  If the ransomware is running in parallel to many other irrelevant system workloads, would it dramatically affect the detection time?

It is unclear to me what the key insights behind Peeler are that make it more effective than other rule-based methods (e.g., reduced detection time).  These are important details for the readers to understand your contributions.


Review #121B
===========================================================================

Overall Recommendation
----------------------
2. Leaning towards reject

Writing Quality
---------------
3. Adequate

Reviewer Confidence
-------------------
4. High confidence

Paper Summary
-------------
Similar to Unveil and a number of other existing dynamic ransomware detection systems, Peeler collects a number of signals (I/O operations and other information provided by Windows already). Subsequently, Peeler uses a set of 4 hand-crafted regular expressions on I/O traces, and two machine learning classifiers to identify ransomware. The paper claims that screen locker ransomware and crypto-based ransomware must be treated with the same urgency. An evaluation on 43 malware families (most with exactly 1 specimen) shows that Peeler performs well.

Strengths
---------
- Ransomware is an important threat

Weaknesses
----------
- No novelty over prior work (other than faster detection, but importance thereof is not established)
- Labeling of sample set is insufficiently explained, with most families having exactly 1 sample, inconsistent AV labels are a potential pitfall
- Internally inconsistent w.r.t. the importance (or lack thereof) of multiple samples per family (i.e., experiment with unknown samples uses samples from KNOWN families)

Detailed Comments for Authors
-----------------------------
This paper correctly recognizes the consistently high importance the threat of ransomware poses to our computing infrastructure. As such, advancing the state of the art in that domain is highly relevant for a venue such as NDSS.

Unfortunately, however, it cannot be said that Peeler advances the state of the art substantively. There are two key differences with respect to prior work: (i) detecting crypto-ransomware and screen-lockers at the same time, and (ii) faster detection than prior schemes. Unfortunately, neither of these aspects is shown to be necessary.

(i) Unveil already detects crypto-ransomware and screen-lockers at the same time. The paper claims that Peeler is different because the detection is in one system rather than in the two components that make up the Unveil system. Given that the two detection approaches (i.e., I/O traces and ML classifiers) are entirely orthogonal in Peeler, this is a distinction without a difference. Very importantly, the paper incorrectly and without substantiating argument suggests that the detection of screen-lockers is equally important to detecting crypto-ransomware. Clearly, that stance is incorrect. Instead, prior work mainly focused on crypto-ransomware because (technically) recovering from screen-lockers is wholly trivial. No files are lost or encrypted and hence files can simply be copied out from infected drives. This is a fundamental difference between these two categories of ransomware and likely the reason why crypto-ransomware is preferred by attackers too.

(ii) While Peeler is said to detect crypto-ransomware after the first encrypted file, the paper unfortunately does not establish that losing the 9 files that CryptoLock or REDFISH would allow, or the 4 files that Redemption would allow, is crucial. While intuitively losing fewer files is better, this is insufficient to warrant another paper that relies on the exact same tools and techniques used by prior work.

Furthermore, the paper must improve the description of the screen-locker threats. First, the paper does not explain why screen-lockers HAVE TO spawn a lot of processes. No doubt they currently do, but there does not seem to be a reason why they have to. According to the Unveil paper there are merely a few API calls that a locker has to make to lock the screen --- certainly nothing that requires multiple processes. Also, claiming that screen-lockers get detected fast (16s) sometimes even before they do lock the screen (around 5 minutes) requires better explanation of what these samples actually do. Similar to the many processes, there is no apparent reason why a screen-locker would NEED 5 minutes to lock the screen. Current samples might take that time (likely to do other things in the meantime), but again there is no explanation of why that is NEEDED.

Additionally, the paper calls for the regular expressions to be kept confidential in a real-world deployment. Of course, were a system such as Peeler to become popular, there is no realistic way this confidentiality could be preserved. That is precisely the reason why no real system has such requirements. Furthermore, the regular expressions are trivial to evade, even in the absence of spurious read/write activities. A malware could simply memory map the destination file and then memcpy() the new contents there. The operating system would not see any write operation in that case and Peeler would be entirely blind (the same holds true for read operations).
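
The memory-mapping point can be illustrated with a minimal sketch that uses only the Python standard library. The helper name `overwrite_via_mmap` and the temporary file are assumptions made for illustration. The sketch shows only that the application changes the file without issuing an explicit write call; whether a particular kernel-event provider reports the resulting page write-back is not something it establishes.

```python
import mmap
import os
import tempfile

def overwrite_via_mmap(path: str, payload: bytes) -> None:
    """Overwrite the start of an existing, non-empty file through a mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the whole file
            m[:len(payload)] = payload        # plain memory store, no write() call
            m.flush()                         # ask the OS to write the pages back

# Demo on a throwaway temporary file.
fd, path = tempfile.mkstemp()
os.write(fd, b"original contents")
os.close(fd)

overwrite_via_mmap(path, b"REPLACED")

with open(path, "rb") as f:
    data = f.read()
os.remove(path)

print(data)  # b'REPLACED contents'
```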

Regarding samples, the paper has two important shortcomings. First, Table VI lists 34 crypto-ransomware families, with the vast majority containing exactly one specimen. Unfortunately, the paper does not explain how the labels for these samples were derived. This is concerning because inconsistent labels between AV vendors and even within the same vendor can easily lead to the inflation of the number of families. At the very least the paper should rely on AVClass to derive labels, and publish the cryptographic checksums of all samples. Second, the paper claims and agrees with [16] that multiple samples per family are not necessary to detect other samples of the same family. However, the paper then goes on to test a number of previously unseen samples against the pre-trained classifier. Given that the new samples are from the same families as those used for training, it is entirely unsurprising that Peeler would detect other specimens of known families.
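
Publishing checksums needs very little tooling. A minimal sketch using Python's standard `hashlib` follows; SHA-256 is an assumed choice of hash, and the helper name `sha256_of_file` is hypothetical.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Listing one such digest per sample next to its family label would let readers check the labels independently.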

The final death knell in this paper is the claim that correlation coefficients can be used to establish causation. The paper should be augmented with citations for such claims before it can be considered for publication.
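
One reason a correlation coefficient alone cannot establish causation is that it is symmetric in its two arguments and so carries no direction. The sketch below uses made-up event counts (`reads` and `writes` are hypothetical, not data from the paper) and a hand-written Pearson coefficient.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical per-window event counts, for illustration only.
reads = [1, 2, 3, 4, 5]
writes = [2, 4, 6, 8, 10]

# Swapping the arguments gives the same value, so the coefficient
# cannot say which event type drives the other.
print(math.isclose(pearson(reads, writes), pearson(writes, reads)))  # True
```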


Review #121C
===========================================================================

Overall Recommendation
----------------------
2. Leaning towards reject

Writing Quality
---------------
3. Adequate

Reviewer Confidence
-------------------
3. Sufficient confidence

Paper Summary
-------------
The authors present a multi-stage ransomware detection mechanism based on file I/O pattern matching and statistical learning over distributional process behaviors. Common File I/O characteristics are identified through analysis of different encryption ransomware, while process tree complexity and syscall event correlations are used as the basis for an SVM feature vector. The authors report a 99% detection rate against 43 ransomware families.

Strengths
---------
- Considers important real-world problem
- Distills insights from behavior of known ransomware families

Weaknesses
----------
- Approach appears less resilient to adaptive attacks than prior work
- Multi-stage approach obscures the value of individual components
- Unclear conceptual contribution
- Unbalanced data

Detailed Comments for Authors
-----------------------------
The authors identify several approaches based on syscall monitoring to detect ransomware, such as correlations between File I/O patterns and process tree size. Regardless of whether these metrics are novel, they certainly seem less resilient to evasion than prior work. An attacker could trivially condense their process tree to evade detection; definitionally, they have total control over that process space. The correlation coefficients for different syscalls could also be trivially disrupted by the attacker. The authors use brittle regular expressions to encode the suspicious file activities that can be thwarted by the insertion of chaff events (e.g., randomly renaming the file).
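
The brittleness of sequence regexes under chaff can be sketched as follows. The single-letter event encoding and both patterns are hypothetical, chosen only for illustration; they are not the paper's expressions.

```python
import re

# Hypothetical single-letter encoding of per-file events:
#   R = read, W = write, D = delete, N = rename.
STRICT = re.compile(r"R+W+D")        # read(s), then write(s), then delete
TOLERANT = re.compile(r"R+N*W+N*D")  # same, but allows renames in between

def matches(trace: str, pattern: re.Pattern) -> bool:
    """Return True if the pattern occurs anywhere in the event trace."""
    return pattern.search(trace) is not None

plain = "RRWWD"
chaffed = "RRNWWND"   # same activity with rename events interleaved

print(matches(plain, STRICT))      # True
print(matches(chaffed, STRICT))    # False
print(matches(chaffed, TOLERANT))  # True
```

Tolerating one kind of chaff restores the match in this toy case, but an attacker can interleave other event types, so each such patch only moves the problem.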

The ML-based component of Peeler is proposed primarily to handle screen locking ransomware that can't be matched against the file pattern regexes. See above for my comments about the process tree. It isn't clear to me how the event correlation feature vector is being built for the SVM classifier. Are these also filtered first by object, or are the correlations being analyzed on the syscall ID without any consideration for the arguments? More detail is needed here, but regardless it appears that an attacker could distort these statistical properties with relative ease.

The dataset is dominated by just a few malware families, calling into question the validity of the authors' experimental findings. 2 families comprise 52% of the dataset and 5 families comprise 75% of the dataset. The authors train on 20% of their data (seemingly selected at random by sample), making it likely that the classifier had an opportunity to train on samples from the 5 families that dominate the dataset. The authors do not report on performance for previously-unseen malware samples.
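
One way to address this is to hold out whole families rather than individual samples. The sketch below is a suggestion under stated assumptions, not the authors' procedure; the function name `split_by_family`, its parameters, and the input format are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_family(samples, test_fraction=0.2, seed=0):
    """Split (sample_id, family) pairs so that no family spans both sets."""
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    families = sorted(by_family)               # sort for a deterministic order
    random.Random(seed).shuffle(families)      # seeded shuffle for repeatability
    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train = [s for s, fam in samples if fam not in test_families]
    test = [s for s, fam in samples if fam in test_families]
    return train, test, test_families
```

With such a split, every test sample comes from a family the classifier never saw during training, which is the setting the paper does not evaluate.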

The writing places significant emphasis on Peeler being a kernel-level detection mechanism, which is odd because many (most?) past approaches to ransomware detection are based in the kernel, e.g., CryptoDrop.

Nits:
- Please reduce the size of your images. I was reviewing this submission on an iPad Pro and it took a minute for each page to render as I advanced through the paper. Figure 7, for example, plots thousands of overlapping points even though they're completely illegible.
- You keep referring to the event correlations as causalities; shouldn't they be called correlations?