===========================================================================
S&P 2019 Review #199A
---------------------------------------------------------------------------
Paper #199: Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?
---------------------------------------------------------------------------

Overall merit: 2. Reject - This paper is well below the bar for Oakland. Will argue to reject it.

===== Brief paper summary (2-3 sentences) =====

The paper evaluates free Android security tools against the curated Ghera data set, which covers many known vulnerabilities in Android apps. The authors devote considerable attention to arguing that the results are representative across the Android app ecosystem.

===== Strengths =====

The authors identified and evaluated a set of free Android security tools that can be applied to Android apps without requiring significant configuration.

===== Weaknesses =====

The paper lacks clear scientific contributions. A significant amount of space is devoted to arguing that Ghera is representative of all vulnerabilities in the Android ecosystem, but the argumentation does not seem completely sound.

===== Detailed comments for the author(s) =====

A study of Android security tools against ground truth is interesting and would be a valuable analysis for anyone looking to understand how well these tools work against known vulnerabilities.

The authors cover two different problem areas: vulnerabilities in Android apps as well as malicious behavior from Android apps. These problem areas seem fairly unrelated to each other, and the paper would benefit from a more detailed explanation of why both are included. It seems that malicious behavior most commonly occurs because the developer intentionally included it in the app. If the paper is meant to cover a different problem scenario, e.g., the inclusion of third-party libraries that may hide malicious behavior, it would be helpful to address that explicitly.

The section on the representativeness of Ghera is quite long. However, it does not really address the core question, i.e., whether the vulnerabilities included in Ghera are representative of vulnerabilities found in real Android apps. The paper assumes that API usage is a good proxy for vulnerabilities. I don't believe that to be the case, and as a result the conclusions derived from the API analysis do not seem convincing enough to answer the original question about the vulnerabilities included in Ghera. The overall discussion of APIs was also somewhat confusing, as the paper does not include a concise definition of what is considered an API.

When evaluating the security tools, the paper makes a number of quite big assumptions. One of them is that the tools should work without any configuration. Why was that a good assumption to make? And does that assumption align with how the studied security tools are supposed to be used?

The paper reads a little like a collection of steps the authors carried out in their evaluation. This takes a lot of space without conveying deep insights about the problem space. It would have been more helpful to look into the individual security tools in more detail to better understand why they did or did not detect the various vulnerabilities. Without such an analysis, it is not clear that much can be learned from the current study.
The summary finding is essentially that existing security tools don't find all vulnerabilities. In some sense that seems like a foregone conclusion. However, it would be very interesting to understand in more detail why that is the case. Perhaps that could be an avenue to explore for improving this paper.

===========================================================================
S&P 2019 Review #199B
---------------------------------------------------------------------------
Paper #199: Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?
---------------------------------------------------------------------------

Overall merit: 3. Weak reject - The paper has flaws, but I will not argue against it.

===== Brief paper summary (2-3 sentences) =====

The paper evaluates a list of Android app vulnerability detection tools on how effective they are at identifying known vulnerabilities. For this, the authors use the Ghera benchmark, which includes Android applications with known vulnerabilities. Before presenting the results of their evaluation, the authors provide an extensive analysis of the representativeness of the Ghera benchmark; they find that 50% of the real-world applications use the APIs used by the applications in the Ghera benchmark. The evaluation of 28 free tools yields a number of findings, such as that vulnerability detection tools are not able to identify the classes of vulnerabilities they claim to detect.

===== Strengths =====

- The authors invest a considerable amount of engineering work in demonstrating the representativeness of the Ghera benchmark.
- The main goal of studying the current state of real-world vulnerability detectors is interesting.

===== Weaknesses =====

- The evaluation of the vulnerability detectors is kept very short and is incomplete.
- The evaluation of the malicious behavior detectors is incomplete and not relevant to the paper.

===== Detailed comments for the author(s) =====

- The main motivation of the paper is that the current state of vulnerability detection tools available to Android app developers is an understudied phenomenon that needs further exploration. Academia has been working on sophisticated techniques to identify known and unknown vulnerabilities in Android apps. However, researchers haven't adequately looked at how widely their solutions are adopted by real-world applications and how far reality is from what is discussed in the academic community. Although I don't think a researcher who has been contributing to the topic is also responsible for making sure the outcome is exploited in real life, obtaining knowledge about the current state of vulnerability detectors out there is important. Therefore, I see value in this paper. That being said, I don't think it is fair to criticize academics for comparing their new solutions against existing academic work while ignoring the tools available in app stores, or for not providing practical solutions that are easy for app developers to understand. The goal of a researcher is to improve the state of the art and compare against the most current, most efficient, and most effective technique proposed earlier, not against simplistic techniques that focus only on the detection of known vulnerabilities.
  I suggest that the authors remove the parts of the paper where they criticize the existing works [4] and [5] for these reasons and instead focus on a discussion of how the security of mobile applications could be improved if these tools were designed to be friendlier, easier to configure, and to behave as promised in their documentation.

- The authors criticize existing works because they do not compare themselves with real-world tools; however, this paper does the same in reverse. Wouldn't this study have been more complete if the authors had also evaluated academic solutions on the same benchmark and touched on why these methods are not adopted in practice and what we should do to reach the practitioner community better?

- I appreciated the section where the authors discuss the representativeness of the vulnerabilities in the Ghera benchmark, even though I found it overly long. However, as the results show, the Ghera benchmark does not have the best coverage of all existing vulnerabilities, and this makes the outcome of this discussion questionable. To be more precise, 8 of the vulnerability detection tools claimed to identify other vulnerabilities that are not found in the Ghera benchmark. It is not accurate to put these tools into the category of tools that cannot detect vulnerabilities, as they are designed for other purposes.

- The authors invested the majority of their time in proving the representativeness of the Ghera benchmark and therefore did not spend enough time on the evaluation of the tools, which is supposed to be the core goal of the paper. To simplify the evaluation and produce results in a short period, they had to filter out many tools for various reasons, such as having complicated configurations, being commercial, etc. In the beginning, the authors claim to have considered a large list of Android vulnerability scanners; however, in the end only 28 of them were evaluated on the Ghera data set, and some of those identify other vulnerabilities that are not present in the benchmark. In particular, the authors evaluate only free tools. I believe the study would have been more complete if the commercial tools were also evaluated. I understand that one of the goals of the paper is to evaluate these tools from the perspective of usability and ease of configuration; however, a more security-aware developer might prefer to pay if the commercial tools are more effective.

- The authors do not articulate the novelty of their findings. For example, did we know that most of the tools are not even able to identify known vulnerabilities? This is interesting. However, the authors do not discuss the details of the vulnerabilities that are identified or missed. Which kinds of vulnerabilities are generally missed by the tools? Do the tools generally identify the same set of vulnerabilities or different ones?

- After all, the authors intentionally picked tools that perform shallow analysis and do not require additional input from developers, such as details about the source code or code annotations. Therefore, I am not surprised by the low true positive rate. To be fairer to the existing, more sophisticated tools, the authors could implement an application with interesting existing vulnerabilities, provide the required input for these more sophisticated tools, and evaluate them.
  Or, if the source code of the apps in the benchmark is available, they could analyze some of them and provide the required information.

- To summarize, I in general liked the paper but found it incomplete. I don't see why it wasn't possible for the authors to complete the items in the future work section of the paper. Adding some commercial tools and more sophisticated tools to the analysis and presenting results with them would make the paper stronger.

Some additional comments and questions:

- "While software development community has recently realized the importance of security, developer awareness about how security issues transpire and how to avoid them is still lacking": the study of the effectiveness of automatic vulnerability scanners is not at all a novel topic, and this motivating sentence is not accurate. The software development community is not new to the concept of security, and neither is the mobile software community.

- The paper evaluates 64 vulnerability and malicious behavior detection tools for Android apps. Why did the authors include the malicious behavior detection tools in their analysis as well? This is largely irrelevant to the goal of the paper.

- Why did the authors only use the lean benchmarks of Ghera, excluding the real-world apps with vulnerabilities? Wouldn't it be more realistic to test these vulnerability tools with real applications rather than stripped-down apps?

- Why did the authors make the distinction between relevant and security-relevant APIs?

Nitpicks:

- "For this tools evaluation to be useful to tool users, tool developers, and researchers" -- fix this.
- "a weak proxy for release date of apps." - a weak approximation ...

===========================================================================
S&P 2019 Review #199C
---------------------------------------------------------------------------
Paper #199: Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?
---------------------------------------------------------------------------

Overall merit: 3. Weak reject - The paper has flaws, but I will not argue against it.

===== Brief paper summary (2-3 sentences) =====

In this paper, the authors evaluate the effectiveness of vulnerability (and malicious behavior) detection tools for Android apps. The authors considered 64 security analysis tools and empirically evaluated 19 of them (14 vulnerability detection tools and 5 malicious behavior detection tools) against 42 known vulnerabilities captured by benchmarks in the Ghera repository. They report their findings and try to provide some insights.

===== Strengths =====

- A lot of work has been published in the domain of Android app security, but this is still an important area. The paper addresses an interesting problem, I think.
- The paper is quite easy to understand and follow.

===== Weaknesses =====

- This is the first time I am reading an anonymized paper where parts of the paper have been blacked out. It feels like the paper was anonymized by simply censoring certain parts of it. This actually makes the paper more difficult to understand, and it raises questions. For example, what is the answer to why the evaluation took more than a year? Why has that part been partially censored?

- As the authors state, their selection of tools and the configurations could easily have been biased by their preferences and their know-how.
  Starting out with 64 tools and bringing the number down to 19 takes away quite a bit from the value of the evaluation and the study.

- Some of the insights distilled by the authors are a little superficial. For example, the claim "there exists a gap between the claimed capabilities and the observed capabilities of tools that could lead to vulnerabilities in apps" is generally not surprising in the domain of security. Similarly, the reported insight "most tools prefer to report only valid vulnerabilities or most tools can only detect specific manifestations of vulnerabilities" is also not very deep, and I am not sure what we can learn from it.

===== Detailed comments for the author(s) =====

All in all, I think this is not a bad paper, but compared to everything else I've reviewed at Oakland, I am not sure it makes the bar. The insights the authors have gained are limited, and it is clear that the evaluation might have significant biases introduced by the authors (e.g., how the tools have been configured, a misunderstanding of documentation, etc.).

I think the anonymization needs significant work. Just blocking out parts of the paper does not really work. The reader is left pondering what the blocked-out text says. If it is not important, why is it there?

My understanding was that some tools were rejected because they were commercial. I understand that they may be expensive or difficult to get, but if such tools were eliminated from the batch, then the general title "are tools effective" does not really hold. This is also true for a bunch of other tools that were eliminated from the study for a variety of reasons. A more fitting title would be something like "an analysis of 19 tools for...".