===========================================================================
S&P 2019 Review #199A
---------------------------------------------------------------------------
Paper #199: Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?
---------------------------------------------------------------------------

Overall merit: 2. Reject - This paper is well below the bar for Oakland. Will argue to reject it.

===== Brief paper summary (2-3 sentences) =====

The paper evaluates free Android security tools against the curated Ghera data set, which covers many known vulnerabilities in Android apps. The authors devote considerable attention to arguing that the results are representative across the Android app ecosystem.

===== Strengths =====

The authors identified and evaluated a set of free Android security tools that can be applied to Android apps without requiring significant configuration.

===== Weaknesses =====

The paper lacks clear scientific contributions. A significant amount of space is devoted to arguing that Ghera is representative of all vulnerabilities in the Android ecosystem, but the argumentation does not seem completely sound.

===== Detailed comments for the author(s) =====

A study of Android security tools against ground truth is interesting and would be a valuable analysis for anyone looking to understand how well these tools work against known vulnerabilities.

The authors cover two different problem areas: vulnerabilities in Android apps as well as malicious behavior from Android apps. These problem areas seem fairly unrelated to each other, and the paper would benefit from a more detailed explanation of why both are included. It seems that malicious behavior most commonly occurs because the developer intentionally included it in the app. If the paper is meant to cover a different problem scenario, e.g., the inclusion of third-party libraries that may hide malicious behavior, it would be helpful to address that explicitly.

The section on the representativeness of Ghera is quite long. However, it does not really address the core question, i.e., whether the vulnerabilities included in Ghera are representative of vulnerabilities found in real Android apps. The paper assumes that API usage is a good proxy for vulnerabilities. I don't believe that to be the case, and as a result the conclusions derived from the API analysis do not seem convincing enough to answer the original question about the vulnerabilities included in Ghera. The overall discussion of APIs was also somewhat confusing, as the paper does not include a concise definition of what is considered an API.

When evaluating the security tools, the paper makes a number of quite big assumptions. One of them is that the tools should work without any configuration. Why was that a good assumption to make? And does that assumption align with how the studied security tools are supposed to be used?

The paper reads a little like a collection of steps the authors carried out in their evaluation. This takes a lot of space without conveying deep insights about the problem space. It would have been more helpful to look into the individual security tools in more detail to better understand why they did or did not detect the various vulnerabilities. Without such an analysis, it is not clear that much can be learned from the current study.
The summary finding is essentially that existing security tools don't find all vulnerabilities. In some sense that seems like a foregone conclusion. However, it would be very interesting to understand in more detail why that is the case. Perhaps that could be an avenue to explore for improving this paper.

===========================================================================
S&P 2019 Review #199B
---------------------------------------------------------------------------
Paper #199: Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?
---------------------------------------------------------------------------

Overall merit: 3. Weak reject - The paper has flaws, but I will not argue against it.

===== Brief paper summary (2-3 sentences) =====

The paper evaluates a list of Android app vulnerability detection tools on how effective they are at identifying known vulnerabilities. For this, the authors use the Ghera benchmark, which includes Android applications with known vulnerabilities. Before presenting the results of their evaluation, the authors provide an extensive analysis of the representativeness of the Ghera benchmark; they find that 50% of the real-world applications use the APIs used by the applications in the Ghera benchmark. The evaluation of 28 free tools yields a number of findings, such as that vulnerability detection tools are not able to identify the classes of vulnerabilities they claim to detect.

===== Strengths =====

- The authors invest a considerable amount of engineering work in demonstrating the representativeness of the Ghera benchmark.
- The main goal of studying the current state of real-world vulnerability detectors is interesting.

===== Weaknesses =====

- The evaluation of the vulnerability detectors is kept very short and is incomplete.
- The evaluation of the malicious behavior detectors is incomplete and not relevant to the paper.

===== Detailed comments for the author(s) =====

- The main motivation of the paper is that the current state of vulnerability detection tools available to Android app developers is an understudied phenomenon that needs further exploration. Academia has been working on sophisticated techniques to identify known and unknown vulnerabilities in Android apps. However, researchers haven't adequately looked at how widely their solutions are adopted by real-world applications and how far reality is from what is discussed in the academic community. Although I don't think a researcher who has been contributing to the topic is also responsible for making sure the outcome is exploited in real life, obtaining knowledge about the current state of vulnerability detectors out there is important. Therefore, I see value in this paper. That being said, I don't think it is fair to criticize academics for comparing their new solutions against existing academic work while ignoring the tools available in app stores, or for not providing practical solutions that are easy for app developers to understand. The goal of a researcher is to improve the state of the art and compare against the most current, most efficient, and most effective technique proposed earlier, not against simplistic techniques that focus only on the detection of known vulnerabilities.
  I suggest that the authors remove the parts of the paper where they criticize the existing works [4] and [5] for these reasons and instead focus on a discussion of how the security of mobile applications could be improved if these tools were designed to be friendlier, easier to configure, and to behave as promised in their documentation.

- The authors criticize existing works because they do not compare themselves with real-world tools; however, this paper does the same in reverse. Wouldn't this study have been more complete if the authors had also evaluated academic solutions on the same benchmark and touched on why these methods are not adopted in practice and what we should do to reach the practitioner community better?

- I appreciated the section where the authors discuss the representativeness of the vulnerabilities in the Ghera benchmark, even though I found it overly long. However, as the results show, the Ghera benchmark does not have the best coverage of all existing vulnerabilities, and this makes the outcome of this discussion questionable. To be more precise, 8 of the vulnerability detection tools claimed to identify other vulnerabilities that are not found in the Ghera benchmark. It is not accurate to put these tools into the category of tools that cannot detect vulnerabilities, as they are designed for other purposes.

- The authors invested the majority of their time in proving the representativeness of the Ghera benchmark and therefore did not spend enough time on the evaluation of the tools, which is supposed to be the core goal of the paper. To simplify the evaluation and produce results in a short period, they had to filter out many tools for various reasons, such as having complicated configurations, being commercial, etc. In the beginning, the authors claim to have considered a large list of Android vulnerability scanners; however, in the end only 28 of them were evaluated on the Ghera data set, and some of those identify other vulnerabilities that are not present in the benchmark. In particular, the authors evaluate only free tools. I believe the study would have been more complete if the commercial tools were also evaluated. I understand that one of the goals of the paper is to evaluate these tools from the perspective of usability and ease of configuration; however, a more security-aware developer might prefer to pay if the commercial tools are more effective.

- The authors do not articulate the novelty of their findings. For example, did we know that most of the tools are not even able to identify known vulnerabilities? This is interesting. However, the authors do not discuss the details of the vulnerabilities that are identified or missed. Which kinds of vulnerabilities are generally missed by the tools? Do the tools generally identify the same set of vulnerabilities or different ones?

- After all, the authors intentionally picked tools that perform shallow analysis and do not require additional input from developers, such as details about the source code or code annotations. Therefore, I am not surprised by the low true positive rate. To be fairer to the existing, more sophisticated tools, the authors could implement an application with interesting existing vulnerabilities, provide the required input for these more sophisticated tools, and evaluate them.
  Or, if the source code of the apps in the benchmark is available, they could analyze some of them and provide the required information.

- To summarize, I in general liked the paper but found it incomplete. I don't see why it wasn't possible for the authors to complete the items in the future work section of the paper. Adding some commercial tools and more sophisticated tools to the analysis and presenting results with them would make the paper stronger.

Some additional comments and questions:

- "While software development community has recently realized the importance of security, developer awareness about how security issues transpire and how to avoid them is still lacking": the study of the effectiveness of automatic vulnerability scanners is not at all a novel topic, and this motivating sentence is not accurate. The software development community is not new to the concept of security, and neither is the mobile software community.

- The paper evaluates 64 vulnerability and malicious behavior detection tools for Android apps. Why did the authors include the malicious behavior detection tools in their analysis as well? This is largely irrelevant to the goal of the paper.

- Why did the authors only use the lean benchmarks of Ghera, excluding the real-world apps with vulnerabilities? Wouldn't it be more realistic to test these vulnerability tools with real applications rather than stripped-down apps?

- Why did the authors make the distinction between relevant and security-relevant APIs?

Nitpicks:

- "For this tools evaluation to be useful to tool users, tool developers, and researchers" -- fix this.
- "a weak proxy for release date of apps." - a weak approximation ...

===========================================================================
S&P 2019 Review #199C
---------------------------------------------------------------------------
Paper #199: Are Free Android App Security Analysis Tools Effective in Detecting Known Vulnerabilities?
---------------------------------------------------------------------------

Overall merit: 3. Weak reject - The paper has flaws, but I will not argue against it.

===== Brief paper summary (2-3 sentences) =====

In this paper, the authors evaluate the effectiveness of vulnerability (and malicious behavior) detection tools for Android apps. The authors considered 64 security analysis tools and empirically evaluated 19 of them (14 vulnerability detection tools and 5 malicious behavior detection tools) against 42 known vulnerabilities captured by benchmarks in the Ghera repository. They report their findings and try to provide some insights.

===== Strengths =====

- A lot of work has been published in the domain of Android app security, but this is still an important area. The paper addresses an interesting problem, I think.
- The paper is quite easy to understand and follow.

===== Weaknesses =====

- This is the first time I am reading an anonymized paper where parts of the paper have been blacked out. It feels like the paper was anonymized by simply censoring certain parts of it. This actually makes the paper more difficult to understand, and it raises questions. For example, what is the answer to why the evaluation took more than a year? Why has that part been partially censored?

- As the authors state, their selection of tools and the configurations could easily have been biased by their preferences and their know-how.
  Starting out with 64 tools and bringing the number down to 19 takes away quite a bit from the value of the evaluation and the study.

- Some of the insights distilled by the authors are a little superficial. For example, the claim "there exists a gap between the claimed capabilities and the observed capabilities of tools that could lead to vulnerabilities in apps" is generally not surprising in the domain of security. Similarly, the reported insight "most tools prefer to report only valid vulnerabilities or most tools can only detect specific manifestations of vulnerabilities" is also not very deep, and I am not sure what we can learn from it.

===== Detailed comments for the author(s) =====

All in all, I think this is not a bad paper, but compared to everything else I've reviewed at Oakland, I am not sure it makes the bar. The insights the authors have gained are limited, and it is clear that the evaluation might have significant biases introduced by the authors (e.g., how the tools have been configured, a misunderstanding of documentation, etc.).

I think the anonymization needs significant work. Just blocking out parts of the paper does not really work. The reader is left pondering what the blocked-out text says. If it is not important, why is it there?

My understanding was that some tools were rejected because they were commercial. I understand that they may be expensive or difficult to get, but if such tools were eliminated from the batch, then the general title "are tools effective" does not really hold. This is also true for a bunch of other tools that were eliminated from the study for a variety of reasons. A more fitting title would be something like "an analysis of 19 tools for...".