CLAPP: Characterizing Loops in Android Applications

ABSTRACT

                 

        Many works have been proposed to analyze loops to perform different tasks, ranging from compiler optimization to Worst-Case Execution Time (WCET) analysis. While these approaches are powerful, they focus on tackling very specific categories and known patterns of loops, such as the ones for which the number of iterations can be statically determined. 

        In this work, we developed a static analysis framework to characterize and analyze generic loops, without relying on techniques based on pattern matching. For this work, we focus on the Android platform, and we implemented a prototype, called CLAPP, that we used to perform the first large-scale empirical study on the usage of loops in Android applications. In particular, we used our tool to analyze a total of 4,110,510 loops found in 15,240 Android applications. As part of our evaluation, we provide the detailed results of our empirical study, we show how our analysis was able to determine that the execution of 63.28% of the loops is bounded, and we discuss several interesting insights related to the performance and security aspects.

TECHNICAL DETAILS

 

Preliminary Steps: First, the Dalvik bytecode (DEX) of the application is parsed and disassembled. We compute the Control-Flow Graph (CFG), and we lift the bytecode to an internally-developed SSA-based intermediate representation (IR) that is more suitable for performing static analysis. We will use the snippet of code as shown below as a running example (Fig-1) to clarify our approach. The corresponding CFG (Fig-2) is shown on the right.

Fig-2: CFG of Running Example

Loop Identification: Second, we use existing algorithms to find loops in CFG. In the context of our work, a loop is represented as set of basic blocks, one of which is the header, that can be seen as the entry point of the loop. More formally, the header is the basic block that dominates the execution of all other blocks belonging to the loop. For the CFG in Fig-2, The loop header would be B4, and the body would be constituted by the B4, B5 and B6.

Fig-3: Expression Tree for Register R7

Selective Abstract Interpretation: Finally, we perform Selective Abstract Interpretation (SAI), on top of the expression trees computed for each register. In particular, this analysis step annotates each node of the expression tree to characterize how the value of a given register is initialized, how it evolves during each iteration. The result of this step is an annotated expression tree. An annotation for a node consists in the assignment of a series of labels. For more details please refer paper. In our example, The annotations for register R7 ( whose Expression tree is shown in Fig-3) the value of R7 inside the loop is not statically bound, it is not Fixed, it is Positive and it eventually increases.

Loop Body Characterization: From a high-level point of view, our analysis aims to determine which framework APIs might be possibly invoked (intra- and inter-procedurally) from within the body of the loop. To do this, We compute the transitive closure of all the possible invocations of framework methods. Each framework method is associated with a fine-grained label that indicates what type of operation it performs. The labels of reachable framework methods within the loop characterize the loop body. Please refer to the paper for more details. For our example in Fig-1, labels would be: (collectionSize, iteratorAccess).

 

USAGE

 RESULTS

Dataset:

To build our dataset, we considered the applications collected by the PlayDrone project. Essentially, PlayDrone is a crawler for the official Play Store appstore, and it has been used to crawl more than one million applications. In particular, we selected, at random, a subset of 15,240 applications. These applications span over several categories on the market store, and they contain hundreds to several thousands methods, depending on the
complexity of the application and the libraries they include.

High level Stats:

Application Analysis Time limit: 30 min.

Total Applications: 15,240

Successfully Analyzed:11,823 (77.57%)

Applications Time out: 3,417 (22.43%)

Total Loops Analyzed:4,110,510

Total framework methods reachable: 18,190,014

Average Application analysis time: 96.77 sec

Average Loop analysis time: 50.86 sec

Loop Control and Bound Analysis:

We classify loop control in 2-dimensions (Table 1) . The First dimension is based on number of exit paths. We say that a loop is simple, if it has only one exit path, with one condition. Otherwise, the loop is complex.

The second dimension is based on the loop bound.

Bounded: These loops are guaranteed to terminate.

Risky: These are loops that are implemented so that, independently from whether they terminate or not, a subtle change in the loop's body might cause the loop to become infinite. For example, consider the loop: for (i=0; i != 12; i+=3){..},  in this case much safer alternative would be for(i=0,i<12;i+=3){..}.

Unknown: These are the loops, for which we cannot determine whether it will terminate or not. One of the root causes for which our static analysis framework cannot determine whether a loop is bounded or not is constituted by the fact that the number of iterations of a loop can be influenced by the return value of a method invocation. Although, we model return value of iterators (refer paper), there are other framework methods for which we don't have models yet. However, we believe  that with considerable engineering effort these could be handled in similar way.

Potentially Infinite: These loops indicates that the analyzer identified that the exit path condition cannot be satisfied. Clearly, this aspect might be an indication that the loop is an infinite loop, but it is not a sufficient condition. As explained in paper, it could be that these are instances of poorly-implemented loops for which the execution breaks out from the loop through exception-related edges or by updating some variables from concurrent threads.

Not Supported: These loops rely on switch-like bytecode instructions (instead of the simpler if-* bytecode  instruction), which our prototype currently does not fully support.

For space constraints, we opted to omit the results related to the loop body analysis. Please refer to the paper for all the details.
Fig-1: Running Example

Expression Tree: Fourth, for each if- instruction in each exit path, We first determine which are the registers that are used, and then reconstruct which are the operations that update their values after each iteration. In particular, the analysis reconstructs, for each register an expression tree that encodes how the register is updated after each iteration. In particular, each node of the tree is characterized by a type of operation (e.g., scalar addition) and by its operands, which, in turn, can be a value (e.g., a scalar value, the return value of a method invocation), or another expression tree. For example: The expression tree associated to register R7 is shown in Fig-3.

Exit Path Identification: Third, we identify all possible exit paths. We define an exit path as a set of conditions that need to be satisfied so that the execution of the loop terminates. Naturally, a loop might have multiple exit paths. For example, the snippet of code in Fig-1 has two exit paths: 1) (i >= l.size()); 2) (i < i.size(), l.get(i) == null). Similarly For Fig 2, The exit paths are: (if-ge R7,R6) and ((!if-ge R7,R6), (if-eqz R8))

For more details, please contact:

  • Yanick Fratantonio (yanick@cs.ucsb.edu)
  • Aravind Machiry (machiry@cs.ucsb.edu)
  • Antonio Bianchi (antoniob@cs.ucsb.edu)
  • Christopher Kruegel (chris@cs.ucsb.edu)
  • Giovanni Vigna (vigna@cs.ucsb.edu)

Quick Steps:

  • Make sure you have VirtualBox installed on your system.
  • Download the VM (in OVA format) from link.
  • Import the Downloaded OVA into VirtualBox: File -> Import Appliance -> specify the file path to the .ova file.
  • Use the credentials specified through easychair submission or contact us.
  • In the VM:
    • Open a Terminal window.
    • $ cd /home/clapp/clapp
    • $ source ~/virtenv/clappenv/bin/activate
    • $ python loop_analysis.py -i  ~/samples/app-with-samples.apk -o /tmp/report.json --html /tmp/report.html
    • After the command completes, files /tmp/report.json and /tmp/report.html contain analysis results in JSON and HTML format respectively.
  • A sample app and corresponding output HTML is at link.

Overview: We prepared a VirtualBox VM that contains the tool and some sample Android applications. The tool takes as input the file path to an Android application (.apk file), and it returns as output two files: a JSON file (.json), which
contains the full, /raw/ results of the loop analysis; a human-readable report in HTML which summarizes the highlight of the analysis. We suggest the reviewer of this artifact to consult the HTML report: we believe we included all the
relevant/interesting details about our analysis (both for the bound and the body analysis). In case the reviewer is interested in the raw results, then we suggest to use an online JSON viewer. According to our experience, the JSON
viewer at http://json.parser.online.fr/ seems to be the best one.

 

Importing The Virtual Machine: The VM is accessible at this link. We built the VM to be run within VirtualBox.

  • The VM is in the OVA format, so to make the "import" as simple as possible. It should be sufficient to go to File -> Import Appliance -> specify the file path to the .ova file.
  • We suggest to give to the VM at least 4GB of RAM. For obvious reasons, the more the better.

Accessing The Virtual Machine: Once the VM booted, it is possible to access the tool by using the credentials
specified through the easychair submission form.

 

Running the tool: You can find the tool in the /home/clapp/clapp folder. To run the tool, it is sufficient to run the following command:

   

     $ cd /home/clapp/clapp

     $ source ~/virtenv/clappenv/bin/activate

     $ python loop_analysis.py -i  ~/samples/app-with-samples.apk -o /tmp/report.json --html /tmp/report.html

  • -i option specifies the file path to the APK
  • -o option specifies the file path to the (output) JSON file
  • --html option specifies the file path to the (output) HTML report.

The main analysis script contains many more command line switches, and, more in general, the "clapp" folder contains many more scripts. These options and additional scripts are not directly useful for this evaluation. In fact, they were mostly developed for testing purpouses, and to run the tool over cloud nodes for the large-scale analysis we did for the paper (over more than 10K applications).


Samples: You can analyze any Android application by using above command. However, to make the initial testing easier, we included several sample APK files in our VM. In particular, we added app-with-samples.apk (which is an Android application we
wrote to show case the different kind of analysis capabilities), and we also added a subset of 20 applications selected at random from the dataset we used for the paper. These samples can be found in the /home/clapp/samples folder. Moreover, we also pre-processed all these APKs, and we added the results (both the JSON and the HTML file) in the folder /home/clapp/samples-results.