# CodeQL for Beginners

### **Introduction to CodeQL**

CodeQL is a semantic code analysis engine originally developed by Semmle, which GitHub acquired in 2019. It lets you query code as though it were data: CodeQL first builds a database from your source code, and you then write queries against that database. Just as you might query a relational database to find specific records, you can use CodeQL to find specific patterns in your code.

For instance, suppose you want to find every `if` statement in your codebase that doesn't have a corresponding `else` statement. With traditional means, this would be quite a tedious task, but with CodeQL, you can just write a query to get all such instances.

Here's a simple example of what such a query looks like for a Java codebase:

```codeql
import java

// Find every `if` statement that has no `else` branch.
from IfStmt ifstmt
where not exists(ifstmt.getElse())
select ifstmt, "This 'if' statement doesn't have an 'else' branch."
```

This query considers every `if` statement (`IfStmt`) in the database, keeps those for which no `else` branch exists (`not exists(ifstmt.getElse())`), and selects each remaining statement along with a message.
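To make the "code as data" idea concrete without installing anything, here is a minimal sketch of the same search written with Python's standard `ast` module. It is only an analogy and not how CodeQL works internally; the function name `ifs_without_else` and the sample source are made up for illustration.

```python
import ast


def ifs_without_else(source: str) -> list[int]:
    """Return the line numbers of `if` statements that have no `else` branch."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        # `orelse` is empty when there is neither an `else` nor an `elif`.
        if isinstance(node, ast.If) and not node.orelse
    )


SAMPLE = """\
if a:
    x = 1
if b:
    y = 2
else:
    y = 3
"""

print(ifs_without_else(SAMPLE))  # [1]
```

The syntax tree plays the role of the database, and the generator expression plays the role of the `from`, `where`, and `select` clauses. Note that Python represents `elif` as a nested `if` inside `orelse`, so a trailing `elif` without an `else` is reported on the `elif` line.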

The primary use of CodeQL, however, is not merely to find syntactical patterns like in the example above but to identify more complex, semantic patterns that can highlight potential security vulnerabilities. In fact, it's one of the most powerful tools currently available for semantic code analysis in the context of security.

### **The Importance of Security Testing**

In today's era, where software forms the backbone of numerous critical systems, ensuring software security has never been more important. Security vulnerabilities in code not only pose a risk to data privacy, but they can also lead to financial losses and damage to an organization's reputation. Security testing forms the first line of defense against such vulnerabilities.

Semantic code analysis tools like CodeQL allow developers to analyze their code from a security standpoint. These tools can uncover complex security vulnerabilities by analyzing the meaning of code, which is something that traditional syntactic code scanners may miss. For instance, CodeQL can help you find SQL injection vulnerabilities, cross-site scripting (XSS) vulnerabilities, and more, even in a large and complex codebase.

The sections below walk through installing CodeQL, writing queries, and integrating them into your security testing, so that you can apply the tool to your own codebase.

### What is Code Analysis?

Code analysis, in the sense used here, means static analysis: finding defects by examining source code without running the program. It works by checking the code against one or more sets of coding rules. Code analysis is an important part of software development because it improves software quality and catches bugs early, when they are cheapest to fix.

Code analysis can be done both manually and automatically. Manual code reviews are time-consuming and can be error-prone. Automated code analysis, on the other hand, offers an efficient and reliable alternative. CodeQL is an example of an automated code analysis tool.

> **Note:** CodeQL is particularly powerful because it performs semantic code analysis, which goes a step beyond mere syntactic analysis to understand the 'meaning' of code.

### How Code Analysis Improves Software Security

Code analysis plays a pivotal role in improving software security. By identifying potential vulnerabilities at the coding stage, it can help prevent security breaches that might occur when the software is in use. Some ways in which code analysis enhances software security include:

1. **Early Bug Detection:** Code analysis can uncover bugs and vulnerabilities early in the development process, even before the testing phase. This allows developers to fix problems before they can be exploited in a live environment.
2. **Automation:** Automated code analysis tools like CodeQL can scan large codebases quickly, which makes it practical to check the whole codebase on every change rather than only the files a reviewer happens to read.
3. **Coding Standards Compliance:** Code analysis helps check that the code complies with agreed coding practices. Following these practices can prevent a number of common security issues.
4. **In-depth Analysis:** Tools like CodeQL allow for deep, semantic analysis of code. This means they can find complex vulnerabilities that may be missed by simple syntactic analysis.

## Setup

In this chapter, we'll go through the process of setting up your CodeQL environment. We'll cover the steps to download and install CodeQL, introduce you to the CodeQL command-line interface (CLI), and discuss setting up CodeQL for various Integrated Development Environments (IDEs).

### Downloading and Installing CodeQL

Before you start with CodeQL, you need to download and install it. Here's how you can do it:

1. **Download the CodeQL CLI:** Download the CodeQL CLI from the [releases page of the `github/codeql-cli-binaries` repository](https://github.com/github/codeql-cli-binaries/releases). Choose the archive that matches your operating system.
2. **Unpack the archive:** Once you've downloaded the archive, unpack it to a location of your choice.
3. **Add CodeQL to your PATH:** After unpacking, add the directory that contains the `codeql` executable to your system's PATH environment variable. This will allow you to run CodeQL commands from anywhere.

Here's an example of how you can add CodeQL to your PATH on a Unix-like system:

```bash
export PATH=$PATH:/path/to/codeql
```

And here's the equivalent in PowerShell on Windows:

```powershell
$env:Path += ";C:\path\to\codeql"
```

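Replace `/path/to/codeql` and `C:\path\to\codeql` with the directory that contains the `codeql` executable on your system. Both commands affect only the current shell session; to make the change permanent, add it to your shell profile or to your system's environment variable settings.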

### CodeQL CLI

The CodeQL command-line interface (CLI) is a powerful tool that allows you to run CodeQL queries, create databases for analysis, and perform a variety of other tasks.

Some basic commands you might find useful include:

* `codeql database create`: This command creates a new CodeQL database. You can specify the language of the database with the `--language` option.
* `codeql query run`: This command runs a CodeQL query on a database.

Here's an example of how you might use these commands:

```bash
# Create a new JavaScript database
codeql database create my-js-database --language=javascript --source-root ./my-js-project

# Run a query on the database
codeql query run ./my-query.ql --database my-js-database
```

In the example above, replace `./my-js-project` with the path to your JavaScript project, and `./my-query.ql` with the path to your CodeQL query.
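If you want to drive these two commands from a script, for example in a nightly job, the sketch below builds the same command lines in Python. It only uses the options shown above; the helper names `create_database_cmd` and `run_query_cmd` are hypothetical, and the script prints the commands rather than running them.

```python
import shlex


def create_database_cmd(db_dir: str, language: str, source_root: str) -> list[str]:
    """Build the argument list for `codeql database create`."""
    return [
        "codeql", "database", "create", db_dir,
        f"--language={language}",
        "--source-root", source_root,
    ]


def run_query_cmd(query_file: str, db_dir: str) -> list[str]:
    """Build the argument list for `codeql query run`."""
    return ["codeql", "query", "run", query_file, "--database", db_dir]


commands = [
    create_database_cmd("my-js-database", "javascript", "./my-js-project"),
    run_query_cmd("./my-query.ql", "my-js-database"),
]

for cmd in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

To actually execute a command, pass the list to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems when paths contain spaces.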

### Setting Up CodeQL for Various IDEs

While you can run CodeQL queries from the command line, you might find it more convenient to work in an editor. GitHub maintains an official CodeQL extension for Visual Studio Code, which is the setup covered here.

#### Visual Studio Code

For instance, if you're using Visual Studio Code, you can install the [CodeQL for Visual Studio Code](https://marketplace.visualstudio.com/items?itemName=GitHub.vscode-codeql) extension. This extension provides CodeQL syntax highlighting, query help, and database management.

To install the extension, open Visual Studio Code and follow these steps:

1. Click on the Extensions view icon on the Sidebar (or press `Ctrl+Shift+X`).
2. Search for "CodeQL".
3. Click on Install.

Once installed, open the CodeQL view and add a database in the Databases panel, for example by choosing the option to load one from a folder. With a database selected, open a `.ql` file, right-click in the editor, and choose "CodeQL: Run Query on Selected Database". The exact labels vary between extension versions.

## CodeQL Queries

### The Structure of CodeQL Queries

A typical CodeQL query has the following components:

1. **Import Statements:** CodeQL queries start with import statements. These statements import the CodeQL libraries for the specific language you're analyzing.
2. **From-Where-Select Blocks:** CodeQL queries retrieve data using from-where-select blocks.
   * The `from` clause defines a variable with a specific type.
   * The `where` clause sets a condition that the data needs to satisfy.
   * The `select` clause decides the final output of the query.
3. **Query Metadata:** Query metadata, enclosed in a comment block at the beginning of the query file, provides information about the query. It can include the purpose of the query, its author, and more.

Here's an example of a simple CodeQL query to find all Python functions named `execute`:

```codeql
import python

from Function f
where f.getName() = "execute"
select f
```

In this query, we import the CodeQL library for Python with `import python`. Then, we define a variable `f` of type `Function`, and in the `where` clause, we set a condition that the function's name should be "execute". Finally, we select the function.

### Understanding CodeQL Libraries

CodeQL has libraries for different programming languages. These libraries contain classes that represent various elements of the code you're analyzing, and predicates that represent the properties and relations of these elements.

For instance, the Python library includes classes such as `Function` for Python functions, `Class` for Python classes, and `Module` for Python modules. These classes have associated predicates. For example, the `Function` class has predicates like `getName()` to get the function's name and `getArg(int n)` to get the function's `n`-th parameter.

Here's an example query that uses the Python library to find all calls to a function named `execute`:

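```codeql
import python

// The callee of a call is an expression: a plain name (`execute(...)`)
// or an attribute access (`cursor.execute(...)`).
from Call c
where
  c.getFunc().(Name).getId() = "execute" or
  c.getFunc().(Attribute).getName() = "execute"
select c, "Call to a function or method named 'execute'."
```

In this query, we define a variable `c` of type `Call`, which represents a call expression. The `getFunc()` predicate returns the expression being called. The `where` clause accepts either a plain name (`Name`) whose identifier is `execute`, or an attribute access (`Attribute`) such as `cursor.execute` whose attribute name is `execute`. Finally, we select the call.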

### Querying for Vulnerabilities

CodeQL is a powerful tool for finding vulnerabilities in code. It can help you find patterns in your code that could lead to security vulnerabilities.

For instance, consider the following Python code:

```python
@app.route('/api/data')
def api_data():
    param = request.args.get('param', '')
    results = db.session.execute('SELECT * FROM data WHERE name = %s' % param)
    ...
```

This code is vulnerable to SQL Injection because it directly includes a user-supplied parameter (`param`) in a SQL query.
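To see why this matters, here is a small self-contained sketch that reproduces the problem and the usual fix. It assumes an in-memory SQLite database through Python's standard `sqlite3` module instead of the `db.session` object above, and the helper names (`make_db`, `find_unsafe`, `find_safe`) are made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a small `data` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (name TEXT)")
    conn.executemany("INSERT INTO data VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_unsafe(conn: sqlite3.Connection, param: str) -> list:
    # Vulnerable: the parameter becomes part of the SQL text.
    return conn.execute("SELECT * FROM data WHERE name = '%s'" % param).fetchall()


def find_safe(conn: sqlite3.Connection, param: str) -> list:
    # Fixed: the driver passes the parameter as a value, never as SQL.
    return conn.execute("SELECT * FROM data WHERE name = ?", (param,)).fetchall()


conn = make_db()
payload = "x' OR '1'='1"

print(len(find_unsafe(conn, payload)))  # 2: the injected OR clause matches every row
print(len(find_safe(conn, payload)))    # 0: no row is literally named "x' OR '1'='1"
```

The pattern a query needs to look for is therefore user input reaching the SQL text itself, as opposed to being passed separately as a bound parameter.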

We can write a CodeQL query to find similar vulnerabilities in a Python codebase:

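```codeql
import python

// Heuristic: a call to a method named `execute` whose first argument
// is a string literal combined with other data through the `%` operator.
from Call call, BinaryExpr fmt, StrConst template
where
  call.getFunc().(Attribute).getName() = "execute" and
  call.getArg(0) = fmt and
  fmt.getOp() instanceof Mod and
  fmt.getLeft() = template
select call, "This SQL query is built with string formatting and may be vulnerable to SQL Injection."
```

In this query, we first define three variables: `call` of type `Call` (representing a call expression), `fmt` of type `BinaryExpr` (representing a binary operation), and `template` of type `StrConst` (representing a string constant).

The `where` clause sets four conditions:

1. The call invokes a method named `execute`, as in `db.session.execute(...)`.
2. The first argument of the call is the binary expression `fmt`.
3. The operator of `fmt` is `%` (`Mod`), which Python uses for string formatting.
4. The left operand of `fmt` is the string constant `template`.

Finally, we select the call and output a warning message.

The query uses the `getFunc()` and `getArg()` predicates of the `Call` class to inspect the callee and its arguments, and the `getOp()` and `getLeft()` predicates of `BinaryExpr` to inspect the formatting expression. It is a purely syntactic heuristic: it does not check whether the formatted value actually comes from user input, and it misses queries built with `+`, f-strings, or `str.format()`. Some newer releases of the Python library rename `StrConst` to `StringLiteral`, so adjust the type name to match the version you have installed.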

## CodeQL for Programming Languages

While CodeQL is language-agnostic in its core principles, different programming languages have unique features and vulnerabilities. In this chapter, we will explore how CodeQL is used with different programming languages, focusing on JavaScript, Python, and Java.

### CodeQL for JavaScript

JavaScript is often used in web applications, which are prime targets for security exploits. Let's write a CodeQL query to identify a common security vulnerability in JavaScript: Cross-Site Scripting (XSS).

XSS happens when untrusted input is directly included in output that gets rendered in a user's browser. For example, consider the following piece of Node.js code using the Express framework:

```javascript
app.get('/sayHello', (req, res) => {
  res.send('Hello, ' + req.query.name + '!');
});
```

The `name` query parameter is directly included in the response sent to the client. If it includes JavaScript code, this code gets executed in the user's browser.

Here's a CodeQL query that detects similar issues:

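```codeql
/**
 * @kind path-problem
 */

import javascript
import semmle.javascript.security.dataflow.ReflectedXssQuery
import ReflectedXssFlow::PathGraph

from ReflectedXssFlow::PathNode source, ReflectedXssFlow::PathNode sink
where ReflectedXssFlow::flowPath(source, sink)
select sink.getNode(), source, sink,
  "This code may be vulnerable to Cross-Site Scripting (XSS)."
```

Rather than modelling sources, sinks, and sanitizers by hand, this query reuses the reflected XSS analysis that ships with the standard JavaScript library. That analysis follows the flow of data from sources (user input such as `req.query.name`) to sinks (where the data gets used in a potentially unsafe way, such as the argument of `res.send`), and it discards paths that pass through a recognized sanitizer (code that cleans the input). The module names shown here follow recent releases of the library; older releases expose the same analysis through a `Configuration` class, so check the version you have installed.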

### CodeQL for Python

Python is widely used in web and network applications, data analysis, and more. We'll focus on a common security issue in web applications: Open Redirect.

Open Redirect vulnerabilities occur when an application incorporates user-controllable data into the target of a redirection in an unsafe way. Consider the following Python code using the Flask framework:

```python
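from flask import Flask, redirect, request

app = Flask(__name__)

@app.route('/redirect')
def redirect_to_target():  # not named `redirect`, which would shadow flask.redirect
    target = request.args.get('target', '/')
    return redirect(target, code=302)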
```

In this example, the application redirects the user to the URL they specified in the `target` parameter. An attacker could use this to redirect the user to a phishing page.
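One common mitigation is to accept only targets that stay on the same site. The sketch below shows one way to express that check with the standard `urllib.parse` module; the function name `is_safe_redirect_target` is hypothetical, and a real application may prefer an explicit allowlist of destinations.

```python
from urllib.parse import urlparse


def is_safe_redirect_target(target: str) -> bool:
    """Accept only site-relative paths such as '/account'."""
    # Some browsers treat backslashes like forward slashes, so normalize first.
    normalized = target.replace("\\", "/")
    parts = urlparse(normalized)
    return (
        not parts.scheme                     # rejects 'https://...' and 'javascript:...'
        and not parts.netloc                 # rejects '//evil.example'
        and normalized.startswith("/")       # must be an absolute path on this site
        and not normalized.startswith("//")  # protocol-relative URLs leave the site
    )


for candidate in ["/account", "https://evil.example/", "//evil.example", "/\\evil.example"]:
    print(candidate, is_safe_redirect_target(candidate))
```

A handler would call this function on `target` and fall back to a fixed default such as `/` when it returns `False`. From the analysis point of view, a check like this acts as a sanitizer on the path between the request parameter and the redirect.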

Here's a CodeQL query that finds similar issues in a Python codebase:

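```codeql
/**
 * @kind path-problem
 */

import python
import semmle.python.security.dataflow.UrlRedirectQuery
import UrlRedirectFlow::PathGraph

from UrlRedirectFlow::PathNode source, UrlRedirectFlow::PathNode sink
where UrlRedirectFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "This code may be vulnerable to Open Redirect."
```

This query reuses the URL redirection analysis from the standard Python library, which already models web frameworks such as Flask. It uses taint tracking to check whether user-controlled data, such as a request parameter, can flow into the URL argument of a redirect call. The module names shown here follow recent releases of the library and may differ in older versions.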

### CodeQL for Java

Java is commonly used in web applications, server-side applications, and Android apps. We'll write a CodeQL query to detect a frequent security vulnerability: SQL Injection.

SQL Injection happens when untrusted input is included directly in a SQL query. Consider the following Java code:

```java
String query = "SELECT * FROM users WHERE name = '" + userName + "'";
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(query);
```

Here, `userName` is included directly in the SQL query. If it includes SQL code, this code gets executed in the database.

The following CodeQL query identifies similar issues:

```codeql
import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.security.SqlInjectionQuery

// QueryInjectionFlow is the standard library's taint-tracking analysis
// from user-controlled input to SQL (and similar) query sinks.
from DataFlow::Node source, DataFlow::Node sink
where QueryInjectionFlow::flow(source, sink)
select sink, "This code may be vulnerable to SQL Injection: the query depends on $@.", source,
  "user input"
```

This query uses the `QueryInjectionFlow` analysis from the standard Java library to find data that flows from user input (the source) to a SQL injection sink (where user input gets included in a SQL query). The sinks are modelled by the `QueryInjectionSink` class, which is defined in the CodeQL library for Java. These names follow recent releases of the library and may differ in older versions.

## Additional CodeQL Techniques

After understanding how to write basic CodeQL queries for different programming languages, it's time to deepen your understanding of CodeQL. In this chapter, we will discuss more advanced techniques to write CodeQL queries, such as using path queries for detailed analysis and incorporating control flow analysis.

### Path Queries

While most CodeQL queries simply identify problematic code patterns, sometimes you need more detailed information. Path queries provide more context by showing the data flow path from a source (where data comes from) to a sink (where it ends up). They are especially useful for understanding how a vulnerability arises from the propagation of tainted data.
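The idea behind a path query can be sketched in a few lines of ordinary Python: treat the program as a graph of data-flow steps and search for paths from sources to sinks that avoid sanitizers. The graph below is hand-written for illustration, and `find_taint_paths` is a toy stand-in, not CodeQL's actual algorithm.

```python
from collections import deque


def find_taint_paths(edges, sources, sinks, sanitizers=frozenset()):
    """Return one shortest path from each source to each sink it can reach."""
    paths = []
    for src in sources:
        queue = deque([[src]])
        seen = {src}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks:
                paths.append(path)
                continue
            for nxt in edges.get(node, ()):
                # Taint does not propagate through a sanitizer.
                if nxt in seen or nxt in sanitizers:
                    continue
                seen.add(nxt)
                queue.append(path + [nxt])
    return paths


# Hypothetical data-flow steps for a small request handler.
EDGES = {
    "request.getParameter": ["userName"],
    "userName": ["query", "escape(userName)"],
    "query": ["executeQuery(query)"],
    "escape(userName)": ["safeQuery"],
    "safeQuery": ["executeQuery(safeQuery)"],
}
SOURCES = ["request.getParameter"]
SINKS = {"executeQuery(query)", "executeQuery(safeQuery)"}

for path in find_taint_paths(EDGES, SOURCES, SINKS, sanitizers={"escape(userName)"}):
    print(" -> ".join(path))
```

With the sanitizer in place, only the path through `query` is reported. The list of nodes in each path corresponds to the steps that a path query displays between the source and the sink.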

For example, consider a SQL Injection vulnerability in a Java application. The following path query can help us identify how tainted data flows from a source to a sink:

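```codeql
/**
 * @kind path-problem
 */

import java
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.dataflow.FlowSources
import semmle.code.java.security.QueryInjection

module SqlInjectionConfig implements DataFlow::ConfigSig {
  // Sources: data that arrives from a remote user, such as request parameters.
  predicate isSource(DataFlow::Node source) { source instanceof RemoteFlowSource }

  // Sinks: arguments that are interpreted as a database query.
  predicate isSink(DataFlow::Node sink) { sink instanceof QueryInjectionSink }
}

module SqlInjectionFlow = TaintTracking::Global<SqlInjectionConfig>;

import SqlInjectionFlow::PathGraph

from SqlInjectionFlow::PathNode source, SqlInjectionFlow::PathNode sink
where SqlInjectionFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "This code may be vulnerable to SQL Injection."
```

In this query, we define a data flow configuration module that identifies tainted data flowing from user input to a SQL Injection sink, and instantiate a taint-tracking analysis from it with `TaintTracking::Global`. The `flowPath(source, sink)` call checks for the existence of such a data flow path. Because the query is marked `@kind path-problem` and imports the `PathGraph` module, it outputs not only the sink, but also the source and the entire data flow path, providing more context about the vulnerability. This module-based style is the one used by recent CodeQL releases; older releases used a `TaintTracking::Configuration` class with a `hasFlowPath` predicate instead.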

### Control Flow Analysis

Control flow analysis allows you to track the execution path through a program. This is useful when you want to understand the order in which statements and expressions are evaluated. CodeQL provides classes and predicates for control flow analysis in its standard libraries.

Here's an example of a control flow analysis query for a Java codebase, which finds places where a null check is performed after a variable is used:

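```codeql
import java
import semmle.code.java.controlflow.Dominance

from LocalVariableDecl v, MethodCall call, EqualityTest check
where
  // `v` is dereferenced as the qualifier of a method call ...
  call.getQualifier() = v.getAnAccess() and
  // ... and compared against `null` somewhere else ...
  check.getAnOperand() = v.getAnAccess() and
  check.getAnOperand() instanceof NullLiteral and
  // ... and every path that reaches the null check has already passed the dereference.
  strictlyDominates(call.getControlFlowNode(), check.getControlFlowNode())
select call, "Variable '" + v.getName() + "' is dereferenced here but null-checked later."
```

In this query, `MethodCall` represents a method call whose qualifier is an access to the local variable `v`, and `EqualityTest` represents an `==` or `!=` comparison, which we restrict to comparisons of `v` against `null`. The `strictlyDominates()` call holds when every execution path that reaches the null check has already passed through the method call, so the check comes too late to protect the dereference. The query is a sketch: it does not account for `v` being reassigned between the two points, and class and predicate names such as `MethodCall` (formerly `MethodAccess`) vary between library versions.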
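Dominance is easier to grasp on a tiny example. The sketch below computes dominator sets for a hand-written control flow graph using the classic iterative algorithm; the node names and the `compute_dominators` helper are made up for illustration and are unrelated to CodeQL's implementation.

```python
def compute_dominators(cfg: dict[str, list[str]], entry: str) -> dict[str, set[str]]:
    """Map each node to the set of nodes that dominate it (including itself)."""
    preds: dict[str, set[str]] = {node: set() for node in cfg}
    for node, successors in cfg.items():
        for succ in successors:
            preds[succ].add(node)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {node: set(cfg) for node in cfg}
    dom[entry] = {entry}

    changed = True
    while changed:
        changed = False
        for node in cfg:
            if node == entry:
                continue
            incoming = [dom[p] for p in preds[node]]
            new = {node} | (set.intersection(*incoming) if incoming else set())
            if new != dom[node]:
                dom[node] = new
                changed = True
    return dom


# value.length() runs first, then `if (value != null)` branches to `then` or `exit`.
CFG = {
    "entry": ["use"],
    "use": ["check"],
    "check": ["then", "exit"],
    "then": ["exit"],
    "exit": [],
}

dominators = compute_dominators(CFG, "entry")
print(sorted(dominators["check"]))  # ['check', 'entry', 'use']
print("then" in dominators["exit"])  # False: `exit` is reachable without `then`
```

Because `use` appears in the dominator set of `check`, every execution that reaches the null check has already used the variable, which is exactly the situation the query above is meant to report.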

### Advanced Libraries and Class Definitions

As your needs get more complex, you will start to define your own classes and predicates in CodeQL. You can also make use of advanced CodeQL libraries that define classes and predicates for common code patterns and vulnerabilities.

For instance, CodeQL provides libraries for working with various web frameworks (like Express.js for JavaScript and Django for Python), identifying standard sources of user input and sinks of potential vulnerabilities, and tracking the flow of data and control in a program.

Here's an example of a custom class definition in a CodeQL query:

```codeql
import java

class PublicMutableField extends Field {
  PublicMutableField() {
    this.isPublic() and
    not this.isFinal() and
    not this.isStatic()
  }
}

from PublicMutableField field
select field, "This field is public and mutable."
```

This query defines a new class `PublicMutableField` for public, non-final, non-static fields, which can be unsafe because they can be accessed and modified from anywhere. It then finds all instances of this class in a Java codebase.

### Query Optimizations

As your CodeQL queries get more complex, they may also get slower. There are various ways to optimize CodeQL queries for better performance.

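One of the most effective ways to speed up a CodeQL query is to limit the number of possibilities it needs to consider. Constrain each variable as early and as tightly as you can, and make sure every variable in the `from` clause is related to the others in the `where` clause; otherwise the evaluator has to build a cross product of unrelated values. QL has no keyword that simply makes a query faster: aggregates such as `count` and `strictcount` change what a query computes, not how quickly it runs.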

Also, it's recommended to use specific types as much as possible. For instance, if you know that a variable represents a string, use `StrConst` instead of `Expr`. The more specific the type, the faster CodeQL can find instances of it.

Here's an example for a Python codebase in which every variable has the most specific type available:

```codeql
import python

from Call call, BinaryExpr fmt, StrConst template
where
  call.getFunc().(Attribute).getName() = "execute" and
  call.getArg(0) = fmt and
  fmt.getOp() instanceof Mod and
  fmt.getLeft() = template
select call, "This code may be vulnerable to SQL Injection."
```

In this query, `call` is declared as a `Call`, `fmt` as a `BinaryExpr`, and `template` as a `StrConst`, rather than declaring all three as the general `Expr` type and narrowing them in the `where` clause. Every condition also links the variables to one another, so no combination of unrelated values has to be considered. This usually makes the query faster.

## Analyzing Real-World Vulnerabilities with CodeQL

Understanding how vulnerabilities exist in real-world code is a key step towards effective security testing. In this chapter, we're going to take a closer look at a few real-world vulnerabilities and see how we can utilize CodeQL to identify these vulnerabilities in a codebase.

### CVE-2017-5638: Apache Struts Command Injection Vulnerability

One of the most impactful vulnerabilities in recent memory was a command injection vulnerability in the Apache Struts web application framework, identified by CVE-2017-5638. The vulnerability existed in the way Struts processed Content-Type headers in an HTTP request.

The vulnerable code was similar to:

```java
String contentType = request.getContentType();
if(contentType != null && contentType.indexOf("multipart") > -1){
    // process request...
}
```

The `indexOf` check itself is harmless. The problem is what happens next: when parsing of a malformed multipart request fails, Struts builds an error message that contains the attacker-controlled Content-Type value and evaluates that message as an OGNL expression (OGNL is the expression language Struts uses for reading and writing data). This allowed the attacker to run arbitrary commands on the server.

A CodeQL query that looks for this kind of flow is:

```codeql
/**
 * @kind path-problem
 */

import java
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.security.OgnlInjection

module StrutsContentTypeConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source.asExpr().(MethodCall).getMethod().getName() = "getContentType"
  }

  // Any expression that the standard library models as being evaluated as OGNL.
  predicate isSink(DataFlow::Node sink) { sink instanceof OgnlInjectionSink }
}

module StrutsContentTypeFlow = TaintTracking::Global<StrutsContentTypeConfig>;

import StrutsContentTypeFlow::PathGraph

from StrutsContentTypeFlow::PathNode source, StrutsContentTypeFlow::PathNode sink
where StrutsContentTypeFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Potential Apache Struts Command Injection vulnerability."
```

This query uses the `TaintTracking` library to find data flows from calls to `getContentType()` (the source) to any OGNL evaluation that the standard library models (`OgnlInjectionSink`, the sink). If there is a data flow path, this could indicate a potential command injection vulnerability. Whether the query reports the original Struts bug depends on how well those sink models cover the framework's internals, so treat it as a starting point rather than a finished detector. The names used here follow recent releases of the Java library.

### CVE-2018-11776: Apache Struts Remote Code Execution Vulnerability

Another critical vulnerability in Apache Struts was identified by CVE-2018-11776. It allowed remote code execution through the use of a specially crafted URL. The root cause was insufficient validation of user-provided untrusted inputs.

The vulnerable pattern in the Struts code was as follows:

```java
String actionName = getActionMappingName(someUserInput);
// actionName used later without proper validation
```

A potential CodeQL query to detect this kind of pattern is:

```codeql
/**
 * @kind path-problem
 */

import java
import semmle.code.java.dataflow.TaintTracking

module StrutsActionNameConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source.asExpr().(MethodCall).getMethod().getName() = "getActionMappingName"
  }

  predicate isSink(DataFlow::Node sink) {
    // The sink is the first argument of one of these calls, not the call itself.
    exists(MethodCall ma |
      ma.getMethod().getName() in ["addActionError", "addActionMessage", "addFieldError"] and
      sink.asExpr() = ma.getArgument(0)
    )
  }
}

module StrutsActionNameFlow = TaintTracking::Global<StrutsActionNameConfig>;

import StrutsActionNameFlow::PathGraph

from StrutsActionNameFlow::PathNode source, StrutsActionNameFlow::PathNode sink
where StrutsActionNameFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Potential Apache Struts Remote Code Execution vulnerability."
```

This query uses taint tracking to find data flows from `getActionMappingName()` to the first argument of methods like `addActionError()`, `addActionMessage()`, and `addFieldError()`, which use the value without proper validation, potentially leading to remote code execution. The source and sink names are a simplification chosen to match the pattern shown above, and the module-based configuration style follows recent CodeQL releases.

## Incorporating CodeQL into Security Testing Practices

So far, you have learned how to use CodeQL to identify vulnerabilities in your code. However, to make the most of it, you should incorporate CodeQL into your regular security testing practices. This chapter outlines strategies to embed CodeQL in your organization's security practices, allowing for continuous security testing.

### Security Testing in CI/CD Pipelines

In modern software development, continuous integration and continuous deployment (CI/CD) pipelines are crucial. CodeQL can be integrated into these pipelines to automatically perform security analysis on your codebase whenever changes are made.

For example, GitHub offers a CodeQL GitHub Action that you can use in your GitHub workflows. It automatically scans your codebase whenever you push changes or make a pull request.

Here's a sample workflow configuration for a JavaScript project:

```yaml
name: "CodeQL"

on:
  push:
    branches: [ main ]
  pull_request:
    # The branches below must be a subset of the branches above
    branches: [ main ]

jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    permissions:
      # Required so the analyze step can upload results to code scanning
      security-events: write
      contents: read

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Initialize CodeQL
      uses: github/codeql-action/init@v3
      with:
        languages: "javascript"

    - name: Analyze
      uses: github/codeql-action/analyze@v3
```

In this workflow, the `github/codeql-action/init` step sets up the CodeQL tools and starts building a database for the languages you specify. The `github/codeql-action/analyze` step then finalizes the database, runs the CodeQL queries on it, and uploads the results to GitHub code scanning.

### Code Review and Bug Bounties

In addition to automated security testing, CodeQL can also be used in manual security testing practices, such as code reviews and bug bounty programs.

During code reviews, you can use CodeQL queries to look for problematic code patterns related to the changes being reviewed. This can make your code reviews more effective and help educate your developers about secure coding practices.

In bug bounty programs, you can provide CodeQL as a tool for bounty hunters. If they can formulate a CodeQL query that finds a bug, they can submit both the bug and the query. This way, you not only get the bug report, but also a way to prevent similar bugs in the future.

### Training and Awareness

Finally, CodeQL can be a great tool for security training and awareness. By teaching your developers how to use CodeQL, you can make them more aware of secure coding practices and how vulnerabilities arise in code.

For example, you can organize internal workshops where developers write CodeQL queries to find real-world vulnerabilities. You can also use the queries provided by CodeQL's standard libraries as examples to illustrate common vulnerabilities.

### Creating a Secure Development Lifecycle

Incorporating CodeQL into your CI/CD pipelines is just one part of creating a Secure Development Lifecycle (SDL). An SDL incorporates security practices into every stage of your development process, from design and development to testing and deployment.

In the design and development stage, you can use CodeQL to enforce secure coding standards. For example, you can write CodeQL queries that find violations of these standards, and use them to automatically comment on pull requests when violations are detected.

In the testing stage, you can use CodeQL as part of your security testing suite. By integrating CodeQL into your testing frameworks, you can automatically detect vulnerabilities in your codebase. For example, you could set up a nightly build that runs a suite of CodeQL queries against your codebase and alerts you to any new potential vulnerabilities.
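A nightly job needs some way to tell new findings from ones it has already reported. CodeQL can write its results in the SARIF format, so one option is to compare last night's file with tonight's. The sketch below assumes the standard SARIF layout (`runs`, `results`, `ruleId`, `locations`); the helper names and the sample data are made up for illustration.

```python
def result_keys(sarif: dict) -> set[tuple[str, str, int]]:
    """Collect (rule id, file, start line) for every result location in a SARIF log."""
    keys = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri", "")
                line = physical.get("region", {}).get("startLine", 0)
                keys.add((result.get("ruleId", ""), uri, line))
    return keys


def new_results(previous: dict, current: dict) -> list[tuple[str, str, int]]:
    """Return results that appear in `current` but not in `previous`."""
    return sorted(result_keys(current) - result_keys(previous))


def sample_result(rule_id: str, uri: str, line: int) -> dict:
    """Build a minimal SARIF result object (sample data only)."""
    return {
        "ruleId": rule_id,
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"uri": uri},
                "region": {"startLine": line},
            }
        }],
    }


xss = sample_result("js/reflected-xss", "src/app.js", 12)
sqli = sample_result("js/sql-injection", "src/db.js", 40)
last_night = {"runs": [{"results": [xss]}]}
tonight = {"runs": [{"results": [xss, sqli]}]}

print(new_results(last_night, tonight))  # [('js/sql-injection', 'src/db.js', 40)]
```

In practice you would load both files with `json.load` and send the returned list to your alerting channel. Keying on line numbers is a simplification, since unrelated edits shift lines; SARIF's `partialFingerprints` field exists to make this comparison more stable.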

In the deployment stage, you can use CodeQL to help with incident response. If a vulnerability is discovered in your application, you can use CodeQL to investigate the root cause and to find any other instances of the same vulnerability in your codebase.

### Writing Custom Queries for Your Codebase

While CodeQL comes with a set of standard queries for common vulnerabilities, you will get the most benefit from CodeQL by writing custom queries that are specific to your codebase. This can help you find vulnerabilities that are specific to the technologies and coding patterns you use.

For example, if your application uses a custom web framework, you could write a CodeQL query that finds instances where user input is not properly sanitized before being used in a SQL query. This can help you find potential SQL injection vulnerabilities that would not be caught by the standard queries.

You can also use custom queries to enforce secure coding standards. For example, you could write a query that finds instances where secure coding standards are not followed, such as using `eval()` in JavaScript, or not validating certificates in SSL connections.

### Training Developers and Security Teams

By teaching your developers and security teams how to use CodeQL, you can enhance their understanding of security vulnerabilities and how to prevent them. This can help reduce the number of vulnerabilities that are introduced into your codebase.

You can offer training sessions where developers and security teams learn how to write and use CodeQL queries. You can also encourage them to contribute to the CodeQL community by sharing their queries and collaborating with others to improve the security of open-source software.