# » Hash functions

This is the english translation of the first Offensive Security lesson I teach at uni. It is an introduction course to hash functions.

# Definition

Suppose you share a huge file with a friend, but you’re not sure if you both have the same version of the file. You could send your version of the file to your friend and he could compare it to his version. Is there a way to verify that involves less communication than that?

Let’s call your version of the file $x$ (a string) and your friend’s version $y$. The goal is to determine whether $x = y$. A natural approach is to agree on a deterministic function $H$, compute $H(x)$ and send it to your friend. Your friend can calculate $H(y)$ and, since $H$ is deterministic, compare the result to your $H(x)$. For this method to be foolproof, $H$ must have the property that different inputs always correspond to different outputs - in other words, $H$ must be **injective**.

Definition:

A hash function matches arbitrary strings of data to a fixed-length output (called a hash, or digital fingerprint) in a deterministic, public, “random” way.

Important points:

- Arbitrary data strings.
- Fixed length output ($d$).
- Deterministic way: the same input will generate the same output.
- Public way: no secrecy required.
- “Random”: true randomness is almost impossible to have.

The mathematical representation of a hash function is:

$$ H: \lbrace 0,1 \rbrace^* \rightarrow \lbrace 0,1 \rbrace^d $$

where $\lbrace 0,1 \rbrace^*$ represents a string of data (0 or 1) of arbitrary length and $\lbrace 0,1 \rbrace^d$ means a string of data (0 or 1) of length $d$.

This operation is **non-reversible**, which means that it is impossible to recover the original data from the hash.

There is a **collision** in $H$ if for a pair of inputs $(x, y)$, $x \ne y$ and $H(x) = H(y)$. Collisions are widely considered undesirable but are very difficult to avoid, because of the difference in the size of the input set (any file of any size) and the output of the function (a hexadecimal string encoded very often on 32 or 64 bytes). Collisions are nevertheless considered rare, or even impossible (for the best algorithms) thanks to the mathematical complexity of hash algorithms. It is this property that guarantees that the signature of a password or a file is unique.

# Uses of hash functions

## Passwords

We have seen that it is impossible to retrieve the data that allowed to generate a hash. This property makes the hash ideal for storing passwords in a database. Indeed, in the event of a data leak, the passwords are not in clear text. On the other hand, the hashes of passwords considered as weak are easily retrievable, the hash does not protect passwords like `password123`

. Indeed, a password considered simple can easily be recovered via *social engineering* or OSINT (dog’s name + year of birth), or exist in pre-hashed password databases such as CrackStation^{1}.

In order to overcome the so-called *rainbow table* attacks, there are hash algorithms that use a *salt* to complicate the process *a little*. The *rainbow tables* are data structures that allow to retrieve a password from its fingerprint, in an optimized way based on pre-computed tables (time-memory compromise process^{2}). It is important to keep in mind that a *salt* is not a secret. It just serves to disrupt the hash calculation, so that the same entry with a different *salt* will give two different fingerprints, making *rainbow tables* completely useless.

To illustrate that bruteforce is very expensive, even with an algorithm like SHA-256, we can do the following experiment. We have the French dictionary which contains 346200 entries.

```
$ wc -l /usr/share/dict/french
346200 /usr/share/dict/french
```

We can design a small script that, for each entry in the dictionary, will hash it. We redirect the output to `/dev/null`

because we don’t really need it. What we are interested in here is the execution time of this script.

```
#!/bin/bash
counter=0
tot=$(wc -l /usr/share/dict/french | awk '{print $1}')
while read l; do
echo -n $l | sha256sum > /dev/null 2>&1
counter=$(("$counter" + 1 ))
echo "${counter}/${tot}"
done < /usr/share/dict/french
```

You can run it with the `time`

utility, and even get notified when the script finishes:

```
$ time ./script.sh && dunstify -u critical "script has ended"
```

## Integrity check

Because of their deterministic and unique properties, hash functions are widely used to verify the integrity of files when sharing them. Often, people who share software or documents online also share the hash of the document, called checksum, and specify the algorithm that was used to generate it. This allows anyone who wants to retrieve the document to verify that it is the right one.

## Hash tables

Let’s say we have an array containing names that are not sorted:

To find a name without knowing its index, one could use a bruteforce approach as a linear search (all items one after the other until one finds the one one we are looking for), with a complexity of $O(n)$. This can be quite long for a large table.

Now, let’s imagine that we know the index of the first name. In this case, it is much easier. The complexity is $O(1)$.

Question: how can we know which element of the array contains the name we are looking for, i.e. how can we know the index?

Each index can be calculated using the value itself, so the index is linked to the value.