Last December rockyou.com was compromised and 32 million passwords were leaked. Imperva and others have published some basic statistics about these passwords, but much more can be learned from this archive. Indeed this material gives very valuable information about how real users choose a password, and it can help to improve your cracking rules. For instance, here are a few observations obtained by simple greping :

Rockyou.com users clearly love numbers: 54% of all passwords contain at least a digit and 16% of them are entirely made of digits. Out of all the passwords that contain at least a digit, 43% of them use either a single digit or two consecutive ones. 91% of users put these one or two digits at the end of their password.

Punctuation characters are much less popular than numbers, as less than 4% of all passwords contain at least one. When rockyou users chose a punctuation character, 21% of the time they took a ‘!’ and 85% of the time they put it at the end.

Many more interesting patterns can be obtained from the rockyou archive. And all these patterns translate nicely into cracking rules for your favourite password cracker which will largely improve your cracking performances:

I believe the main problem with passwords it this: to select a password most people first pick a word or an easy sequence they can remember, and then they modify it to comply with the local password policy. But these additional modifications are very similar across people: usually digits and punctuation go to the end, while capital letters come at the beginning. Other linguistic patterns are strong: if people insert an opening parenthesis, they will most of the time close it within the same password (85% do in the rockyou archive).

Also people are lazy. If the password policy requires at least one capital letter, let’s use just one; after all, holding the shift key too long is kind of annoying. And if that painful administrator forces me to change my password every 2 months, why create a complete new password every time when I can just increment the last digit and pass the check ?

Now that large lists of real-world passwords are becoming more and more available, people are trying to automate the extraction of efficient rules from them. Two efforts worth mentioning are looking for hidden Markov models in passwords: the Markov generator by Solar Designer and the Probabilistic Password Cracker by Matt Weir.

## Combining cryptographic hashes, part 3

In previous posts I discussed how Joux’s multicollision attack significantly decreases the efficiency of concatenating hash functions and how it can be used to break the birthday limit of md5. Now the obvious remaining question is: if concatenation is not effective, are there better ways to combine two hash functions ? As often in cryptography, it is very easy to come up with a solution which is neat, simple, and wrong: for instance, would you have said that $F(M) || G(F(M) \oplus M)$ is insecure ? Luckily the smart people at the Emmy Noether Research Group give us two examples of good combiners.

Hash-tree based concatenation
In a paper from 2007, Fischlin and Lehmann introduce a security-amplifying combiner based on the concatenation combiner. Their idea is quite simple : to build a $2^n$ multicollision on an n-bits iterative hash function, Joux’s attack requires n consecutive calls to the compression function. In other words, the colliding messages must have a size equal to or larger than $n * B$ bits, where B is the message block size of the hash function.

Conversely, limiting the size of the input messages would also decrease the number of multicollisions found by Joux’s attack, and therefore the probability of finding a pair of messages which collide in both hash functions. For instance, for messages of size $n/4$ blocks, Joux’s attack can only build a $2^{\frac{n}{4}}$ multicollision, and the probability of finding a pair of colliding messages drops to $2^{-\frac{n}{2}}$.

Input messages obviously have an arbitrary size, so to ensure that the concatenation combiner $H = F || G$ only process a small number of message blocks, a hash-tree structure can be used. For instance, if we decide to break the message in groups of 4 blocks the hash-tree would look like this:

The size of the groups is a tradeoff between security and speed, but you should be safe as long as you keep the groups smaller than $n/4$ blocks.

The $C_4p$ combiner
In 2009 the same Fischlin and Lehmann introduced a new family of secure combiners. The basic combiner of this family is called $C_4p$, and it has the following construction:

where:
$H_0$ and $H_1$ are the two hash functions to combine
$\oplus$ is the XOR operator
$H_\oplus = H_0 \oplus H_1$
$H_k^i(M) = H_k(i || M)$, in other words the binary representation of $i$
is preprended to the message $M$ before hashing with $H_k$ (k=0 or 1)
$H_\oplus^i(M) = H_0^i(M) \oplus H_1^i(M) = H_0(i||M) \oplus H_1(i||M)$

With a combination of XOR operations, concatenations and permutations, the $C_4p$ combiner outputs $2n$bits hash values (same length as the classical concatenation combiner) and is robust against collisions. In their article the authors describe several modifications of $C_4p$ which provide additional properties, such as pre-image robustness and indifferentiability from random oracles.

Last words
In march Fischlin and Lehmann will present at the CT-RSA 2010 conference a paper titled “Hash Function Combiners in TLS and SSL”, it should be fun 🙂

## Combining cryptographic hashes, part 2

In a previous post I discussed Joux’s multicollision attack and how it can significantly decrease the efficiency of concatenating hash functions. Today I will discuss a recent paper (December 2009) by Mendel, Rechberger and Schlaffer, which shows how specific attacks against two hash functions can be used against the concatenation combiner. Using this attack, the authors can find collisions in md5||SHA1 with a complexity around $2^{59}$, thus breaking the $2^{64}$ birthday limit.

First, let’s restate the old result : the cost to find collisions in F||G with Joux’s multicollision attack is $\frac{n}{2}.2^{\frac{m}{2}} + 2^{\frac{n}{2}}$ where m and n are the respective output lengths of F and G. This total is calculated by adding the cost of finding a $2^{\frac{n}{2}}$ multicollision in F ($\frac{n}{2}.2^{\frac{m}{2}}$) and the cost of hashing these $2^{\frac{n}{2}}$ messages to find a collision in G. Now even if there were a better attack against F, this would decrease the first term of the total cost, but not the second one (the verifying step). As a result the total cost would always be higher than $2^{\frac{n}{2}}$, so Joux’s attack is clearly not sufficient to break the birthday limit.

Let’s now see the new attack, and to simplify things let’s consider that m=n. The new attack is based on two assumptions:

• F suffers from a type-1 collision attack, meaning that given a message block m0 we can find m1 and m1′ such that $F(m_0||m_1)=F(m_0||m_1\prime)$ with a complexity lower than $2^{\frac{n}{2}}$
• G suffers from a type-3 collision attack, meaning that given two message blocks m2 and m2′ we can find m3 such that $G(m_2||m_3)=G(m_2\prime||m_3)$ with a complexity lower than $2^{\frac{n}{2}}$

The attack then goes as follows:

1. Using the type-1 collision attack in F, find k pairs of message blocks $(m_1,...,m_k) (m_1\prime,...,m_k\prime)$ such that for any $i \leq k$, $F(m_1||...||m_i)=F(m_1\prime||...||m_i\prime)$. The value of k depends on the size and structure of G. Following Joux' attack, this effectively gives a $2^k$ multicollision on F, in other words $2^k$ messages which hash to the same value through F.
2. Using these $2^k$ messages as prefix, use the type-3 collision attack on G to find a message block p such that $G(m_1||...m_k||p) = G(m_1\prime||...m_k\prime||p)$. A large part of the paper describes how to perform this attack in an efficient way, using a combination of birthday techniques and differential shortcut techniques.
3. Finding p also gives a collision on F, as we append p to a known pair of colliding messages for F.

This is the first documented attack where a weakness in two hash functions can be combined to break the birthday limit of the concatenation, and it will be interesting to see if it can be applied to other hash functions.

## O’Donnell, Cohen and the perfect anti-virus

Recently I stumbled upon this post where Adam O’Donnell discusses the impossibility of building a perfect anti-virus. He then pointed me to a 1984 paper by Cohen, which is apparently considered to be the first demonstration that no algorithm can correctly detect all possible computer viruses. Cohen’s proof goes like this :

1. Let’s define a computer virus as a program that can infect other programs by modifying them to include a possibly evolved copy of itself.
2. Let AV be a candidate virus detection algorithm. AV takes a program as input and returns true if it detected a virus, false otherwise.
3. Let P be a program which contains the single following pseudocode statement:

if AV(P) exit(); else infect_others();

In other words, P calls the detection algorithm AV on itself. Provided that AV returns, two possibilities exist :

4. If AV(P) returns true, P exits without infecting other programs. Thus by identifying P as a virus, AV gives a false positive, since P will not act as a virus.
5. If AV(P) returns false, P will infect other programs. Thus AV produced a false negative, as it did not identify P as a virus, although P is one.
6. From 4 and 5 : in both cases AV does not return the correct result when called on P, thus AV is not a perfect virus detection algorithm.

This proof is pretty weird, as it mixes up being a virus and behaving like a virus. P is a virus, both according to common sense and Cohen’s own definition 1 — in fact, P cannot do anything but infect other programs. Since P is obviously a virus, its detection at step 4 is not a false positive, even if no infection occurs during a particular run of P. So the contradiction at step 6 disappears and thus the conclusion.

What I find really puzzling in Cohen’s reasoning is that he considers a program P which he completely defines by its source code. But is P a virus ? Funnily enough, even with the source code Cohen cannot answer until P has been executed, as the answer depends on the output of AV(P). Contrary to his own definition, step 4 is actually based on behaviour, which is clearly not sufficient to decide if a program is a virus (slow/sparse infectors, logic bombs, etc).

Last words
Not only I find Cohen’s proof very unconvincing, but I also find it quite amusing from Mr O’Donnell to use a theoretical result to argue that making good anti-viruses is difficult. Imagine a football coach asking a player to run faster, and the guy objecting : “sorry coach, but running fast is hard because there’s a theoretical limit to the speed I can reach, called C …”. In both cases I think their actual performances are much too far from the theoretical limit to use it as an excuse.

## On concatenating hashes

This post was prompted by a twit from Didier Stevens about a presentation from Bart Preneel at the OWASP Benelux conference. On slide 17, citing a 2004 paper from Joux, Dr Preneel writes that “the concatenation of 2 iterated hash functions is as most [sic] as strong as the strongest of the two (even if both are independent)“. This was widely understood as “concatenating hashes results in an equal or lower security”, which is not really correct as this post will show.

Hash basics
A hash function is an algorithm which takes variable size input message and returns a fixed size output. For a good hash, we want three properties:

• collision resistance: it is very difficult to find two input messages which hash to the same value.
• preimage resistance: given a hash value, it is very difficult to find a message which hashes to this value.
• 2nd preimage resistance: given a message, it is very difficult to find another message which hashes to the same value.

For an ideal n-bits hash, that is a hash against which no better method than exhaustive search is known, the complexity of finding a collision is $2^{\frac{n}{2}}$ (due to the birthday paradox), and the complexity of finding the first and second preimages is $2^n$. When concatenating two ideal hashes of n and m bits, the complexity of the combined hash becomes respectively $2^{\frac{n}{2}+\frac{m}{2}}$ for collisions and $2^{n+m}$ for preimages. Finally, a k-multicollision for a given a hash function is a set of k messages which hash to the same value. The complexity of finding a k-multicollision for an ideal n-bits hash is $2^{\frac{n(k-1)}{k}}$.

Iterative hashes
Now let’s consider an ideal iterative n-bits hash function (also known as a Merkle–Damgård construction), which has the following structure :

###### source: wikipedia

With such a function, the hash value is calculated by repeatedly iterating a compression function f over a combination of a message block and the previous output of the compression function. The first iteration of the compression function uses the first message block and a fixed initialization vector (IV). Nearly all hashes created before 2004 use this structure, including MD5, SHA-1, SHA-256, SHA-512, RIPEMD-160 and Tiger.

Joux’s multicollision attack
In his paper, Joux shows that the iterative structure of a hash actually helps an attacker to find multicollisions. In short, he proves that finding a $2^t$ multicollision (that is, $2^t$ messages which hash to the same value) only takes t times the efforts needed to find one collision. This result has a huge impact on the security of concatenated hashes, as Joux describes in the following two attacks.

Let’s consider two ideal iterative hash functions F and G of respective output lengths m and n bits (n >= m). One can find collisions in F||G in two steps :

1. create a $2^{\frac{n}{2}}$ multicollision in F, which has a complexity in the order of $\frac{n}{2}2^{\frac{m}{2}}$.
2. hash these $2^{\frac{n}{2}}$ messages with G and look for a collision. Due to the birthday paradox, there is a good chance one will occur.

For a preimage attack, given two hash values Hf and Hg, we look for a message M such that F(M)=Hf and G(M)=Hg. 3 steps :

1. create a $2^n$ multicollision in F, which has a complexity of $n2^{\frac{m}{2}}$
2. by exhaustive search, find a last block to append to one of the previous messages to obtain the desired F(M) hash value. With complexity $2^m$, this gives $2^n$ messages which hash to the desired F(M) value
3. by exhaustive search, hash these $2^n$ messages and look for the desired G(M) value. Due to the birthday paradox, there’s a good chance one will occur

In the end, the complexity of finding a collision in the combined hash is in the order of $\frac{n}{2}2^{\frac{m}{2}}+2^{\frac{n}{2}}$ and the complexity of finding preimages is $n2^{\frac{m}{2}}+2^m+2^n$, which is much lower than the theoretical values. Furthermore, it is important to note that only F was required to be iterative for these attacks, no assumption was made about G to build these attacks. For instance, let’s consider two 160 bits hashes which have respective strengths of $2^{80}$ against collisions and $2^{160}$ against preimages : by concatenating these two hashes (thus obtaining a 320 bits hash value), the respective strengths become about 87 bits against collisions and 161 bits against preimages.

Last words
Contrary to the general interpretation of this OWASP presentation, concatenating two good hash functions (with a least one of them being iterative) does not decrease the security, but brings a very small improvement. However, two important questions remain:

1. Joux’s attack uses the birthday paradox on ideal hashes, but what if another effective attack against one of the hashes exists ? Could concatenating a broken hash with a bad one yield a weaker security than the good hash alone ?
2. If concatenation is not effective, are there better ways to combine hashes so that the resulting combination is much stronger than any of the two ?

Any idea ? 🙂