Chopped, Stored, Secured — The Story of the Hash Function

A Python implementation for all of these algorithms is available in my GitHub repository.

Peter Luhn came up with the idea of using math to verify and store information, Birthday problem explains the collisions in Hash tables, Rabin-Karp algorithm uses rolling hash to search strings. We’ve talked so much about hash, yet not a single time was the definition of the hash function itself provided.

When did hashing first appear?

In 1956 the idea of hashing was for the first time defined by Arnold Dumey. His passion for cryptography since 14 and mathematics degree from Columbia University brought him first to the US Army Signal Corps, then to Potter Instrument (printer producing company also interested in engineering of computer memory). Later in his interview for the Charles Babbage Institute Dumey said: “I wrote a paper, which was the first paper on hash coding, based on the work I did there [at Potter Instrument]”. In that paper, Dumey described the concept of using a mathematical transformation to map data to memory addresses for faster search and called it indexing.

How did indexing work?

In his paper, Dumey gave clear instructions.

1. To store a word in memory, first transform it into a number.

Let’s take the word “BOX” as an example. We can convert it, as Dumey suggests, into a base 37 number or just use ASCII. We will go with the first option. To do that, we map letters to their alphabetical positions

	A=1
	B=2
	...
	X=24
	Y=25
	Z=26

Then we multiply them by powers of 37.

In our example, the numeric value of the word “BOX” is

\[2 \times 37^2 + 15 \times 37^1 + 24 \times 37^0 = 2738 + 555 + 24 = 3317\]

Early computer systems and punched tapes primarily dealt with letters and numbers only: 26 letters of the English alphabet, plus 10 numbers, and a space add up to 37 available symbols.

2. Find a prime number slightly less than the number of addressable locations.

In our example, we will have 100 addressable locations. In this case, the nearest prime number is 97.

3. Divide the numeric value by the prime number…

…then throw away the quotient and use the remainder. In other words, perform our favorite modulo operation!

\[3317 \pmod{97} = 19\]

This means the word “BOX” will be stored at memory address 19.

What’s the difference between indexing and hashing?

There’s none. Later indexing will be called a polynomial hash, which we’ve already covered in Rabin-Karp algorithm a few weeks ago. The word hash itself originates in the French word hacher (“to chop”, “to mince”) which pretty much reflects what a hash function does with information before storing it in memory. For some time it was just a widespread jargon among cryptographers. In the late 1960s, the term hash appeared officially for the first time in Herbert Hellerman’s “Digital Computer System Principles”.

So that’s how cryptographic hashes work?

Not really. The idea is the same: use mathematics to transform data. However, the goal now is to also make that data protected from attackers.

The definition of the cryptographic hash that I'll give here is based on a very interesting paper, "New Directions in Cryptography"(1976) by Whitfield Diffie and Martin E. Hellman. It's engaging, easy to read, and covers not only cryptographic hashing but also RSA, so I can really recommend it for further readings.

So what is a cryptographic hash?

Hash is a value used to secure vulnerable data. For example, websites store your login information as a hash, not as plain text, to protect it from being stolen by an attacker. When you register, you enter the password (PW) for the first time. The website produces a hash by using a one-way function f(x), also called a hash function, and stores f(PW). For each of your further logins, the website will calculate f(x) with the data you enter. Only if f(x) equals f(PW) will the password be accepted.

Since your password can still be stolen on its way from your computer to the website, password hashing is usually used together with encryption protocols like TLS to protect your data in transit.

Why is storing f(PW) safer than storing PW?

The main property of the one-way function that creates a hash is its irreversibility. It’s computationally impossible to go back from hash to plain text even if you have the function used for the transformation.

Computationally means that even with the most powerful computer, it will take ages to find the original value.

Why would it be so hard to reverse a hash function?

At school we learned that every function has an inverse function: addition is reversed by subtraction, multiplication by division, an exponent by a logarithm. However, for some functions, there’s simply no known “undo button”. Such functions are called preimage resistant. Let’s look at some examples.

1. High-degree polynomial

The polynomial of degree n looks like this:

\[p(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0\]

A computer can easily calculate p(x) for any value of x by just doing linear additions and multiplications. However, to solve p(x)=y for x is much harder. We all have solved polynomials of first, second degree, maybe even third degree using quadratic and binomial formulas. But what about a fifth degree polynomial? Tenth? According to the Abel-Ruffini Theorem, “The general polynomial of degree n is not solvable by radicals for n ≥ 5”, meaning that there’re no shortcuts or special formulas for solving a high degree polynomial. Therefore, it leaves the attacker no choice but to use iterative algorithms, basically guessing the roots.

But even this is not enough. These high-degree polynomials have to be over a finite field Fp, which means after each operation a modulo of a large prime number is taken from the intermediary result. (Why “a big prime number” is explained here)

2. Discrete Exponentiation.

In other words, once again our favorite modulo operation performed on each step of bringing a to the power of x:

\[y = a^x \pmod{q}\]

This function is also easily calculated for each x. However, its reverse, logarithm over a finite field is again a hard task for an attacker.

Why do these functions have to be over a finite field, and what is a finite field?

Finite field Fp (or Galois field GF(p)) is a field with a finite amount of elements. This amount must equal the power of a prime number. For example, a finite field of 7 can only have values 0, 1, 2, 3, 4, 5, 6. No values in between like 5.99, only integers. In comparison, field R (Real Numbers) includes infinite number of elements.

In school algebra, we got used to working with infinite fields, where the graph is a curve that can be analyzed, traced, and predicted. Finite fields can be imagined more like chaotic dots, where x for y=101 and x for y=100 aren’t even close values. So instead of a “hot and cold” game, the attacker is left with no choice but to brute force the function. It is a very, very, very long process to check, let’s say, 2^256 inputs.

Are these functions always secure?

The fewer collisions there’re, the better. Imagine there’s another X that gives f(X)=f(PW). In this case it doesn’t matter which value the attacker finds, X or PW. Since f(X)=f(PW), he will still successfully pass the login.

Another problem has emerged recently. Remember how we said that it was computationally impossible to reverse the one-way function? Not anymore. Quantum computers exceed the limits of modern computers, making brute force much faster. This creates a need for post-quantum cryptography.

What else can hash be used for?

Other applications of a cryptographically secure hash function include digital signatures for document verification and blockchain. (We’ll cover them in more detail next time.)

My sources and further readings:

Arnold Dumey “Computers and Automation”
Arnold Dumey’s Interview for Charles Babbage Institute
Diffie and Hellman “New Directions in Cryptography”
Wikipedia about hash function
Chicago University about Abel-Ruffini Theorem