Not only is md5sum not proven to have equal dostribution, it is specifically known to not have equal distribution, only nearly equal distribution.
A hash function is any function that converts an arbitrary input size into a specific output size deterministically. No other requirements are there. A hash for a simple job could be just adding the ascii values together and give the output. Needless to say, that would not have an even distribution.
You are correct for regular hash functions, but a cryptographic hash function has stronger requirements.
MD5 was supposed be a cryptographic hash function, but it was found to be flawed all the way back in 1996, and has been discouraged ever since… Now it’s too weak to be used in a cryptographic setting, and too slow to be used in non-cryptographic settings.
This is why hashes like xxhash is considered a non-cryptographic hash function, while SHA-256 is considered a cryptographic hash function.
This is about cryptographic hashing functions (to be fair, could have spelled that out in my prior comment, but in general when someone talks about anything security relevant in conjunction with hashing, they always mean cryptographic hashing functions).
MD5 is not a cryptographic hashing function for exactly these reasons.
Also, the example you gave in your original comment wasn’t actually about distribution but about symbol space.
By multiplying by four (and I guess you implicitly meant that the bit length of the hash stays the same thus dropping two bits of information) you are reducing the number of possible hashes by a factor of four, because now 3/4 of all potential hashes can’t happen any more. So sure, if your 64bit hash is actually only a 62bit hash that just includes two constant 0 bits, then of course you have to calculate the collision chance for 62bits and not 64bits.
But if all hashes are still possible, and only the distribution isn’t perfectly even (like is the case with MD5), then the average chance for collisions doesn’t change at all. You have some hashes where collisions are more likely, but they are perfectly balanced with hashes where collisions are less likely.
The article calls out hash functions and links to the relevant Wikipedia page, so I don’t think this is solely about cryptographic hash functions, though that seems to be what you were talking to the other user about.
I see what you are saying. But if you aren’t using a cryptographic hash function then collisions don’t matter in your use case anyway, otherwise you’d be using a cryptographic hash function.
For example, you’d use a non-cryptographic hash function for a hashmap. While collisions aren’t exactly desireable in that use case, they also aren’t bad and in fact, the whole process is designed with them in mind. And it doesn’t matter at all that the distribution might not be perfect.
So when we are talking about a context where collisions matter, there’s no question whether you should use a cryptographic hash or not.
Of course. That’s one of the basic requirements for something to be an actual hash function.
Not only is md5sum not proven to have equal dostribution, it is specifically known to not have equal distribution, only nearly equal distribution.
A hash function is any function that converts an arbitrary input size into a specific output size deterministically. No other requirements are there. A hash for a simple job could be just adding the ascii values together and give the output. Needless to say, that would not have an even distribution.
You are correct for regular hash functions, but a cryptographic hash function has stronger requirements.
MD5 was supposed be a cryptographic hash function, but it was found to be flawed all the way back in 1996, and has been discouraged ever since… Now it’s too weak to be used in a cryptographic setting, and too slow to be used in non-cryptographic settings.
This is why hashes like xxhash is considered a non-cryptographic hash function, while SHA-256 is considered a cryptographic hash function.
This is about cryptographic hashing functions (to be fair, could have spelled that out in my prior comment, but in general when someone talks about anything security relevant in conjunction with hashing, they always mean cryptographic hashing functions).
MD5 is not a cryptographic hashing function for exactly these reasons.
Also, the example you gave in your original comment wasn’t actually about distribution but about symbol space.
By multiplying by four (and I guess you implicitly meant that the bit length of the hash stays the same thus dropping two bits of information) you are reducing the number of possible hashes by a factor of four, because now 3/4 of all potential hashes can’t happen any more. So sure, if your 64bit hash is actually only a 62bit hash that just includes two constant 0 bits, then of course you have to calculate the collision chance for 62bits and not 64bits.
But if all hashes are still possible, and only the distribution isn’t perfectly even (like is the case with MD5), then the average chance for collisions doesn’t change at all. You have some hashes where collisions are more likely, but they are perfectly balanced with hashes where collisions are less likely.
The article calls out hash functions and links to the relevant Wikipedia page, so I don’t think this is solely about cryptographic hash functions, though that seems to be what you were talking to the other user about.
I see what you are saying. But if you aren’t using a cryptographic hash function then collisions don’t matter in your use case anyway, otherwise you’d be using a cryptographic hash function.
For example, you’d use a non-cryptographic hash function for a hashmap. While collisions aren’t exactly desireable in that use case, they also aren’t bad and in fact, the whole process is designed with them in mind. And it doesn’t matter at all that the distribution might not be perfect.
So when we are talking about a context where collisions matter, there’s no question whether you should use a cryptographic hash or not.