What is a perceptual hash?

A perceptual hash is a fingerprint of a multimedia file derived from features of its content. Cryptographic hash functions rely on the avalanche effect, in which a small change in the input leads to a drastic change in the output. Perceptual hashes, by contrast, are "close" to one another when the underlying features are similar.
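To make "close" concrete, here is a minimal Python sketch (not pHash's C++ API). It assumes hashes are compared by Hamming distance, that is, by counting differing bits. The two 64-bit "perceptual" values are made up for illustration, and SHA-256 stands in for a cryptographic hash.

```python
import hashlib


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def sha256_int(data: bytes) -> int:
    """SHA-256 digest of `data` as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


# Hypothetical 64-bit perceptual hashes of an image and a lightly edited copy.
# The values are invented for illustration; they differ in three bits.
original = 0xF0E1D2C3B4A59687
edited = original ^ 0b1011

# Cryptographic digests of two inputs that differ in a single character.
digest_a = sha256_int(b"perceptual hash")
digest_b = sha256_int(b"perceptual hasi")

if __name__ == "__main__":
    print("perceptual:", hamming_distance(original, edited), "of 64 bits differ")
    print("sha-256:   ", hamming_distance(digest_a, digest_b), "of 256 bits differ")
```

With the avalanche effect, roughly half of the 256 digest bits can be expected to differ, while the two perceptual values stay within a few bits of each other.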

Relevance of Perceptual Hashing

Perceptual hashes must be robust enough to withstand transformations, or "attacks", on a given input, yet discriminating enough to distinguish between dissimilar files. Such attacks can include rotation, skew, contrast adjustment, and different compression formats. These challenges make perceptual hashing an interesting field of study at the forefront of computer science research.

What is pHash?

pHash is an open source software library released under the GPLv3 license that implements several perceptual hashing algorithms, and provides a C++ API to use those functions in your own programs.

pHash 0.5 Released

07.02.2009 Major feature enhancements: added a perceptual text hash based on Karp-Rabin using cyclic polynomials, Java bindings, and a new versatile index structure for quick storage, search, and retrieval of hash values. The build system has also switched to GNU autoconf. Download.

News and Updates:

06.29.2009 Custom index technique added for quick storage, search and retrieval of all hash values within a given distance of a query. This technique uses a specially developed file format for persistent storage and can be used for virtually any size hash and distance metric. Preliminary testing reveals a 300% improvement in search time over a simple linear search. For image or audio hashes, additional storage amounts to less than 0.05% of the space used by the actual files. To be included in the 0.5 release!
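The kind of query described here, "all hashes within a given distance", can be sketched in Python as follows. pHash's own index and file format are not described on this page, so the sketch swaps in a BK-tree, a different and well-known structure for range queries under an integer-valued metric, and places it next to the simple linear search used as a baseline. All names are hypothetical and Hamming distance is an assumed metric.

```python
import random


def hamming(a: int, b: int) -> int:
    """Hamming distance between two integer hashes."""
    return bin(a ^ b).count("1")


def linear_search(hashes, query, radius):
    """Baseline: compare the query against every stored hash."""
    return [h for h in hashes if hamming(h, query) <= radius]


class BKTree:
    """BK-tree over any integer-valued metric (an illustrative stand-in)."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (value, {edge_distance: child_node})

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (value, {})
                return
            node = child

    def search(self, query, radius):
        """Return every stored value within `radius` of `query`."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= radius:
                results.append(value)
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results


if __name__ == "__main__":
    rng = random.Random(1)
    stored = [rng.getrandbits(64) for _ in range(2000)]
    tree = BKTree(hamming)
    for h in stored:
        tree.add(h)
    query = stored[0] ^ 0b101  # a stored hash with two bits flipped
    print(sorted(tree.search(query, 4)) == sorted(linear_search(stored, query, 4)))
```

Both functions return the same set of hashes; the tree simply avoids visiting subtrees that the triangle inequality rules out.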

06.22.2009 Support for text hashing is now in the library. Although support is limited to plain-text documents encoded in UTF-8 for now, the functions allow for a quick scan of documents to find string matches and their offsets. Expect this in the next release.
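The 0.5 release notes describe the text hash as based on Karp-Rabin using cyclic polynomials. The Python sketch below illustrates that general technique, not pHash's actual code: a rolling hash is updated in constant time as a fixed-size window slides over the text, and windows whose hash equals the pattern's hash are verified and reported by offset. The 64-bit width, the seeded random byte table, and all function names are assumptions made for the example.

```python
import random

BITS = 64
MASK = (1 << BITS) - 1

# Assumed substitution table: one pseudo-random 64-bit value per byte value.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(BITS) for _ in range(256)]


def rotl(x: int, r: int) -> int:
    """Rotate a BITS-wide integer left by r positions."""
    r %= BITS
    return ((x << r) | (x >> (BITS - r))) & MASK


def window_hash(window: bytes) -> int:
    """Cyclic-polynomial hash of a whole window, computed from scratch."""
    h = 0
    for b in window:
        h = rotl(h, 1) ^ TABLE[b]
    return h


def rolling_hashes(data: bytes, k: int):
    """Yield (offset, hash) for every k-byte window of data."""
    if k <= 0 or len(data) < k:
        return
    h = window_hash(data[:k])
    yield 0, h
    for i in range(k, len(data)):
        # Rotate the hash, cancel the outgoing byte (whose table value has by
        # now been rotated k times), and mix in the incoming byte.
        h = rotl(h, 1) ^ rotl(TABLE[data[i - k]], k) ^ TABLE[data[i]]
        yield i - k + 1, h


def find_matches(pattern: bytes, text: bytes):
    """Offsets in text where pattern occurs, found via the rolling hash."""
    k = len(pattern)
    if k == 0:
        return []
    target = window_hash(pattern)
    return [
        off
        for off, h in rolling_hashes(text, k)
        if h == target and text[off:off + k] == pattern  # rule out collisions
    ]


if __name__ == "__main__":
    text = "perceptual hash, rolling hash".encode("utf-8")
    print(find_matches(b"hash", text))
```

The byte comparison after a hash hit keeps the result exact even if two different windows happen to collide.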

06.05.2009 Changed the build system to use the GNU autoconf tools. This should make it easier to build and install the pHash library and program files.

04.15.2009 Java bindings for all pHashlib functions.

02.03.2009 pHash now supports hashing for audio files. Derived from frequency spectrum data along the Bark scale, this hash is based on characteristics that tend to be the most prominent for the human auditory system. Furthermore, the number of hashes generated per file varies according to the number of samples in the audio file, so short clips can be matched to longer sound files. Naturally, the longer the clip, the more reliable the match. So far, this has proven to work well with 30-second music clips altered by MP3 compression, telephone-simulated filtering, or both.

11.04.2008 The DCT hash method has been adapted to video. This is useful for short video clips only, since the entire video is condensed to a fixed-length hash.

10.24.2008 Support for an image hash based on the discrete cosine transform. The DCT is a quick and efficient way to compute a hash from the frequency data of the underlying image. While it is generally not sophisticated enough to identify visually similar images in any semantically meaningful way, it is fairly robust against minor distortions of the image, such as blurring, rotation and different compression formats.
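As a rough illustration of how frequency data can become a hash, the Python sketch below follows a commonly described recipe, which is an assumption here and not taken from pHash's source. It starts from a small grayscale image given as a list of rows, applies a 2D DCT, keeps an 8x8 block of low-frequency coefficients, and sets one bit per coefficient depending on whether it exceeds the block's median. The 32x32 input size, the block size, and all names are hypothetical.

```python
import math
import random


def dct_1d(v):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(v)
    return [
        sum(v[x] * math.cos(math.pi * (2 * x + 1) * k / (2 * n)) for x in range(n))
        for k in range(n)
    ]


def dct_2d(m):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in m]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dct_hash(pixels, keep=8):
    """keep*keep-bit hash from the low-frequency DCT coefficients."""
    coeffs = dct_2d(pixels)
    # Skip the first row and column so that uniform brightness shifts,
    # which only change the (0, 0) coefficient, do not affect the hash.
    low = [coeffs[u][v] for u in range(1, keep + 1) for v in range(1, keep + 1)]
    median = sorted(low)[len(low) // 2]  # upper median of the block
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic 32x32 "image"; a real one would be resized and converted
    # to grayscale first.
    image = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
    brighter = [[p + 20 for p in row] for row in image]
    other = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
    print("brightened copy:", hamming(dct_hash(image), dct_hash(brighter)))
    print("unrelated image:", hamming(dct_hash(image), dct_hash(other)))
```

Thresholding against the median, not a fixed value, is what makes this sketch insensitive to uniform changes in contrast as well as brightness.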

That's great but what is it good for?

Potential applications include copyright protection, similarity search for media files, or even digital forensics. For instance, an artist could build a database of hashes from a corpus of their works and check suspect documents against it. A web crawler could be dispatched to patrol a configured list of sites, grabbing data to compare against the corpus.

Have another use for pHash? Let us know!