Autoencoders: neural networks in the fight against plagiarism – effective autoencoders

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Recently, the autoencoder concept has become more widely used for learning generative models of data. Some of the most powerful AI systems of the 2010s involved sparse autoencoders stacked inside deep neural networks.
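To make the idea of "learning an encoding for dimensionality reduction" concrete, here is a minimal sketch in plain Python. It is not a production autoencoder: it assumes a purely linear 2 -> 1 -> 2 network, toy data lying on the line y = 2x, and plain stochastic gradient descent. The names `train_linear_autoencoder` and `mean_squared_error` are hypothetical and exist only for this illustration.

```python
import random

def train_linear_autoencoder(data, epochs=300, lr=0.02, seed=0):
    """Train a toy linear autoencoder (2 inputs -> 1 code -> 2 outputs)."""
    rng = random.Random(seed)
    w_enc = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # encoder weights
    w_dec = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # decoder weights
    for _ in range(epochs):
        for x in data:
            # Encode: squeeze the 2-D point into a single number.
            code = w_enc[0] * x[0] + w_enc[1] * x[1]
            # Decode and measure the reconstruction error.
            err = [w_dec[i] * code - x[i] for i in range(2)]
            # Backpropagate the error through the decoder to the code.
            back = err[0] * w_dec[0] + err[1] * w_dec[1]
            # Gradient step on 0.5 * ||reconstruction - x||^2.
            for i in range(2):
                w_dec[i] -= lr * err[i] * code
                w_enc[i] -= lr * back * x[i]
    return w_enc, w_dec

def mean_squared_error(data, w_enc, w_dec):
    """Average squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        code = w_enc[0] * x[0] + w_enc[1] * x[1]
        total += sum((w_dec[i] * code - x[i]) ** 2 for i in range(2))
    return total / len(data)

# Toy data: 2-D points that really only have one degree of freedom.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]

# epochs=0 returns the untrained (random) weights for comparison.
w_enc0, w_dec0 = train_linear_autoencoder(data, epochs=0)
error_before = mean_squared_error(data, w_enc0, w_dec0)

w_enc, w_dec = train_linear_autoencoder(data)
error_after = mean_squared_error(data, w_enc, w_dec)

print(error_before, error_after)
```

Because the toy data has only one real degree of freedom, a single code value is enough to reconstruct each point, so the error after training should be far smaller than before. Real autoencoders add non-linear layers and are usually built with a deep learning framework.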

There are many autoencoder architectures, and their applications depend on the architecture we choose. Here I'd like to focus on just one that is very effective and powerful from a practical point of view: semantic hashing. Autoencoders are, among other things, an alternative to another excellent family of networks known as RBMs (Restricted Boltzmann Machines). For instance, PayPal uses RBMs, in the form of a DBN (Deep Belief Network), to detect fraud.

Semantic hashing, introduced in 2008 by Ruslan Salakhutdinov and Geoffrey Hinton [2], is a technique used for efficient information retrieval: a document (e.g., an image) is passed through a system, typically a neural network, which outputs a fairly low-dimensional binary vector (e.g., 30 bits). Two similar documents are likely to have identical or very similar hashes. By indexing each document using its hash, it is possible to retrieve many documents similar to a particular document almost instantly, even if there are billions of documents: just compute the hash of the document and look up all documents with that same hash (or hashes differing by just one or two bits). One of the best fields of application for autoencoders is the fight against plagiarism. [1]
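The lookup step described above can be sketched in a few lines of Python. This is only an illustration of the retrieval side: it assumes the hashes have already been produced by some trained network, stores them as plain integers, and uses made-up 8-bit hashes and document names for readability. The helpers `build_index`, `nearby_hashes` and `lookup` are hypothetical names, not part of any library.

```python
from collections import defaultdict
from itertools import combinations

def build_index(hashes):
    """Group document ids by their binary hash (stored as an int)."""
    index = defaultdict(list)
    for doc_id, h in hashes.items():
        index[h].append(doc_id)
    return index

def nearby_hashes(h, n_bits, max_flips):
    """Yield h itself, then every hash that differs from h in up to max_flips bits."""
    for k in range(max_flips + 1):
        for positions in combinations(range(n_bits), k):
            flipped = h
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped

def lookup(index, h, n_bits=30, max_flips=2):
    """Return ids of documents whose hash is within max_flips bits of h."""
    found = []
    for candidate in nearby_hashes(h, n_bits, max_flips):
        found.extend(index.get(candidate, []))
    return found

# Made-up hashes, as if produced by a trained network (assumption).
hashes = {
    "essay_a": 0b10110010,
    "essay_b": 0b10110011,  # differs from essay_a in one bit
    "essay_c": 0b01001101,  # differs from essay_a in every bit
}
index = build_index(hashes)

# Query with a new document whose hash equals essay_a's hash.
similar = lookup(index, 0b10110010, n_bits=8, max_flips=1)
print(similar)  # ['essay_a', 'essay_b']
```

With 30-bit hashes and up to two flipped bits, a query probes 1 + 30 + 435 = 466 buckets, regardless of how many documents are indexed. That is why the search stays fast even for very large collections.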

1. Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (p. 443). O’Reilly Media. Kindle Edition.

2. Salakhutdinov, Ruslan, and Hinton, Geoffrey. Semantic Hashing.