I am a huge fan of Google and its products. I still marvel at the fact that a research paper, published as part of a doctoral project, could revolutionize the way we perceive and use the Internet! Google is the best at what it does, and seeing 234,754 search results returned in 0.15 seconds has become so commonplace that our jaws don't even drop. Ever wondered what goes on behind that search window? Ever thought about how a search engine indexes, aggregates and generates all of these results accurately in the wink of an eye?
Since dissecting Google to see how it works would make for a very long read, I will divide it into two blog posts. The present one deals with how Google indexes and stores web pages, while the next pertains to the actual algorithm it uses to rank them.
Indexing web pages is an interesting operation in itself. You type a search query and expect the search engine to deliver results QUICKLY and ACCURATELY. The speed with which results are displayed depends on how fast the search engine can sift through the indexes it creates. It is analogous to finding a particular word in a dictionary in your hand. Imagine if the dictionary weren't arranged alphabetically!
The mighty task of indexing begins with analyzing the web's content, a majority of which is text based. In tech speak, this process is called parsing, and its purpose is to create a text index known as an inverted file.
Parsing
Every English sentence consists of words, white spaces and punctuation marks. A good parsing algorithm extracts the words from a passage of text. Unlike humans, a computer recognizes words simply as the bits between white spaces, so it must also handle words attached to punctuation marks. As each word (or token) is identified, several characteristics may be stored: its case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, such as 'noun' or 'verb'), position, sentence number, sentence position, length and line number. This is useful when processing phrases (e.g. "digital cameras") as search queries.
Another desirable characteristic of a parser is that it should skip overly common words. Words like 'a', 'is', 'the', 'in' and so on (even the last three words of this very list) contribute little to search queries, hence they must be duly omitted from the index. Such words are known as stop words.
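The tokenization and stop-word removal described above can be sketched in a few lines. This is a minimal illustration, not how Google actually parses text; the stop-word list here is a tiny placeholder, and real engines use much longer, carefully tuned lists.

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "is", "the", "in", "and", "so", "on"}

def parse(text):
    """Split text into lowercase word tokens, stripping punctuation
    and discarding stop words. Hyphenated words like 'high-performance'
    are kept as single tokens."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(parse("The computer is in the house."))  # ['computer', 'house']
```

A real parser would also record the per-token attributes mentioned above (case, position, sentence number, and so on) rather than just the bare token.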
Obviously, the next step after parsing is to store the generated words (tokens) in a way that each word can be mapped to its source document. This is done through data structures; the two most commonly used by search engines are the forward index and the inverted index.
The forward index is essentially a list of the words in each document. A simplified forward index may look like this:
Document 1 | computer, hardware, represents, visible, devices |
Document 2 | building, right, my, house |
Document 3 | building, computer, extensive, high-performance, hardware |
As we will discover later, a forward index is of little use when implementing search. The rationale for building one anyway is that, as documents are being parsed, it is convenient to immediately store the words found in each document. At query time, however, a user enters a search term and the engine must find the documents in which that term occurs; a forward index cannot answer this directly. But if the forward index is sorted by word, we get exactly what we want. The resulting data structure is called an inverted index. A simplified inverted index for the documents above is:
hardware | Document 1, Document 3 |
building | Document 2, Document 3 |
computer | Document 1, Document 3 |
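Turning a forward index into an inverted index is a straightforward transformation. Here is a small sketch using the article's example documents (document names are mine, the words are from the tables above):

```python
from collections import defaultdict

# The forward index from the example above: document -> list of words.
forward_index = {
    "Document 1": ["computer", "hardware", "represents", "visible", "devices"],
    "Document 2": ["building", "right", "my", "house"],
    "Document 3": ["building", "computer", "extensive", "high-performance", "hardware"],
}

def invert(forward):
    """Flip a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(set)
    for doc, words in forward.items():
        for word in words:
            inverted[word].add(doc)
    return inverted

inverted_index = invert(forward_index)
print(sorted(inverted_index["hardware"]))  # ['Document 1', 'Document 3']
```

Note that the query "which documents contain 'hardware'?" is now a single lookup, whereas with the forward index it would require scanning every document's word list.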
In Google's case, the index includes additional information, such as the frequency of each word in each document and the positions at which the word occurs. Position information lets the search algorithm identify word proximity, which supports searching for phrases; frequency helps rank the relevance of documents to the query. Considering that search engines need to index information from across the entire Internet, the index can run to hundreds of gigabytes, hence compression techniques are employed.
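A positional index of the kind just described can be sketched as follows. This is a toy version under my own assumptions, not Google's actual structure: each word maps to the documents containing it, and each document entry holds the list of positions where the word appears (so frequency is just the length of that list).

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {document: [positions]}. The position lists
    support phrase and proximity queries; their lengths give per-document
    frequency for relevance ranking."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "digital cameras and digital photos", 2: "cameras for sale"}
index = build_positional_index(docs)
print(index["digital"])  # {1: [0, 3]}
```

With this structure, the phrase "digital cameras" matches document 1 because "cameras" occurs at a position exactly one greater than an occurrence of "digital".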
What happens in real time when searches are performed?
In the simplified inverted index above, we stored words alongside the names of the documents containing them. This approach, although simple, isn't realistic at a large scale. A better approach is to store the table as a two-dimensional array of bits: if word X is contained in a document, the corresponding bit is set; otherwise it is cleared. This representation is also well suited to the common case where a user searches for phrases rather than individual words.
Let's say you type the phrase "digital cameras" into the search box. To the search engine this means it must locate documents containing the word "digital" as well as documents containing the word "cameras". Anyone with a digital electronics background will recognize this as a logical AND operation. The engine retrieves the bitmap for each word and ANDs them together; the resultant bitmap has a bit set only where both original bitmaps had that bit set, and it can then be mapped back to the inverted file structure to recover the matching documents. The only task remaining is to rank the results, since the phrase may be found in a large number of documents.
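The bitmap AND just described can be demonstrated in a few lines. This is a minimal sketch under assumed data: each word maps to an integer treated as a bit vector, where bit i is set if document i contains the word.

```python
# Hypothetical per-word bitmaps over four documents (bit i = document i).
bitmaps = {
    "digital": 0b1011,  # documents 0, 1 and 3 contain "digital"
    "cameras": 0b0110,  # documents 1 and 2 contain "cameras"
}

def candidates(words):
    """AND the per-word bitmaps together; the set bits of the result
    mark documents containing every word in the query."""
    result = ~0  # all bits set
    for w in words:
        result &= bitmaps.get(w, 0)
    return result

hits = candidates(["digital", "cameras"])
print(bin(hits))  # 0b10 -> only document 1 contains both words
```

In practice these bit vectors are huge and sparse, which is one reason the compression techniques mentioned earlier matter so much.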
In this article I have only scratched the surface of how indexing is performed. What happens commercially and in practice is shaped by many other factors, such as language ambiguity, the diverse file formats on the Internet, parsing particular sections of documents, and the like. A novice like me is still climbing the learning curve here, and I will address these topics in a future post.
I sign off by giving you a link to a research paper of historic importance. Both its name and its content are geeky; I read it before writing the above. Anyway, have a look at The Anatomy of a Large-Scale Hypertextual Web Search Engine (a.k.a. Google).