I am a huge fan of Google and its products. I still marvel at the fact that a research paper, published as part of a doctoral project, could revolutionize the way we perceive and use the Internet! Google is the best at what it does, and seeing 234,754 search results returned in 0.15 seconds has become so commonplace that our jaws don't even drop. Ever wondered what goes on behind that search window? Ever thought about how a search engine indexes, aggregates and generates all of these results accurately in the wink of an eye?
Since dissecting Google to see how it works would make for a very long read, I will divide it into two blog posts. The present one deals with how Google indexes and stores web pages, while the next pertains to the actual algorithm it uses to rank them.
Indexing web pages is an interesting operation in itself. You type a search query and expect the search engine to deliver results QUICKLY and ACCURATELY. The speed with which results are displayed depends on how fast the search engine can sift through the indexes it creates. It is analogous to finding a particular word in a dictionary in your hand. Imagine if the dictionary weren't arranged alphabetically!
The mighty task of indexing begins with analyzing the web's content, a majority of which is text based. In tech speak, this process is called parsing, and its purpose is to create a text index known as an inverted file.
Parsing
Every English sentence consists of words, white spaces and punctuation marks. A good parsing algorithm extracts the words from a passage of text. Unlike humans, a computer recognizes words simply as the bits between white spaces, so it must also handle words attached to punctuation marks. As each word (or token) is identified, several characteristics may be stored: its case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, such as 'noun' or 'verb'), position, sentence number, sentence position, length and line number. This is useful when processing phrases (e.g. "digital cameras") as search queries.
Another desirable characteristic of a parser is that it should skip overly common words. Words like 'a', 'is', 'the', 'in' and so on (even the last three words of this very list) contribute little to search queries, hence they must be duly omitted from the index. Such words are known as stop words.
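The tokenization and stop-word removal described above can be sketched in a few lines. This is a minimal illustration, not how Google actually parses text; the stop-word list here is a tiny placeholder, and real engines use much longer, carefully tuned lists.

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "is", "the", "in", "and", "so", "on"}

def parse(text):
    """Split text into lowercase word tokens, stripping punctuation
    and discarding stop words. Hyphenated words like 'high-performance'
    are kept as single tokens."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(parse("The computer is in the house."))  # ['computer', 'house']
```

A real parser would also record the per-token attributes mentioned above (case, position, sentence number, and so on) rather than just the bare token.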
Obviously, the next step after parsing is to store the generated words (tokens) in a way that each word can be mapped to its source document. This is done through data structures; the two most commonly used by search engines are the forward index and the inverted index.
The forward index is essentially a list of the words in each document. A simplified forward index may look like this:
Document 1 | computer, hardware, represents, visible, devices |
Document 2 | building, right, my, house |
Document 3 | building, computer, extensive, high-performance, hardware |
As we will discover later, a forward index is of little use when implementing search. The rationale for building one anyway is that, as documents are being parsed, it is convenient to immediately store the words found in each document. At query time, however, a user enters a search term and the engine must find the documents in which that term occurs; a forward index cannot answer this directly. But if the forward index is sorted by word, we get exactly what we want. The resulting data structure is called an inverted index. A simplified inverted index for the documents above is:
hardware | Document 1, Document 3 |
building | Document 2, Document 3 |
computer | Document 1, Document 3 |
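Turning a forward index into an inverted index is a straightforward transformation. Here is a small sketch using the article's example documents (document names are mine, the words are from the tables above):

```python
from collections import defaultdict

# The forward index from the example above: document -> list of words.
forward_index = {
    "Document 1": ["computer", "hardware", "represents", "visible", "devices"],
    "Document 2": ["building", "right", "my", "house"],
    "Document 3": ["building", "computer", "extensive", "high-performance", "hardware"],
}

def invert(forward):
    """Flip a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(set)
    for doc, words in forward.items():
        for word in words:
            inverted[word].add(doc)
    return inverted

inverted_index = invert(forward_index)
print(sorted(inverted_index["hardware"]))  # ['Document 1', 'Document 3']
```

Note that the query "which documents contain 'hardware'?" is now a single lookup, whereas with the forward index it would require scanning every document's word list.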
In Google's case, the index includes additional information, such as the frequency of each word in each document and the positions at which the word occurs. Position information lets the search algorithm identify word proximity, which supports searching for phrases; frequency helps rank the relevance of documents to the query. Considering that search engines need to index information from across the entire Internet, the index can run to hundreds of gigabytes, hence compression techniques are employed.
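A positional index of the kind just described can be sketched as follows. This is a toy version under my own assumptions, not Google's actual structure: each word maps to the documents containing it, and each document entry holds the list of positions where the word appears (so frequency is just the length of that list).

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {document: [positions]}. The position lists
    support phrase and proximity queries; their lengths give per-document
    frequency for relevance ranking."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "digital cameras and digital photos", 2: "cameras for sale"}
index = build_positional_index(docs)
print(index["digital"])  # {1: [0, 3]}
```

With this structure, the phrase "digital cameras" matches document 1 because "cameras" occurs at a position exactly one greater than an occurrence of "digital".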
What happens in real time when searches are performed?
In the simplified inverted index above, we stored words alongside the names of the documents containing them. This approach, although simple, isn't realistic at a large scale. A better approach is to store the table as a two-dimensional array of bits: if word X is contained in a document, the corresponding bit is set; otherwise it is cleared. This representation is also well suited to the common case where a user searches for phrases rather than individual words.
Let's say you type the phrase "digital cameras" into the search box. To the search engine this means it must locate documents containing the word "digital" as well as documents containing the word "cameras". Anyone with a digital electronics background will recognize this as a logical AND operation. The engine retrieves the bitmap for each word and ANDs them together; the resultant bitmap has a bit set only where both original bitmaps had that bit set, and it can then be mapped back to the inverted file structure to recover the matching documents. The only task remaining is to rank the results, since the phrase may be found in a large number of documents.
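The bitmap AND just described can be demonstrated in a few lines. This is a minimal sketch under assumed data: each word maps to an integer treated as a bit vector, where bit i is set if document i contains the word.

```python
# Hypothetical per-word bitmaps over four documents (bit i = document i).
bitmaps = {
    "digital": 0b1011,  # documents 0, 1 and 3 contain "digital"
    "cameras": 0b0110,  # documents 1 and 2 contain "cameras"
}

def candidates(words):
    """AND the per-word bitmaps together; the set bits of the result
    mark documents containing every word in the query."""
    result = ~0  # all bits set
    for w in words:
        result &= bitmaps.get(w, 0)
    return result

hits = candidates(["digital", "cameras"])
print(bin(hits))  # 0b10 -> only document 1 contains both words
```

In practice these bit vectors are huge and sparse, which is one reason the compression techniques mentioned earlier matter so much.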
In this article I have only scratched the surface of how indexing is performed. What happens commercially and in practice is shaped by many other factors, such as language ambiguity, the diverse file formats on the Internet, parsing particular sections of documents, and the like. A novice like me is still climbing the learning curve here, and I will address these topics in a future post.
I sign off by giving you a link to a research paper of historic importance. Both its name and its content are geeky; I read it before writing the above. Anyway, have a look at The Anatomy of a Large-Scale Hypertextual Web Search Engine (a.k.a. Google).