
Search engines like Google, MSN and Yahoo, which crawl vast numbers of websites every day, now face a real challenge in searching the Deep Web, a development that could change how information is found on the internet.
Beyond the trillion or so pages that search engines can crawl lies a much larger web of hidden data: financial information, trade catalogs, schedules, medical research and other material stored in databases, which remains largely invisible to search engines.
The challenges the big search engines face in penetrating the so-called Deep Web go a long way towards explaining why there is still no satisfactory answer to questions like “What is the best fare from New York to London next Thursday?” The answers are readily available, if only the search engines were able to find them on the web. Most of this information is buried in dynamically generated pages that standard search engines cannot find. Traditional search engines cannot ‘see’ or retrieve content in the Deep Web, because the pages do not exist until they are created dynamically in response to a specific query. Now a new kind of technology is taking shape that will extend the reach of search engines into the Web's hidden corners. When this happens, it will do more than just improve the quality of search results; it may ultimately change the way many companies do business online.
“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.
Google's Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page related to the visual arts, it starts guessing likely search terms (“Rembrandt”, “Picasso”, “Vermeer” and so on) until one of them returns a match. It then analyzes the results and develops a predictive model of what the database contains.
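The probing idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Google's actual implementation: FAKE_DATABASE, submit_query and probe are hypothetical names, and the "database" is a small in-memory list standing in for a site that can only be reached through a search form. The sketch sends guessed terms, records which ones return matches, and counts the words in the results as a very crude model of what the database contains.

```python
from collections import Counter

# Hypothetical stand-in for a form-backed database (an assumption for this sketch).
FAKE_DATABASE = [
    "Rembrandt self portrait oil painting 1659",
    "Picasso cubist painting Guernica 1937",
    "Vermeer oil painting Girl with a Pearl Earring",
    "Train schedule Amsterdam to Delft",
]


def submit_query(term):
    """Simulate submitting `term` to a search form; return the matching records."""
    term = term.lower()
    return [rec for rec in FAKE_DATABASE if term in rec.lower().split()]


def probe(seed_terms):
    """Try each guessed term and count the words seen in the returned records."""
    hits = {}
    vocabulary = Counter()
    for term in seed_terms:
        results = submit_query(term)
        hits[term] = len(results)  # how many records this guess matched
        for rec in results:
            vocabulary.update(rec.lower().split())
    return hits, vocabulary


hits, vocabulary = probe(["Rembrandt", "Picasso", "Vermeer", "Mozart"])
print(hits)
print(vocabulary.most_common(3))
```

Here the guesses that match suggest the database is about painters, and the most frequent words in the results hint at its subject. A real system would need a far richer model than word counts.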
Some of the Deep Web resources may be included in one or more of the following categories:
Dynamic content – dynamic pages that are returned in response to a submitted query or are accessible only through a form, especially where open-domain input elements (e.g. text fields) are used; such fields are hard to navigate without domain knowledge.
Unlinked content – pages that are not linked to by other pages, which may prevent web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
Private websites – sites that require registration and login (password-protected resources).
Contextual Web – pages whose content varies with the access context (e.g. ranges of client IP addresses or the previous navigation sequence).
Restricted access content – sites that restrict access to their pages in a technical way (e.g. using the robots exclusion standard, CAPTCHAs, or Pragma: no-cache/Cache-Control: no-cache HTTP headers, which prohibit search engines from crawling them and creating cached copies).
Scripted content – pages that are accessible only through links produced by JavaScript, as well as content dynamically downloaded from web servers using Flash or AJAX solutions.
Non-HTML/text content – text encoded in multimedia (picture or video) files or specific formats not handled by search engines.
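One of the technical restrictions listed above, the robots exclusion standard, can be illustrated with Python's standard urllib.robotparser module. This is a minimal sketch, assuming a made-up robots.txt; the rules, the "ExampleBot" user agent and the example.com URLs are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access


def is_crawlable(url, agent="ExampleBot"):
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(is_crawlable("http://example.com/catalog.html"))
print(is_crawlable("http://example.com/private/report.html"))
```

A well-behaved crawler skips any page for which such a check fails, so the page stays out of the search index even though it is publicly reachable.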