
Conducting Research on the Internet

The Internet provides access to a wealth of information on countless topics contributed by people throughout the world. On the Internet, a user has access to a wide variety of services: vast information sources, electronic mail, file transfer, interest group membership, interactive collaboration, multimedia displays, and more. The Internet consists primarily of a variety of access protocols. These include e-mail, FTP, HTTP, Telnet, and Usenet news. Many of these protocols feature programs that allow users to search for and retrieve material made available by the protocol.

For background information on Internet access protocols, see A Basic Guide to the Internet.

The Internet is not a library in which all its available items are identified and can be retrieved by a single catalog. In fact, no one knows how many individual files reside on the Internet. The number runs into a few billion and is growing at a rapid pace.

The Internet is a self-publishing medium. This means that anyone with access to a host computer can publish on the Internet, even with little or no technical skill. It is important to remember this when you locate sites in the course of your research. Internet sites change over time according to the commitment and inclination of the creator. Some sites demonstrate an expert's knowledge, while others are amateur efforts. Some may be updated daily, while others may be outdated. As with any information resource, it is important to evaluate what you find on the Internet. For more information, see Evaluating Web Content.

Also be aware that the addresses of Internet sites frequently change. Web sites can disappear altogether. Do not expect stability on the Internet.

One of the most efficient ways of conducting research on the Internet is to use the World Wide Web. Since the Web includes most Internet protocols, it offers access to a great deal of what is available on the Internet.


HOW TO FIND INFORMATION ON THE INTERNET


There are a number of basic ways to access information on the Internet:

  1. Go directly to a site if you have the address
  2. Browse
  3. Explore a subject directory
  4. Conduct a search using a Web search engine
  5. Query a service devoted to digitized scholarly materials or books
  6. Explore the information stored in live databases on the Web, known as the "deep Web"
  7. Join an e-mail discussion group or Usenet newsgroup
  8. Read blogs and subscribe to RSS feeds

Each of these options is described below.

1. GO DIRECTLY TO A SITE IF YOU HAVE THE ADDRESS

If you know the Internet address of a site you wish to visit, you can use a Web browser to access that site. All you need to do is type the URL in the appropriate location window. URL stands for Uniform Resource Locator. The URL specifies the Internet address of the electronic document. Every file on the Internet, no matter what its access protocol, has a unique URL. Web browsers use the URL to retrieve the file from the host computer and the directory in which it resides. This file is then downloaded to the user's computer and displayed on the monitor.

This is the format of the URL:     protocol://host/path/filename

For example:

http://www.house.gov/agriculture/schedule.htm - a hypertext file on the Web
ftp://ftp.uu.net/graphics/picasso - a file at an FTP site
telnet://locis.loc.gov - a Telnet connection

Any of these addresses can be typed into the location window of a Web browser.
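The protocol://host/path/filename format can be taken apart programmatically. The following is a minimal sketch in Python using the standard library's urlsplit function; the helper name describe_url is made up for this illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the protocol, host, and path/filename parts."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. http, ftp, telnet
        "host": parts.hostname,     # the host computer
        "path": parts.path,         # directory and filename (may be empty)
    }


# The three example addresses from the text above.
for example in (
    "http://www.house.gov/agriculture/schedule.htm",
    "ftp://ftp.uu.net/graphics/picasso",
    "telnet://locis.loc.gov",
):
    print(describe_url(example))
```

Note that the Telnet address has no path at all: the URL names only a protocol and a host.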

2. BROWSE

Browsing home pages on the Web is a haphazard but interesting way of finding desired material on the Internet. Because the creator of a home page programs each link, you never know where these links might lead. High quality starting pages will contain high quality links.

3. EXPLORE A SUBJECT DIRECTORY

Universities, libraries, companies, organizations, and even volunteers have created subject directories to catalog portions of the Internet. These directories are organized by subject and consist of links to Internet resources relating to these subjects. The major subject directories available on the Web tend to have overlapping but different databases. Most directories provide a search capability that allows you to query the database on your topic of interest.

When to use directories? Directories are useful for general topics, for topics that need exploring, for in-depth research, and for browsing.

There are two basic types of directories: academic and professional directories often created and maintained by subject experts to support the needs of researchers, and directories featured on commercial portals that cater to the general public and are competing for traffic. Be sure you use the directory that appropriately meets your needs.

Subject directories differ significantly in selectivity. For example, the famous Yahoo directory does not carefully evaluate user-submitted content when adding Web pages to its database. It is therefore NOT a reliable research source and should not be used for this purpose. In contrast, INFOMINE selects only those sources considered useful to the academic and research community. Consider the policies of any directory that you visit. One challenge to this is the fact that not all directory services are willing to disclose either their policies or the names and qualifications of site reviewers. A number of subject directories consist of links accompanied by annotations that describe or evaluate site content. A well-written annotation from a known reviewer is more useful than an annotation written by the site creator as is usually the case with Yahoo.

It is useful to understand that certain directories are the result of many years of intellectual effort. For this reason, it is important to consult subject directories when doing research on the Web.

To get an idea of the range of directories available on the Web, connect to a list of Internet Subject Directories.

Recommended starting points:

4. CONDUCT A SEARCH USING A WEB SEARCH ENGINE

An Internet search engine allows the user to enter keywords relating to a topic and retrieve information about Internet sites containing those keywords. Search engines are available for many of the Internet protocols. For example, Archie searches for files stored at anonymous FTP sites.

Search engines located on the Web have become quite popular as the Web itself has become the Internet's environment of choice. Web search engines have the advantage of offering access to a vast range of information resources located on the Internet. Many search engines also search multimedia or other file types on the deep Web, often accessible as separate searches. Web search engines tend to be developed by private companies, though most of them are available free of charge.

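A Web search engine service consists of three components:

  1. Spider: a program that traverses the Web from link to link, identifying and reading pages
  2. Index: a database containing a record of the pages gathered by the spider
  3. Search mechanism: software that lets users query the index and returns a list of results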

Keep in mind that spiders are indiscriminate. Be aware that some of the resources they collect may be outdated, inaccurate, or incomplete. Others, of course, may come from responsible sources and provide you with valuable information. Be sure to evaluate all your search results carefully.

With most search engines, you fill out a form with your search terms and then ask that the search proceed. The engine searches its index and generates a page with links to those resources containing some or all of your terms. These resources are usually presented in ranked order. Term ranking was once a popular ranking method, in which a document appears higher in your list of results if your search term appears many times, near the beginning of the document, close together in the document, in the document title, etc. These may be thought of as first generation search engines.
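To make the idea of term ranking concrete, here is a rough sketch in Python. It scores a document on three of the signals mentioned above: how often the term appears, how early it first appears, and whether it appears in the title. The weights, the sample documents, and the function names are all assumptions chosen for illustration; no actual search engine is known to use these exact values, and the "close together" signal is left out because the sketch handles a single term only.

```python
def term_rank_score(title, body, term):
    """Return a relevance score for one document; higher ranks earlier."""
    term = term.lower()
    strip_chars = '.,;:!?"()'
    title_words = [w.strip(strip_chars) for w in title.lower().split()]
    body_words = [w.strip(strip_chars) for w in body.lower().split()]
    positions = [i for i, w in enumerate(body_words) if w == term]

    score = float(len(positions))            # how many times the term appears
    if positions:
        score += 1.0 / (1 + positions[0])    # bonus for appearing near the start
    if term in title_words:
        score += 2.0                         # bonus for appearing in the title
    return score


def rank_documents(documents, term):
    """Sort document names from highest to lowest score, dropping non-matches."""
    scored = {name: term_rank_score(title, body, term)
              for name, (title, body) in documents.items()}
    return sorted((n for n in scored if scored[n] > 0),
                  key=lambda n: scored[n], reverse=True)


# Hypothetical documents: name -> (title, body text).
DOCUMENTS = {
    "doc_a": ("Bears in winter", "Bears sleep. Many bears hibernate in dens."),
    "doc_b": ("Forest notes", "Deer live here, and bears visit in summer."),
    "doc_c": ("Gardening", "Tomatoes need sun."),
}

print(rank_documents(DOCUMENTS, "bears"))
```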

A more sophisticated development in search engine technology is the ordering of search results by concept, keyword, site, links or popularity. Engines that support these features may be thought of as second generation search engines. These engines offer improvements in the ranking of results. One reason for this is the insertion of the human element in determining what is relevant. For example, Google ranks a Web page according to the number of highly ranked pages that link to it. Those linking pages become highly ranked, in turn, if still other highly ranked pages link to them. This scheme represents an intriguing melding of technology and human judgment.
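The circular definition above (a page is highly ranked if highly ranked pages link to it) can be resolved by repeated calculation. The following Python sketch shows one simplified way to do this. It is an illustration of the general idea only, not Google's actual system: the function name link_rank, the damping value of 0.85, the fixed number of passes, and the tiny four-page "Web" are all assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Estimate a rank for each page from the pages that link to it.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}      # start with equal ranks

    for _ in range(iterations):
        # Every page keeps a small base amount regardless of links.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly over all pages.
                share = damping * rank[page] / len(pages)
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical miniature Web: pages a, b and d all link to c; c links to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = link_rank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example page c comes out on top because three pages link to it, and page a comes second because its single incoming link is from the highly ranked page c.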

All search engines have rules for formulating queries. It is imperative that you read the help files at the site before proceeding. Online tutorials can also help you learn the rules. A short list of recommended tutorials appears at the end of this file.

Recommended starting points:

  1. Start with Google. This is a famous search engine that ranks pages based on the number of links from pages ranked high by the service. The more highly ranked pages that contain these links, the higher the linked-to page will be ranked. These highly ranked linking pages, in turn, are also determined by the number of highly ranked pages that link to them. The idea here is that a high quality page will be found and linked to from another high quality page. The vast popularity of Google is a testament to the usefulness of this ranking scheme. Google has dubbed this ranking system PageRank.
  2. Another interesting link-ranking engine is Ask.com. The Ask.com link ranking scheme, called ExpertRank, is a bit different from Google's. Ask.com ranks links from pages in the same subject "community" as the topic being searched. The idea here is that people maintaining Web pages on individual topics are experts in this topic.
  3. Ixquick is a good place to try if your topic is obscure or if you want to retrieve results from a variety of search engines with a single search. This service searches multiple search tools simultaneously and returns your results in a single list that removes the duplicate files. This type of search processing is called meta searching. Even better, Ixquick only returns the top ten relevancy-ranked results from the source search services. This means that you can take advantage of the collective relevancy judgment of many tools at once. Other recommended meta search engines include Clusty and Don Busca.
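The meta searching described in item 3 can be sketched in a few lines of Python. This is a guess at the general approach, not Ixquick's documented method: the engine names, the result addresses, and the rule of ordering by how many engines returned a page are all assumptions made for the example.

```python
def meta_search(results_by_engine, top_n=10):
    """Merge the top results of several engines into one list without duplicates.

    results_by_engine maps an engine name to its ranked list of addresses.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for position, url in enumerate(urls[:top_n]):   # keep only the top results
            entry = merged.setdefault(url, {"engines": [], "best": position})
            entry["engines"].append(engine)
            entry["best"] = min(entry["best"], position)

    # Assumed ordering: pages returned by more engines first,
    # then pages that placed higher in any one engine's list.
    return sorted(
        merged,
        key=lambda u: (-len(merged[u]["engines"]), merged[u]["best"], u),
    )


# Hypothetical results from two made-up engines.
RESULTS = {
    "engine_a": ["x.example/1", "x.example/2"],
    "engine_b": ["x.example/2", "x.example/3"],
}
print(meta_search(RESULTS))
```

The page that both engines returned appears once in the merged list and moves to the top, which is one way of using the collective judgment of several tools.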

For a more extensive list of recommended Web search engines, see Internet Search Engines.

5. QUERY A SERVICE DEVOTED TO DIGITIZED SCHOLARLY MATERIALS OR BOOKS

Dot-coms have become interested in offering free searches of the world's literature as found in books and scholarly materials. Once results are found, users can access the material based on its copyright status. Materials out of copyright are generally fully available for viewing and printing, while only snippets of text or abstracts are available for copyrighted works. In either case, these services are opening up an enormous amount of the world's printed material to be freely searched. The potential benefits to the research process are only beginning to be understood.

Two notable sites for book searches are Amazon and Google Book Search. Amazon has its "Search Inside the Book" feature that offers a full text search as well as other features including links to related works and a concordance of the top 100 most common words. Google's service offers books derived from publisher agreements and also from the collections of notable libraries. Google's intention is to digitize all the books in the world - we will see if this succeeds.

Scholarly material in the form of journal articles and other similar works is also becoming available to be freely searched. Sites include Google Scholar and Windows Live Search Academic. Google Scholar enhances the research process by allowing users to explore works that cite items listed in their results. Users in academic institutions can often gain access to the full text of these materials. Others can purchase materials of interest.

Other services of these types are in the planning stages. They have the potential to turn the Web into a truly significant medium for research.

6. EXPLORE THE DEEP WEB

The concept of the "deep" or "invisible" Web is a challenging one. This refers to content that is stored in databases accessible on the Web but usually not available via search engines. In other words, this content is "invisible" to search engines. This is because spiders cannot or will not enter into databases and extract content from them as they can from static Web pages. In the past, these databases were fewer in number and referred to as specialty databases, subject specific databases, and so on.

The best way to access information on the invisible Web is to search the databases themselves. Topical coverage runs the gamut from scholarly resources to commercial entities. Very current, dynamically changing information is likely to be stored in databases, including news, job listings, available airline flights, etc. As the number of Web-accessible databases grows, it will become essential that they be used to conduct successful information finding on the Web.

Other content usually not gathered by spiders includes non-textual files such as multimedia files, graphical files, and documents in non-standard formats such as Portable Document Format (PDF). Google is one of the exceptions here, since it indexes PDF, Word, and other documents in its searchable index.

Content available on sites protected by passwords or other restrictions is also a part of the deep Web. Some of this is fee-based content, such as subscription databases or e-journals paid for by libraries and available to their users based on various authentication schemes.

Keep in mind that many search engine sites and commercial portals feature deep Web content as part of their package of services. This phenomenon falls under the heading of converging content, and is present on nearly all search engines these days. For example, you can visit AlltheWeb and look up news, pictures, video and audio, all outside the purview of a spider-gathered index.

7. JOIN AN E-MAIL DISCUSSION GROUP

Join any of the thousands of e-mail discussion groups. These groups cover a wealth of topics. You can ask questions of the experts and read the answers to questions that others ask. Belonging to these groups is somewhat like receiving a daily newspaper on topics that interest you. These groups provide a good way of keeping up with what is being discussed on the Internet about your subject area. Be careful to evaluate the knowledge and opinions offered in any discussion forum.

E-mail discussion groups are managed by software programs. There are three in common use: Listserv, Majordomo, and Listproc. The commands for using these programs are similar.

8. READ BLOGS AND SUBSCRIBE TO RSS FEEDS

Blogs are a fast-growing phenomenon of the Web. These are sites that present postings by one or more people, on which readers can comment. While many blogs serve the purpose of personal ruminations, others feature commentary and discussion on current events, academic research and professional topics. Good examples of academic-related blogs can be found on George Mason University's History News Network. Technorati is the premier search tool for locating blogs.

It is easy enough to start your own blog using such free services as Blogger and WordPress.

One of the newer communication technologies on the Web is RSS. This variably stands for Rich Site Summary, Really Simple Syndication, and so on. RSS allows people to place news and other announcement-type items into a simple XML format that can then be pushed to RSS readers and Web pages. Users can subscribe to the RSS newsfeeds of their choice, and then have access to the updated information as it comes in. RSS is used for all kinds of purposes, including the news itself and announcing new content on Web sites. Many RSS feeds come from the content of blogs.
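To show what the "simple XML format" looks like and how a reader might process it, here is a short Python sketch using the standard library's xml.etree.ElementTree module. The feed below is a made-up example laid out in the common RSS 2.0 style (a channel containing items, each with a title and a link); the feed content, the example.org addresses, and the function name read_items are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# A hypothetical newsfeed, written out by hand for this example.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Library News</title>
    <link>http://www.example.org/</link>
    <description>Hypothetical feed used only for illustration</description>
    <item>
      <title>New database trial</title>
      <link>http://www.example.org/news/1</link>
    </item>
    <item>
      <title>Extended opening hours</title>
      <link>http://www.example.org/news/2</link>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return a (title, link) pair for every item in the feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in read_items(SAMPLE_FEED):
    print(title, "-", link)
```

An RSS reader does essentially this each time it checks a feed, then shows the user whichever items it has not displayed before.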

RSS content may be read by using an RSS reader, or aggregator. This is usually free software that you install on your computer; it displays new items and stores old ones in a graphical interface. An RSS reader is similar to e-mail software in that it displays incoming items and can store content for offline reading. Subscribing to a newsfeed is usually as simple as entering the address of the RSS document. A useful list of RSS readers is available on the site of RSS Compendium. Some Web browsers, such as Firefox and Internet Explorer 7, offer the convenience of built-in RSS readers.

It is also possible to subscribe to and read your own collection of RSS feeds on Web sites devoted to this purpose. Bloglines is one such example. The advantage here is that you can access your RSS feeds from any computer that is connected to the Web.


PRACTICAL STEPS: WEB SEARCH ENGINES


HOW TO FORMULATE QUERIES

There are three steps to a computer database search:

1. Identify your concepts

When conducting any database search, you need to break down your topic into its component concepts. For example, if you want to find information on the budget negotiations between President Bush and the Democrats, these are your concepts: BUSH, DEMOCRATS, BUDGET.

2. List keywords for each concept

Once you have identified your concepts, you need to list keywords which describe each concept. Some concepts may have only one keyword, while others may have many.

For example:

BUSH

DEMOCRATS
HOUSE SPEAKER

BUDGET
BUDGET NEGOTIATIONS
BUDGET BATTLE
BUDGET IMPASSE
BUDGET DEAL

Depending on the focus of your search, there may be other keywords you would wish to use.

3. Specify the logical relationships among your keywords

Once you know the keywords you want to search, you need to establish the logical relationships among them. The formal name for this is Boolean logic. Boolean logic allows you to specify the relationships among search terms by using any of three logical operators: AND, OR, NOT.


Search Statement                 Result of search

World War I AND World War II     Files containing both of these terms

World War I OR World War II      Files containing at least one of these terms

World War I NOT World War II     Files containing the term World War I but
                                 not also the term World War II

Most search engines offer Boolean searching without mentioning the logical operators by name. For example, you might be asked to list your search terms and choose that All of these terms be searched. This denotes AND logic. Specifying Any of these terms denotes OR logic. Most search engines also use a type of implied Boolean logic, in which symbols or spaces are used to denote logical relationships. For example, +bears   +hibernation denotes AND logic. If you leave out the plus sign (+), most engines will perform an AND search for you.
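The three Boolean operators correspond directly to operations on sets of matching files. The Python sketch below illustrates this with three made-up files; the file names, their contents, and the function name matching are assumptions for the example.

```python
import re

# Hypothetical files: name -> text.
FILES = {
    "file1": "a history of world war i",
    "file2": "world war ii in the pacific",
    "file3": "from world war i to world war ii",
}


def matching(phrase):
    """Return the set of files containing the phrase as whole words."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return {name for name, text in FILES.items() if re.search(pattern, text)}


wwi = matching("world war i")
wwii = matching("world war ii")

print("AND:", sorted(wwi & wwii))   # files containing both terms
print("OR: ", sorted(wwi | wwii))   # files containing at least one term
print("NOT:", sorted(wwi - wwii))   # files with the first term but not the second
```

The whole-word check matters here: without it, a search for "world war i" would also match every file that mentions "world war ii".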

Certain search engines allow you to use a proximity operator. This is a type of AND logic which specifies the distance between words in a source file. For example, Exalead uses the NEAR operator. Consider this search: Bush NEAR budget. In Exalead, the two terms must be within 16 words of each other in the source file. Use of this option can help you gain relevance in your search results.
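A proximity test can be sketched as a comparison of word positions. The Python example below is an illustration only: how Exalead actually counts the distance is not stated here, so the sketch assumes that "within 16 words" means the two word positions differ by no more than 16. The function name near is also made up for the example.

```python
def near(text, first, second, max_distance=16):
    """Return True if the two terms occur within max_distance words of each other."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)


close_text = "President Bush signed the budget on Friday."
far_text = "Bush " + "filler " * 20 + "budget"   # 20 unrelated words in between

print(near(close_text, "Bush", "budget"))   # terms are 3 words apart
print(near(far_text, "Bush", "budget"))     # terms are 21 words apart
```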

Most Web search engines cannot handle a single search statement that includes all the terms listed in Step 2 above. You may need to repeat your search a few times using terms in different combinations until you get results that are satisfactory. For example, you may start with BUSH, DEMOCRATS, BUDGET NEGOTIATIONS and connect these terms with AND logic. Take a look at your results. If you are not finding what you want, repeat the search with alternative keywords for the budget concept. Your initial results may give you ideas about which new terms to try.
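One way to work through the combinations systematically is to generate one search statement for each choice of keyword per concept. The Python sketch below does this for the keywords listed in Step 2. The AND syntax and the quotation marks around phrases are assumptions about a typical engine; check the help files of the engine you are using.

```python
from itertools import product

# Keywords from Step 2, grouped by concept.
CONCEPTS = {
    "BUSH": ["Bush"],
    "DEMOCRATS": ["Democrats", "House Speaker"],
    "BUDGET": ["budget negotiations", "budget battle",
               "budget impasse", "budget deal"],
}


def candidate_queries(concepts):
    """Build one AND statement for every combination of one keyword per concept."""
    queries = []
    for combo in product(*concepts.values()):
        # Enclose multi-word keywords in quotation marks so they are treated as phrases.
        terms = ['"%s"' % k if " " in k else k for k in combo]
        queries.append(" AND ".join(terms))
    return queries


for query in candidate_queries(CONCEPTS):
    print(query)
```

With one keyword for the first concept, two for the second, and four for the third, this produces eight search statements to try in turn.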

For more information on formulating searches, see Boolean Searching on the Internet.

TIPS ON CONDUCTING SEARCHES

  1. Read the directions at each search site. The technique for formulating a search depends on the search engine you are using. There is a wide variety of options available among the different search engines.
  2. If you have a multi-term search, be sure to determine which type of Boolean logic you should use. For example, a search about the relationship between latitude and temperature can be formulated as:    +latitude   +temperature on many Web search engines in order for AND logic to apply.
  3. Include synonyms or alternate spellings in your search statements and connect these terms with OR logic.
  4. Check your spelling.
  5. Take advantage of capitalization if the search engine is case sensitive.
  6. If your results are not satisfactory, repeat the search using alternative terms.
  7. Try different sources to diversify your results. Sources can include other search engines and large directories.
  8. Experiment with different search engines. No two search engines work from the same index.
  9. Try search engines which allow you to search multiple search engines simultaneously. Be aware that you will lose access to advanced query options since not all engines offer them.
  10. If you have too many results, or results that are not relevant:
      - Field search.
      - Add concept words to your original search.
      - Use vocabulary that is specific to your topic; avoid words with large concepts unless you intend to field search.
      - Link appropriate terms with the Boolean AND ( + ) so that each term is required to appear in the record. While many search engines do not require this, it doesn't hurt to be on the safe side.
      - Use term proximity operators if they are available to locate documents in which your terms are close together. Exalead is one of the few engines nowadays that offers this.
      - If one of your search terms is a phrase, be sure to enclose it within quotations, e.g., "global warming."
      - Use the Boolean NOT to keep out records containing terms you don't want.
  11. If you have too few results:
      - Drop off the least important concept(s) to broaden your subject.
      - Use more general vocabulary.
      - Add alternate terms or spellings for individual concepts and connect with the Boolean OR.
      - Try the option available on some engines to find similar or related documents to one or more of your relevant hits.


Updated: 23 June 2008
