Reporting with the Internet - October 1999
This tip sheet is packed with plain-English details and links. Nora Paul, one of the world's leading trainers of journalists, explains The Hidden Web: What Search Engines Won't Find (and how to find them yourself). A companion file to this is "Web Chart" - an Acrobat Reader file that displays a search engine comparison chart Nora originally created in Excel. Nora has moved on from being known as the Web guru, based at Poynter, and is now the director of the Institute for New Media Studies, School of Journalism and Mass Communication at the University of Minnesota.


The Hidden Web: What Search Engines Won't Find
(and how to find them yourself)

The major search sites on the World Wide Web (like Yahoo!, AltaVista, Go, HotBot, Northern Light, etc.) are incredible resources.  But even the best of them index less than one-quarter of the web pages available.  There is a “hidden” net that can hold some of the best resources and most helpful information.  This presentation covers the hidden stashes of information to be aware of, how you can locate them, and how to use them.

What’s NOT in the search engine database?
Information recently added to a web site: It can sometimes take months before a spider takes another pass at a web site.  In the meantime, a lot of valuable information may have been added that you won’t find in your web search.  Here are some techniques that can help:


 Update Agents

    A service that monitors the web sites or pages you designate.  When there is a change or addition to that site, a notice about the change is sent to you.  Useful for: Keeping track of releases from an agency you report on when it doesn’t have an automatic alert service.
  • NetMind: http://www.netmind.com   “NetMind lets users track any Web page at any level of detail -- including  text, numbers, images, forms, links or keywords -- and be proactively notified via e-mail, pager, cell phone or PDA when the information changes.”  There is a free registration.
  • Informant: http://informant.dartmouth.edu/   Saves search engine queries and web sites (like a company or court's page), checks them periodically, and sends you e-mail whenever there are new or updated web pages. The Informant searches Alta Vista, Excite, Lycos and Infoseek.
  • Deja: Thread tracker:  http://www.deja.com  If you have signed up for “My Deja”, you can have new messages that are added to a newsgroup thread you are tracking sent to you via e-mail.
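The basic idea behind an update agent can be sketched in a few lines of Python. This is only an illustration of the technique, not how NetMind, the Informant or Deja actually work: it keeps a fingerprint of each page's text and reports a change when the fingerprint differs on the next visit. The URL and the page text are made up, and a real agent would download the page on a schedule and send an e-mail instead of returning True.

```python
import hashlib

def fingerprint(page_text):
    """Return a digest of the page text, ignoring differences in whitespace."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def check_for_change(url, page_text, seen):
    """Store the page's latest fingerprint and return True if it changed."""
    new_print = fingerprint(page_text)
    old_print = seen.get(url)
    seen[url] = new_print
    # On the first visit there is nothing to compare with, so report no change.
    return old_print is not None and old_print != new_print

seen = {}  # url -> fingerprint from the previous visit
url = "http://agency.example/releases"  # hypothetical page being watched

first = check_for_change(url, "Releases: budget report", seen)
second = check_for_change(url, "Releases:   budget report", seen)
third = check_for_change(url, "Releases: budget report, audit findings", seen)
```

In this sketch only the third visit counts as a change, because the second differs from the first only in spacing.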

Alert Services

    Alert services are found on individual web sites and are variously known as “watch lists”, “distribution lists” or “current awareness services”.  They use discussion list (listserv) software to maintain a list of people interested in updates and news releases, which are then sent to each person’s e-mail address.  Everyone who subscribes to the alert service receives the same documents or news releases.  Many government agencies and non-profit organizations have these sorts of services on their sites.  Useful for: Getting story ideas, keeping up on your beat.



Filters

    Filters provide customization.  They can be used to get news stories from wire services or messages from newsgroups that fit the subject profile you set up.  You sign up and type in words describing the topics you want the filter to catch.  The software stores your interest profile and checks each story or message against it.  When it finds a story or message containing the words in your profile, the program snags it and sends it to your e-mail box (or posts it to an area on the service you can check).  Useful for: Getting the latest information about a company or topic that you are tracking.

  • PR-Newswire: http://www.prnmedia.com/prnemail  On PRN, you create a profile that allows you to schedule e-mail delivery times, select the full-text, headline or abstract of press releases, select particular states, industries, subject areas or companies, or use advanced profile to use more specific terms. Press Line http://www.pressline.de/email-service/index.us.phtml  is a similar tool.
  • NewsIndex: http://newsindex.com/delivered.html  Set up a profile using subject keywords and have news stories from over 250 news sources around the world delivered to your email. Stories that appear to match your keywords are returned to you daily via email.
  • Quickbrowse: http://www.quickbrowse.com  As its name suggests, the service is meant to provide an easy and quick way of browsing the Net. It was developed by a journalist who wanted to view the sites of all major US papers on a daily basis. Through Quickbrowse, you can combine multiple sites (say, the feature sections of five newspapers) and then have those sites delivered to your e-mail at a particular time.
  • NewsTracker:  http://www.newstracker.com  “NewsTracker collects and filters thousands of late-breaking articles from a wide variety of online newspapers and magazines including the Los Angeles Times, Chicago Tribune, Forbes Digital, Advertising Age, and Russia Today.”
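Here is a minimal sketch of the matching step a filter performs, assuming the simplest possible rule: a story is delivered if it contains any word from your profile. The profile words and headlines are invented for illustration, and the services listed above may well use more elaborate matching.

```python
import re

def matches_profile(story, profile):
    """Return the profile words that appear in the story (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", story.lower()))
    return profile & words

# Hypothetical interest profile and incoming headlines.
profile = {"zoning", "landfill"}
stories = [
    "County board delays vote on landfill expansion",
    "High school team wins state title",
    "Zoning change near landfill draws protest",
]

# Only stories with at least one profile word are passed along.
delivered = [story for story in stories if matches_profile(story, profile)]
```

Matching on whole words rather than on raw substrings keeps a profile word like “zone” from catching “ozone”.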



Databases internal to a website

    Spiders can’t get into databases on a web site, or anything that is retrieved on the fly from a search query on a site.  There are thousands of documents, articles, reports, and speeches found in these secluded stashes. Useful for: Finding background information and in-depth reports.  Here are a few ways to find them.
 
  • Invisible Web:  http://www.invisibleweb.com  Yahoo-like directory of databases found on websites.  Some items have the search box on the page, others link you to the web page.
  • Direct Search: http://www.freepint.com/gary/direct.htm  Over 1,000 links to web site databases, organized by archives, books, government, humanities, news sources, social sciences, legal, ready reference, business, science, subject-specific.
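To see why database contents stay hidden, it helps to look at what a simple spider does with a page. The sketch below is an assumption about a bare-bones spider, not any search engine's actual code: it collects the links it can follow and merely counts the search forms, because it has no query to type into them. The sample page is made up.

```python
from html.parser import HTMLParser

class LinkSpider(HTMLParser):
    """Collect followable links; note search forms that cannot be followed."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms_skipped = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # A plain link: the spider can fetch and index this page.
            self.links.append(attrs["href"])
        elif tag == "form":
            # A search box: results exist only after someone submits a query.
            self.forms_skipped += 1

# Hypothetical page with two ordinary links and one database search form.
page = """
<a href="/about.html">About</a>
<a href="/reports/1999.html">1999 report</a>
<form action="/search.cgi"><input name="q"></form>
"""

spider = LinkSpider()
spider.feed(page)
```

Everything behind the form at /search.cgi never makes it into the list of pages to index, which is why you have to go to the site and run the search yourself.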

News archives

    Many newspaper web sites have an archive of stories, usually ones that appeared in the print product and not necessarily on the web site. Spiders might index stories that were on a web page, but they can’t get into the database archive of stories.  Many of these archives will let you search for free, but will charge a fee for a full-text copy of the story.  Useful for: Getting background information, finding out if something happening locally has happened in other places, locating experts, seeing if a story idea you have is fresh or has been done.
 

  • NewsTrawler:  http://www.newstrawler.com  A parallel search engine for news archives.  Click the papers you want trawled and it will  retrieve references to stories that fit the search you entered.

Commercial Information Services on the Web

    The stand-alone information services have migrated to the web and been joined by competitive web start-ups.  These huge information stores provide one-search shopping in archives of newspapers, magazines, and transcripts of television and radio programs.  Their material dates back to the early 1980s, and in some cases even earlier.  The contents of these services don’t get indexed by web search engines.
 
  • Electric Library:  http://www.elibrary.com  Search the text of articles from magazines, newspapers, books and transcripts from around the world.  Set fee allows unlimited searching and article downloading.
  • Northern Light:  http://www.northernlight.com  This is a combination spider search of web pages and a “special collections” database with articles from publications.  Abstracts are free but there is a fee of $1 - $4 for full downloads of selected articles.
  • DIALOG:  http://www.dialogweb.com (if you have an account)(for info: http://www.dialog.com)  500 databases covering business, news, patents, trademarks, science and government. More than 100 U.S. papers. 221 unique files that don't appear anywhere else.
  • DOW JONES: http://www.dj.com  80-million articles from 6,000 publications, plus market research, analyst reports and historical market data. Data can be output in a variety of formats, including spreadsheets.
  • LEXIS-NEXIS:  http://www.lexis-nexis.com 1.4-billion news stories, legal documents, financial and market reports, legislative materials and more from 22,000 sources arranged into nearly 10,000 databases. Adds 4.6-million documents a week.


 

“People Finder” Files

    Put a name into a search engine and you’ll find web pages that mention that person’s name.  Sometimes, though, you want to locate the person (find an address, e-mail address or phone number).  People finder sites on the web can help you do that, but you have to go to them to do the search. Useful for: Locating people you need to contact.
 
  • Telephone Directories on the Web:  http://www.teldir.com  “The Internet's original and most detailed index of online phone books, with links to Yellow Pages, White Pages, Business Directories, Email Addresses and Fax Listings from all around the world.”
  • PeopleSearch:  http://peoplesearch.net  A meta-search site for people-finder databases.  Put in a name and it will go out and do the search in 10 different people-finder databases.  Each search will spawn a new browser window; to quickly close browser windows, press ALT-F4.
  • AnyWho, WhoWhere, InfoSpace, InfoUSA, PC411:  These are some of the largest phone and e-mail lookup sites on the web.

Library Catalogs

    Libraries, those original collectors and compilers of information, have great resources on their web sites, much of it sitting in their online catalogs.  These contents won’t be picked up by a spider.  Useful for: Locating experts by searching for the authors of books.  Verifying information.
 

Things search engines might find but you can’t get


    The spider goes out, sees a web page, indexes it and puts information about the page in its database.  You come along and do a search.  The search results list what sounds like the perfect page for you.  You click on the link with great anticipation and get a “404 File not found” message.  Because of the time lag between the spider’s visit and your search, pages that were there when the spider came through might have been pulled by the time you do a search.  Tough luck….

    Except when you search the web using Google: http://www.google.com  Do a search in Google and if you come to a link to a page that is no longer there, click the “cached” link at the end of the entry.  Google will retrieve a copy of the page, as it was when indexed, from its cached page archive!
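The time lag and the cached-copy rescue can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not a description of Google's internals: the "live web" is a dictionary, the spider saves a copy of each page it visits, and a search result falls back to the saved copy when the live page is gone. The URL and page text are hypothetical.

```python
# Toy "live web": url -> current page text (hypothetical example page).
url = "http://city.example/budget.html"
live_web = {url: "Draft budget, Oct. 1"}

index = {}  # url -> copy of the page saved when the spider visited

def crawl(page_url):
    """Spider pass: save a copy of the page as it looks right now."""
    index[page_url] = live_web[page_url]

def open_result(page_url):
    """Follow a search result, falling back to the saved copy if needed."""
    if page_url in live_web:
        return ("live", live_web[page_url])
    if page_url in index:
        return ("cached", index[page_url])
    return ("404", None)

crawl(url)          # the spider comes through
del live_web[url]   # later, the site pulls the page
result = open_result(url)
```

Without the saved copy, the last step would return the “404” case; with it, you still get the page as it looked when the spider came through.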
 
(Nora Paul, Poynter Institute Oct. 17, 1999)
