storage.itworld.com
Storage Tip: Choosing an e-discovery tool
STORAGE.ITWORLD.COM --- 07/10/2007

David Hill

What seems to be the problem? When faced with an e-discovery request for legal purposes, IT administrators may have to search high and low to find all the data and put it into an analyzable format. But collecting the data into a searchable repository is only part of the challenge. The second part is to extract what you need, and ideally only what you need, from a potentially vast pile of information: a data haystack. Finding everything you need in that haystack, without pulling out a great deal that you do not need, can prove to be a formidable task.


What do you need to know? When you search through a stack of documents, you want the search technique to identify only relevant documents, but you also want it to identify all of the relevant documents. Precision is the proportion of retrieved documents that are relevant: the number of documents that are both retrieved and relevant, divided by the number of documents retrieved. (You do not want to have to separate the data wheat from the data chaff, especially if there are a lot of documents.) Recall is the proportion of all available relevant documents that are retrieved. (You need to make sure that you get all of the data wheat.) Unfortunately, there tends to be a tradeoff between the two: precision tends to decline as recall increases. Your goal is to improve both precision and recall simultaneously, even though you may never fully reach that goal.
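To make the two definitions concrete, here is a minimal Python sketch. The function name precision_recall and the document IDs are made up for illustration; in practice the set of relevant documents would come from a human review.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the search returned 4 documents, and 5 are truly relevant.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5", "doc6", "doc7"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50), and two of the five relevant documents were found (recall 0.40).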

Powerful e-discovery search tools now exist, and they may be very helpful in giving you both good precision and good recall. They may offer full Boolean capability, which means that you are not limited to searching on single keywords but can combine keywords with AND, OR, NOT, and NOR to help filter the data. Of course, many powerful search algorithms are proprietary (think Google), although Boolean logic may still be used with them. But Boolean techniques are all about the association of keywords. If you require too many keywords at once, you may find only relevant documents, but not all relevant documents (a problem with recall). If you use too few keywords, you may get back too many non-relevant documents (a problem with precision).
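The keyword tradeoff can be seen on a toy example. The sketch below is only an illustration in Python: the five documents, the choice of which ones count as relevant, and the helper names tokens, search_and, and precision_recall are all assumptions, and the search is a plain AND of keywords, not any vendor's algorithm.

```python
import re

# Hypothetical mini-corpus; assume the request concerns the Acme merger.
docs = {
    "d1": "acme merger contract signed in march",
    "d2": "merger talks with acme stalled over pricing",
    "d3": "acme invoice for printer paper",
    "d4": "acme holiday party invitation",
    "d5": "board approves the acme merger",
}
relevant = {"d1", "d2", "d5"}


def tokens(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def search_and(docs, keywords):
    """Return IDs of documents that contain every keyword (Boolean AND)."""
    return {
        doc_id
        for doc_id, text in docs.items()
        if all(k.lower() in tokens(text) for k in keywords)
    }


def precision_recall(retrieved, relevant):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


for keywords in (["acme"], ["acme", "merger"], ["acme", "merger", "contract"]):
    found = search_and(docs, keywords)
    p, r = precision_recall(found, relevant)
    print(f"{' AND '.join(keywords):<28} precision={p:.2f} recall={r:.2f}")
```

With one keyword, every relevant document comes back but so do the invoice and the party invitation (precision 0.60, recall 1.00). With three keywords, only d1 comes back (precision 1.00, recall 0.33). The two-keyword query happens to be perfect on this tiny corpus; real collections are rarely that cooperative.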

Adding in the ability to search by category can help improve results. Recommind is an example of a company that provides that type of capability. (Recommind recently gave me a briefing as an industry analyst.)

What is category analysis? Recommind uses the example of Java. A search on Java would yield information on coffee, software, and the Indonesian island. The results need to be sorted into categories so that you can select the relevant one. Recommind's software does this categorization automatically, so you can then pick the category that matches your requirements. (In practice, the categories may not be as obvious as they are in the case of Java.) That should help both precision and recall.
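The idea of picking a category can be sketched in a few lines of Python. This is not Recommind's method: the category labels below are assigned by hand purely for illustration, whereas a real tool would derive them automatically. The hits and the helper name group_by_category are hypothetical.

```python
from collections import defaultdict

# Hypothetical hits for the keyword "java", each with a hand-assigned category.
hits = [
    ("h1", "software", "java virtual machine tuning guide"),
    ("h2", "coffee", "java beans roasted fresh daily"),
    ("h3", "software", "java servlet deployment notes"),
    ("h4", "geography", "travel itinerary for java and bali"),
    ("h5", "coffee", "office java order for the break room"),
]


def group_by_category(hits):
    """Group document IDs by category label, preserving the order of the hits."""
    groups = defaultdict(list)
    for doc_id, category, _text in hits:
        groups[category].append(doc_id)
    return dict(groups)


groups = group_by_category(hits)
for category, doc_ids in sorted(groups.items()):
    print(category, doc_ids)

# The reviewer then keeps only the category that fits the request.
selected = groups["software"]
print("selected:", selected)
```

All five documents match the keyword, but choosing the software category leaves only the two that a reviewer of a software dispute would want.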

What can you do about it? Your challenge when selecting an e-discovery tool is to put your users in the best possible position to get what they need, and only what they need, out of the data haystack. Work with your users, such as your legal department, to select a number of test cases against which you can benchmark e-discovery tools. You must be able to measure the precision and recall of each tool against each test case. You may be able to get by with simple Boolean analysis, but full Boolean analysis capability is likely to be the minimum that you need. And if that alone is not sufficient, look at the other capabilities that the tool provides; category analysis may be one that you come to regard as essential.
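A benchmark of this kind can be tallied with very little code. The Python sketch below assumes that you have already run each candidate tool on each test case and recorded which documents it returned. The tool names, case names, and document IDs are placeholders, and averaging across test cases is just one simple way to summarize the results.

```python
from statistics import mean


def precision_recall(retrieved, relevant):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test cases agreed with the legal department:
# for each case, the documents a human reviewer judged relevant.
test_cases = {
    "case A": {"d1", "d2", "d3"},
    "case B": {"d4", "d5"},
}

# Hypothetical recorded output of each candidate tool on each test case.
tool_results = {
    "tool_x": {"case A": {"d1", "d2", "d9"}, "case B": {"d4", "d5"}},
    "tool_y": {"case A": {"d1", "d2", "d3", "d7", "d8", "d9"}, "case B": {"d4"}},
}


def benchmark(tool_results, test_cases):
    """Return {tool: (mean precision, mean recall)} across all test cases."""
    summary = {}
    for tool, results in tool_results.items():
        scores = [
            precision_recall(results[case], relevant)
            for case, relevant in test_cases.items()
        ]
        summary[tool] = (
            mean([p for p, _ in scores]),
            mean([r for _, r in scores]),
        )
    return summary


summary = benchmark(tool_results, test_cases)
for tool, (p, r) in summary.items():
    print(f"{tool}: mean precision={p:.2f} mean recall={r:.2f}")
```

Averages can hide a tool that does very badly on one important case, so it is worth looking at the per-case scores as well before making a choice.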

 

David Hill is the founder and principal at the Mesabi Group. The Mesabi Group is an industry analyst firm that focuses on networked storage and storage management. The second edition of the Mesabi Group report "Data Protection: Adapting to the Sea Change" (http://www.mesabigroup.com/English/Portfolio/Portfolio.html) is now available. Hill was VP of Storage Research and founded the Storage and Storage Management practice at Aberdeen Group, leading quantitative and qualitative market research. He directed data centers at Data General, introducing new analytical tools and business systems. He handled strategic marketing, competitive analysis, sales force planning, and market forecasting at a well-known storage vendor. He has an advanced degree from MIT's Sloan School. He can be reached at: davidhill@mesabigroup.com. Please visit the Mesabi Group Web site at: http://www.mesabigroup.com/



Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.