What seems to be the problem? When presented with the challenge of
finding data to meet e-discovery requests for legal purposes, IT
administrators may have to search both high and low to find all the data
and to put it into an analyzable format. But collecting the data into a
searchable repository is only part of the challenge. The second part is
to extract what you need, and ideally only what you need, from a
potentially vast pile of information: a data haystack. Sifting that
haystack for everything relevant, and nothing else, can prove
formidable.
What do you need to know? When you search through a stack of documents,
you want the search to return only relevant documents, and you want it
to return all of them. Precision measures the first goal: it is the
proportion of retrieved documents that are relevant. (You do not want to
have to separate the data wheat from the data chaff, especially if there
are a lot of documents.) Recall measures the second goal: it is the
proportion of all relevant documents in the collection that are
retrieved. (You need to make sure that you get all of the data wheat.)
Unfortunately, the two tend to trade off: precision tends to decline as
recall increases. Your goal is to improve both precision and recall
simultaneously, even though you may never be able to reach that goal
completely.
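To make the two definitions concrete, here is a minimal Python sketch.
The function name precision_recall and the document IDs are hypothetical,
chosen only for illustration; they do not come from any e-discovery
product.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    retrieved: IDs of the documents the search returned
    relevant:  IDs of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 4 relevant documents, 5 documents retrieved,
# 3 of which are relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc3", "doc7", "doc9"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.60 recall=0.75
```

Here the search returned some chaff (doc7 and doc9 lower precision) and
missed some wheat (doc4 lowers recall).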
Now, powerful e-discovery search tools exist, and they may be very
helpful in giving you both good precision and good recall. They may
offer full Boolean capability, which means that you are not limited to
searching on single keywords but can combine them with AND, OR, NOT, and
NOR to help filter the data. Of course, many powerful search algorithms
are proprietary (think Google), although Boolean logic may still be
used. But Boolean techniques are all about the association of keywords.
If you require too many keywords, you may find only relevant documents,
but not all relevant documents (a problem with recall). If you use too
few keywords, you may get back too many non-relevant documents (a
problem with precision).
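The keyword tradeoff can be seen in a toy Boolean search. This is only a
sketch under stated assumptions: the four documents, the ground-truth set
of relevant documents, and the search function with its all_of, any_of,
and none_of parameters are all hypothetical, and real tools index and
parse queries far more carefully.

```python
# Hypothetical mini-collection; d1 and d4 are assumed to be the relevant ones.
docs = {
    "d1": "contract breach notice sent to vendor",
    "d2": "vendor contract renewal terms",
    "d3": "lunch menu for the vendor picnic",
    "d4": "notice of breach in the supply agreement",
}
relevant = {"d1", "d4"}


def search(docs, all_of=(), any_of=(), none_of=()):
    """Return IDs of documents matching a simple Boolean query.

    all_of:  every term must appear (AND)
    any_of:  at least one term must appear (OR), if any are given
    none_of: no term may appear (NOT)
    """
    result = set()
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if not all(term in words for term in all_of):
            continue
        if any_of and not any(term in words for term in any_of):
            continue
        if any(term in words for term in none_of):
            continue
        result.add(doc_id)
    return result


# Few, loosely combined keywords: everything comes back (poor precision).
broad = search(docs, any_of=["vendor", "breach"])
# Many required keywords: only one relevant document survives (poor recall).
narrow = search(docs, all_of=["contract", "breach", "vendor"])
# OR combined with NOT filters out the chaff without losing the wheat.
balanced = search(docs, any_of=["vendor", "breach"], none_of=["picnic", "renewal"])

for name, found in [("broad", broad), ("narrow", narrow), ("balanced", balanced)]:
    hits = len(found & relevant)
    print(name, sorted(found),
          f"precision={hits / len(found):.2f}",
          f"recall={hits / len(relevant):.2f}")
```

In this toy collection the balanced query happens to score perfectly; on
real data, picking the right exclusion terms is the hard part.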
Adding in the ability to search by category can help improve results.
Recommind is an example of a company that provides that type of
capability. (Recommind recently gave me a briefing as an industry
analyst.)
What is category analysis? Recommind uses the example of Java. A search
on Java would yield information on coffee, software, and the Indonesian
island. The results need to be sorted into categories so that you can
select the one that is relevant to your requirements. Recommind's
software does this sorting automatically. (The categorization may not be
as obvious as it would be in the case of Java.) That should help both
precision and recall.
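The selection step can be sketched in a few lines of Python. This is not
Recommind's method: the sketch assumes each search hit already carries a
category label (assigning those labels automatically is the hard part a
product would do), and the hit IDs and labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical hits for the keyword "java", each with an assumed category label.
hits = [
    ("h1", "coffee"),
    ("h2", "software"),
    ("h3", "island"),
    ("h4", "software"),
    ("h5", "coffee"),
]


def group_by_category(hits):
    """Group hit IDs by their category label, preserving hit order."""
    groups = defaultdict(list)
    for doc_id, category in hits:
        groups[category].append(doc_id)
    return dict(groups)


groups = group_by_category(hits)

# The reviewer picks the one category that matters for the request.
selected = groups["software"]
print(selected)  # prints: ['h2', 'h4']
```

Choosing one category discards the hits from the others in a single
step, instead of adding ever more keywords to the query.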
What can you do about it? Putting your users in the best possible
position to get what they need and only what they need out of the data
haystack is your challenge when selecting an e-discovery tool. You must
work with your users, such as your legal department, to select a number
of test cases that you can use to benchmark e-discovery tools against.
You must be able to measure the precision and recall of each of the
tools against each of the test cases. Simple Boolean analysis may
occasionally be enough, but full Boolean capability is likely to be the
minimum that you need. And if that alone is not sufficient, look at the
other capabilities that the software tool provides; category analysis
may be one that you come to feel is essential.
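A benchmark of this kind can be sketched as follows. The tool names, test
case names, and document IDs are hypothetical, and averaging precision
and recall across test cases is just one simple way to summarize the
results; your legal department may weigh recall more heavily.

```python
from statistics import mean


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one tool on one test case."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def benchmark(tool_results, test_cases):
    """Average precision and recall per tool across all test cases.

    tool_results: {tool name: {case name: set of retrieved document IDs}}
    test_cases:   {case name: set of relevant document IDs}
    """
    report = {}
    for tool, results in tool_results.items():
        scores = [
            precision_recall(results.get(case, set()), relevant)
            for case, relevant in test_cases.items()
        ]
        report[tool] = (mean(p for p, _ in scores), mean(r for _, r in scores))
    return report


# Hypothetical ground truth agreed with the legal department.
test_cases = {"case_a": {1, 2, 3, 4}, "case_b": {10, 11}}

# Hypothetical output of two candidate tools on those test cases.
tool_results = {
    "simple_boolean": {"case_a": {1, 2, 5, 6}, "case_b": {10, 12}},
    "full_boolean": {"case_a": {1, 2, 3, 5}, "case_b": {10, 11}},
}

for tool, (p, r) in benchmark(tool_results, test_cases).items():
    print(f"{tool}: mean precision={p:.3f}, mean recall={r:.3f}")
```

A tool that returns nothing for a test case scores zero on both measures
for that case, so gaps in coverage show up in the averages.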