Collaborative Collection of Large Datasets
------------------------------------------

David S. Doermann

Center for Automation Research, Univ. Maryland
College Park, MD 20742 USA
doermann@cfar.umd.edu +1-301-405-4526 FAX 301-314-9115


One of the problems facing all areas of document analysis research is
the inability to obtain large amounts of ground truthed data at a
reasonable cost. Although some may argue that obtaining scanned images
of documents is relatively inexpensive, it is well known that the cost
of obtaining the complete and accurate "ground truth" necessary for
training and testing is often prohibitive.

Over the past decade, a number of datasets have been developed by
universities, government agencies and private companies, but they have
been either very limited in scope, limited in distribution, or both.
Because of the cost involved, such sets are somewhat of a
commodity. For companies using them for development, the datasets
represent all that is difficult about the problems they are solving,
and the companies therefore tend to hold them as trade secrets. In the
few cases where groups were paid to develop datasets specifically for
distribution, the data can be obtained at a reasonable cost.

The problems we face from the lack of large amounts of standardized
data affect both our current needs and the future of the field.
Without adequate datasets, basic training and testing suffer now, but
more importantly, there will be no incentive for researchers to
develop meaningful metrics for enhanced evaluation. Since evaluation
at some level is necessary to make any progress in the field, we will
eventually either plateau or expend significant resources on the wrong
problems. We need to be able to identify, with statistical confidence,
which specific problems we have solved, track how we have tried to
solve them, and explore which problems remain unsolved: the classic
trio of where we have been, where we are now, and where we are going.

In closely related areas, the speech recognition, information
retrieval and even on-line recognition communities have excelled at
gathering large amounts of data at relatively low cost to any one
organization. They often distribute the data for free, and provide
advanced metrics for evaluation. How can this be done? What is
different about our domain that we have not yet been able to come to
terms with and produce such large sets?

We can argue that the type of ground truth required for many
applications in document analysis is perhaps more involved than in
other domains. Speech recognition requires primarily transcription, IR
requires relevance judgments, and on-line recognition also requires a
transcription. For standard character OCR, we also have fairly
straightforward requirements, namely character identities.
Unfortunately, for other document analysis tasks such as page
segmentation and logical analysis, the required ground truth
representation becomes more complex and the process for extracting the
ground truth more uncertain. While most document analysis researchers
tend to use small existing collections or generate minimal amounts of
data themselves to provide somewhat subjective evaluations, it is
generally agreed that this is unacceptable in the long run.
Nevertheless, to obtain ground truthed data in any domain, there are a
number of models that can be followed.

One way the community can obtain large amounts of data is to pay
someone to generate it. While we would still be required to develop
specifications, with enough funds it could be done. Unfortunately, no
sponsor has yet provided the appropriate level of funding, and it can
easily be argued that there is no way to produce a quality set when we
limit the process to a small group of people.

A second way is to consider generating synthetic data. The University
of Washington has provided tools to generate ground truth data from
synthetic images generated from LaTeX. The algorithms can be run on
the images, and complete and accurate ground truth is available.
Unfortunately, synthetic data by itself does not necessarily reflect
real data accurately.
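
As a rough illustration of the idea (a minimal sketch, not the
University of Washington tool chain), the following renders a line of
text onto a blank page while recording each word's bounding box, so
the ground truth is exact by construction. The font path, page size
and layout constants are arbitrary assumptions.

    # Sketch: generate a synthetic page image together with exact word-level
    # ground truth.  Assumes Pillow is installed and a TrueType font is
    # available at the given path (adjust as needed).
    from PIL import Image, ImageDraw, ImageFont

    def render_with_ground_truth(words, font_path="DejaVuSans.ttf", size=24):
        font = ImageFont.truetype(font_path, size)
        page = Image.new("L", (800, 100), color=255)      # blank white page
        draw = ImageDraw.Draw(page)
        truth, x, y = [], 10, 30
        for word in words:
            x0, y0, x1, y1 = draw.textbbox((x, y), word, font=font)
            draw.text((x, y), word, font=font, fill=0)
            truth.append({"text": word, "bbox": (x0, y0, x1, y1)})
            x = x1 + 10                                   # fixed word gap
        return page, truth

    page, truth = render_with_ground_truth(["ground", "truth", "for", "free"])
    page.save("synthetic_page.png")
    print(truth)  # bounding boxes are exact because we generated the image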

One hybrid way to generate data is to take ground truthed data, either
synthetic or real, and corrupt it synthetically with degradation
models, or physically by faxing, copying, stapling, etc., and
rescanning. Unfortunately, the ground truth then needs to be
re-registered with the corrupted data, and in many cases the parameters
of the ground truth may change. Nevertheless, researchers have
addressed such problems, and methods exist for reestablishing
correspondence.
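
A minimal sketch of the synthetic-degradation path, assuming the page
and word-level truth produced by the earlier sketch: the page is
rotated by a small known angle and speckled with noise, and the word
boxes are mapped through the same rotation so image and truth stay
registered. The angle and noise fraction are arbitrary; a physical
degradation (faxing, copying) would require an estimated rather than
known transform.

    # Sketch: degrade a page with a known transform and re-register its
    # ground truth.  Assumes Pillow and NumPy; not a full degradation model.
    import math
    import numpy as np
    from PIL import Image

    def degrade_and_reregister(page, truth, angle_deg=2.0, noise_frac=0.01,
                               seed=0):
        w, h = page.size
        rotated = page.rotate(angle_deg, expand=True, fillcolor=255)
        W, H = rotated.size

        # PIL rotates counter-clockwise about the centre; map input points
        # to the expanded output canvas with the same rotation.
        a = math.radians(angle_deg)
        cos_a, sin_a = math.cos(a), math.sin(a)
        cx, cy, CX, CY = w / 2, h / 2, W / 2, H / 2

        def map_point(x, y):
            dx, dy = x - cx, y - cy
            return (cos_a * dx + sin_a * dy + CX,
                    -sin_a * dx + cos_a * dy + CY)

        new_truth = []
        for item in truth:
            x0, y0, x1, y1 = item["bbox"]
            pts = [map_point(x0, y0), map_point(x1, y0),
                   map_point(x0, y1), map_point(x1, y1)]
            xs, ys = zip(*pts)
            new_truth.append({"text": item["text"],
                              "bbox": (min(xs), min(ys), max(xs), max(ys))})

        # Salt-and-pepper noise on the rotated page.
        arr = np.array(rotated)
        rng = np.random.default_rng(seed)
        mask = rng.random(arr.shape) < noise_frac
        arr[mask] = rng.choice([0, 255], size=mask.sum())
        return Image.fromarray(arr), new_truth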

Another way is to look for corpora that provide information in both
electronic and hard copy form. This has been proposed by groups at
NIST (Garris - using the Federal Register), UMD (Kanungo - using the
Bible), as well as others. The registration problem must still be
addressed.
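
To illustrate one small piece of that registration problem (this is a
hedged sketch, not the NIST or UMD approach), the fragment below
aligns the electronic text against a noisy OCR transcript of the
scanned hard copy at the character level; a real system would also
have to recover geometric positions on the page.

    # Sketch: align electronic text to an OCR transcript so that regions of
    # the scan can inherit the electronic text as ground truth.
    import difflib

    def align(electronic: str, ocr_output: str):
        matcher = difflib.SequenceMatcher(None, electronic, ocr_output)
        correspondences = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # electronic[i1:i2] lines up with ocr_output[j1:j2]
                correspondences.append(((i1, i2), (j1, j2)))
        return correspondences

    clean = "In the beginning God created the heaven and the earth."
    ocr   = "ln the beginnmg God created the heaven and the carth."
    for (i1, i2), (j1, j2) in align(clean, ocr):
        print(clean[i1:i2], "<->", ocr[j1:j2])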

All of these options are feasible, yet somehow they never seem to
scale with the needs of the community. It is also very unlikely that a
single source will step forward and provide the funds necessary to
generate such large amounts of data, and since there will always be an
argument over synthetic vs. real data, the question remains: is there
a way to begin to accumulate data that represents the needs of the
community and scales appropriately? Other groups have done it, and we
should be able to as well. There is one solution which, for the most
part, ignores what is collected and instead considers how it is
collected.

Suppose that we were able to provide a service through which we
solicit 10 ground truthed documents from every researcher who wants
access to our collection, or even from every researcher at this
workshop. We could have a significant collection. We could entertain
proposals for the types of ground truth and representations, and ask
all contributors to provide scanned images, formatted ground truth, or
even tools for operating within our formats. In return, each
contributor would be able to access an appropriate fraction of the
collection.
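
To make the exchange concrete, a contribution could be as simple as a
set of images, their ground truth, and a note of what the contributor
may take in return. The record below is purely hypothetical; every
field name is illustrative, not an agreed-upon format.

    # Hypothetical contribution record for such a service.
    contribution = {
        "contributor": "example-lab@university.edu",
        "documents": [
            {
                "image": "scans/page_001.png",          # scanned image
                "ground_truth": "truth/page_001.json",  # in the agreed format
                "truth_level": "zone",                  # character, zone, page, ...
                "language": "en",
            },
            # ... nine more documents ...
        ],
        "tools": [],            # optional: scripts for working with the format
        "access_granted": 100,  # documents the contributor may download in return
    }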

Researchers working on different problems could contribute different
types of ground truth, such as segmentations and classifications from
the character level up through the zone, page and complete document
levels. Ultimately, we could have a collection significant enough that
we could test algorithms on both synthetic and real data, and even
have enough to provide statistically significant evaluations.
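
A layered representation along these lines is easy to imagine. The
sketch below is only a hypothetical nesting of character, zone, page
and document ground truth, not a proposed standard; a contributor
working only on page segmentation could populate the zone level and
leave the character lists empty.

    # Hypothetical layered ground-truth record, character up through zone,
    # page and document levels.  Field names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    BBox = Tuple[int, int, int, int]      # x0, y0, x1, y1 in image pixels

    @dataclass
    class CharGT:
        char: str
        bbox: BBox

    @dataclass
    class ZoneGT:
        bbox: BBox
        zone_type: str                    # e.g. "text", "figure", "table"
        reading_order: int
        chars: List[CharGT] = field(default_factory=list)

    @dataclass
    class PageGT:
        image_file: str
        zones: List[ZoneGT] = field(default_factory=list)

    @dataclass
    class DocumentGT:
        doc_id: str
        language: str
        source: str                       # e.g. "scanned", "synthetic"
        pages: List[PageGT] = field(default_factory=list)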

On Monday at ICDAR, David Stork will take these ideas much deeper in
his discussion of Open Mind. Open Mind seeks to provide a framework
not only for collecting data, but also for specialists to provide
tools and "e-citizens" to provide "information" which can be massaged
into ground truth once a critical mass is obtained.

Of course, there are a number of issues that must be addressed. What
if documents are from copyrighted sources? How can we obtain
permission to distribute such images?

On a more fundamental level, how do we know what is right? How do we
quantify our confidence? Do we really need 100% accuracy for the data
to be useful?

All of these questions have to be answered, not only for multi-lingual
problems, but for all document analysis problems. What other issues do
we need to address? I suspect some of the answers are in Open Mind,
but what else needs to be considered?