Methods of Market Research, Especially for Multi-lingual OCR
------------------------------------------------------------

Tim CALLAN

CAERE Corporation, 100 Cooper Ct, Los Gatos, CA 95128 USA
tim.callan@caere.com TEL +1-408-395-5148 FAX -5498


When you're working on the creation of a new technology, how
do you know what to build? How do you know which specific
features and functions are the most important? What will
the customers' expectations be for the way this technology
works? And most important of all, how big is the need for
this technology in the first place? All these questions
need to be answered if we are to be confident that the
technologies we're creating will be successful in the world's
marketplaces. The way we answer these questions is through
market research.

Broadly defined, market research is the practice of
targeting a group of prospective customers who are identifi-
able in some way and empirically determining facts about
those people that are pertinent to the decisions we need to
make. Typically marketers research their customers' needs,
preferences, situations, and attitudes. In the particular
case of Caere, we put a lot of effort into understanding the
nature of the documents that different segments of customers
want to process; the content, origin, format, and size of
those documents; the workflow that a customer prefers; and
their expectations for accuracy, speed, format retention,
and other performance issues.

It's useful to divide market research into two main
types, primary and secondary research. Primary research
simply means that the original data points are assembled by
you in the course of doing the research. Secondary research
relies on facts that were collected prior to your involve-
ment. Primary research is certainly more difficult, more
labor intensive, and more logistically challenging. Secon-
dary research requires much less effort but can be costly.
(Odd as it may sound, secondary research tends to be either
very expensive or free.) The main disadvantage with secon-
dary research is that many times no research exists that
exactly matches the questions you'd like to investigate.
Sometimes we can combine secondary market research with what
appear to be reasonable assumptions to extrapolate our
answers. While that may seem like sloppy science from a
purist's viewpoint, oftentimes it's the only practical way
to get at the questions that matter to us.

As a general rule, unless the cost is prohibitive, you
should take advantage of secondary research
as far as possible. Oftentimes it's no more difficult than
buying a book, getting something from the library, or find-
ing the right Web page. (Please note that I maintain extra
skepticism over and above my normal critical instincts when
it comes to any information I glean from the Web. With the
barrier to publishing lower than ever before, it's all the
easier for bad science, misinformation, and disinformation
to find their way into someone's page.) Even then it is
very likely that you will reach the point where no relevant
data presents itself for the technology questions that
interest you. At that point the only available course is
either to conduct or to commission research.


FUN WITH STATISTICS

Most research takes the form of a collection of data
points. The researcher identifies a question that should be
answered and then conducts a large number of trials, records
the results of those trials, and plots them in any of an
almost unlimited number of ways. This sort of research,
conducted properly, is very reliable. Nonetheless, any
number of pitfalls can ruin your results and cause you to
reach erroneous conclusions. Here are a few things to look
for with numerical research results:

Statistical validity: Be careful about drawing conclu-
sions when you have too few trials. Let's take the simple
example of a yes/no question: Do you prefer to scan in
grayscale over B/W (black and white)? Let's say that 55% of
our respondents answered yes. We may be tempted to conclude
that grayscale is more popular than B/W. However, it could
be that only twenty people were surveyed, in which case a
single respondent changing her mind would even out the score
and a second changed mind would put B/W in the lead. Cer-
tainly we can't draw any useful conclusions based on this
data. On the other hand, it may be that one thousand people
were surveyed, in which case 55% (five points above an even
split) is plenty good enough to conclude, with roughly 99%
confidence, that grayscale is more popular than B/W. Any
college stats textbook will give you the formulas you need to
calculate confidence levels for various sample sizes.
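
As a rough illustration of the kind of textbook calculation
involved, here is a short Python sketch. It assumes the normal
approximation to a one-sided test of a proportion against an
even split; the function name and the choice of approximation
are mine, made up for this example, and the approximation is
crude for a sample as small as twenty.

```python
import math

def confidence_of_majority(yes_share, sample_size):
    """Approximate confidence that the true 'yes' share exceeds 50%.

    Uses the normal approximation to the binomial, which is only a
    rough guide when the sample is small.
    """
    # Standard error of a proportion if opinion were really split 50/50.
    standard_error = math.sqrt(0.5 * 0.5 / sample_size)
    z = (yes_share - 0.5) / standard_error
    # Standard normal cumulative distribution, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The same 55% result at the two sample sizes discussed above.
for n in (20, 1000):
    print(n, round(confidence_of_majority(0.55, n), 3))
```

Under these assumptions the twenty-person survey comes out at
roughly two chances in three, while the thousand-person survey
comes out above 99%.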

Cross-tabulation: It's also tempting to segment your
results and then read data from there (What are the demo-
graphics of people who scan in grayscale?). Bear in mind
that the number of data points is dropping, often dramati-
cally, when you do so. And that opens the door to drawing
conclusions that are statistically invalid.
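
To see how quickly precision erodes as a sample is sliced up,
consider this minimal Python sketch. It assumes the usual
textbook margin of error for a proportion at 95% confidence
(z = 1.96) and simple random sampling; the segment sizes are
hypothetical.

```python
import math

def margin_of_error(sample_size, share=0.5, z=1.96):
    """Approximate 95% margin of error for a reported share."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)

# Hypothetical sizes: the full sample, then ever-smaller cross-tab cells.
for n in (1000, 250, 50):
    points = round(100 * margin_of_error(n), 1)
    print(n, "respondents: +/-", points, "points")
```

A result that is good to about three points across the whole
sample is good only to about fourteen points within a segment
of fifty respondents.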

Confusing statistical facts with actual facts: Much
research is conducted at a 95% confidence level, which is to
say that any individual conclusion is 95% likely to be
correct. That means that, on the average, one in twenty of
these conclusions will be wrong. So if you run a questionnaire
with twenty questions on it, chances are better than even
(about 64%, if the questions are independent) that at least
one of your questions is giving you a wrong answer. When
you're working with statistics always bear in
mind that -- statistically -- they should be expected to
result in occasional errors.
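
The arithmetic behind the twenty-question example is simple
enough to sketch in Python. The sketch assumes that every
conclusion is independently 95% likely to be correct, which is
a simplification; the function name is made up for this
example.

```python
def chance_of_any_error(conclusions, confidence=0.95):
    """Chance that at least one of several independent conclusions is wrong."""
    return 1.0 - confidence ** conclusions

# Twenty questions, each 95% likely to be right.
print(round(chance_of_any_error(20), 2))
```

For twenty questions the result is roughly 64%.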

Bias: It's easy to accidentally create your experiment
or survey in such a way that it biases the results. A broad
example would be if the above question were phrased, "You
don't like grayscale more than B/W, do you?" Now, bias
isn't ever completely avoidable (if nothing else, items have
to be listed in a specific order), but you should scrutinize
your work very carefully to make sure that your expected (or
desired) results aren't creeping into the research platform.

Sanity test: If your results look absolutely wacky,
that may be because they are. Look at your results with a
critical eye and see if they correspond to common sense.
When they do not, look for problems.


FUN WITHOUT STATISTICS

Statistically valid numbers, complete with percentages,
pie charts and the like, are all well and good. But there
are many occasions when the questions that are bothering us
are neither quantifiable nor measurable. That does not mean
that we have to give up the idea of research. Instead it
means that we need to consider research that yields direc-
tional results. Directional results are results that are
not concretely demonstrated in an unbiased, statistically
significant, scientifically accurate manner. They are not
proven to be fact. Nonetheless, directional results can be
very useful.

Directional research gets its name because it
points in a direction rather than determines a fact. A good
example of directional research is the focus group. In a
focus group, a relatively small number of people who meet a
certain description are gathered together in a room. There
a moderator conducts what is essentially a guided conversa-
tion in which these people's opinions and reactions are
drawn out in response to specific questions. Certainly we
won't generate a bar chart out of this guided conversation.
Nonetheless, it can be an extraordinarily useful forum.

A lot of directional data come from venues that aren't
specifically research projects. Talking to potential custo-
mers, participating in trade shows, and listening in on cus-
tomer support lines are all great examples of how companies
get direction from their markets without conducting formal
research. Directional results are a very good way (some-
times the only practical way) to get an impression about the
more complex and nebulous questions that we face. But
always bear in mind that they are merely an impression.
They are much more likely to be off course than numerical
results are.


COMMON PRIMARY RESEARCH METHODS

When Caere conducts primary market research, it typically
takes the form of a usage and attitude survey. That is
to say, Caere conducts a survey of a series of potential
customers with questions intended to understand their usage
of related computer products and their attitudes toward
those products and toward the work they need to do. Surveys
are likely the most powerful tool for primary research there
is.

Surveys can be conducted by any channel that allows a
consistent series of questions to be asked of a large number
of subjects. Traditionally surveys have been conducted by
mail or telephone. In recent years it's also become practi-
cal to conduct wide-scale market surveys by e-mail or using
the World Wide Web. In each case you create a questionnaire
that contains the exact same series of questions and then
make it available to a large number of potential subjects.
Tabulate the results, use the correct statistical measures,
and you have a set of numbers that allow you to draw reli-
able conclusions about the sort of people who responded to
your survey.

This last phrase is very important. You can draw reli-
able conclusions about the sort of people who responded to
your survey. The more the market you're targeting differs
from your survey respondents, the more suspect your conclu-
sions. For example, suppose you post a survey in HTML on
your Web page, and over the course of the next few weeks you
receive one hundred responses. Now let's say that one of
the questions in the survey involves what languages custo-
mers potentially want a new computer product to be in. And
furthermore let's suppose that 100% of the respondents
choose English as the language they want. Now, you can't
necessarily conclude that there is only demand for your pro-
duct in the English language. For instance, it may be that
your entire Web page, including the survey, is in English.
In that case you should expect the survey results to be very
strongly biased toward English over other languages. Most
times the bias in results will be more subtle (and difficult
to determine) than that. One common bias is that the people
who took the time to fill out your survey display a higher
level of involvement in the issues and technologies you're
exploring than the average member of the market does. You
should think about specific ways that your survey respon-
dents don't exactly correspond to the group of consumers for
whom you're developing.

Here are some common techniques for creating effective
surveys.

-- Try to make your questions as clear and unambiguous as
possible. It's easy to get bad data when your subjects
don't think you're asking what you think you are.

-- Test your surveys on an outsider before you send them
out. Get someone who wasn't involved in creating the
questions to make sure you asked them effectively.

-- Don't make the survey too long. You might lose atten-
tion and get worse answers (or fewer responses) late in
the survey.

-- Consider some kind of reward for returning the survey
(like a randomly drawn give-away, for instance). That
will increase your overall response rate. But it might
also bias answers toward the positive (since you're
giving them something, people think more favorably
toward you than they otherwise would). Be sure to
adjust for that.

-- Always include clear instructions about what to do in
each step of the survey. Just because it's obvious to
you doesn't mean it's obvious to anyone else.

-- Give respondents the opportunity to say they don't
know. Recently I filled out a Web-based survey that
forced me to answer every question before I could sub-
mit it, even though for some of them I had no idea.
Although I was trying to help the surveyor out, I had
no choice but to give false data for several questions.
Don't make your research subjects do that same thing.
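
Tabulating a question that offers a "don't know" choice might
look like the following Python sketch. The response data are
invented for illustration, and reporting shares only among
respondents who expressed an opinion is one possible
convention, not the only one.

```python
from collections import Counter

def tally(responses, dont_know="don't know"):
    """Count answers and compute shares among respondents with an opinion."""
    counts = Counter(responses)
    opinionated = sum(
        count for answer, count in counts.items() if answer != dont_know
    )
    shares = {
        answer: count / opinionated
        for answer, count in counts.items()
        if answer != dont_know and opinionated > 0
    }
    return counts, shares

# Made-up responses to the grayscale question, for illustration only.
answers = ["grayscale"] * 11 + ["B/W"] * 9 + ["don't know"] * 5
counts, shares = tally(answers)
print(counts["don't know"], "had no opinion")
print({answer: round(share, 2) for answer, share in shares.items()})
```

Keeping the "don't know" count visible alongside the shares
shows how many respondents the percentages actually rest on.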

One important question you'll face with your market
research is whether to build or buy. In the case of more
limited projects, it certainly is possible to generate your
own primary data. Make up a survey, photocopy two hundred
of them, stamp them, and send them off. Tally the responses
to every answer, and you can have some reasonably reliable
answers to basic questions in a few weeks with very little
expense or effort. In a lot of circumstances that's all
you'll need. But sometimes the questions will be difficult,
or your time will be limited, or the stakes will be very
high. In these cases you might want to think about commis-
sioning a project with a professional research firm.

Professional research firms run the gamut from those
that will essentially send out your survey and tabulate the
results all the way up to very high-end firms that can con-
duct comprehensive research based on only the broadest
direction. In the area of computers and high technology,
well-regarded research firms include Dataquest
(www.dataquest.com), International Data Corporation
(www.idc.com), Forrester Research (www.forrester.com),
Gartner Group (www.gartner.com), and META Group
(www.metagroup.com). InfoTrends Research Group
(www.infotrends-rsi.com), located in Boston, is a research
and analysis group focused entirely toward electronic imag-
ing technologies, including OCR. InfoTrends offers a
variety of reports for the scanning and digital camera
markets, and those reports typically range in price from
$1500 to $6500. For questions oriented specifically toward
Internet issues, good prospects include Zona Research
(www.zonaresearch.com) and Jupiter Communications
(www.jup.com). One more potential resource is the Software
and Information Industry Association (SIIA; www.siia.net),
formerly the Software Publishers Association (SPA).

There are many, many other organizations that do
research, ranging all the way down to one-person shops. Many
of these companies can do good research for you, given the
proper direction. You should check out a research company
thoroughly (including asking to see previous questionnaires
and their results) before engaging it to represent you to
the outside world.


ISSUES SPECIFICALLY RELATED TO OCR AND LANGUAGES

Everything I've discussed so far can be applied to
questions that involve worldwide scanner adoption, OCR needs
in various countries, and multiple language OCR needs.
Unfortunately, it's an area that hasn't been heavily
researched to date. Objective, third party research on
existing software and hardware sales is available for the US
market from companies like PC Data (www.pcdata.com) and NPD
(www.npd.com). Unfortunately no equivalent exists that cov-
ers all the markets in the world.

Certainly this kind of research could be commissioned
by one of the major research analysts listed above. And cer-
tainly the academic community could play a role in assem-
bling these numbers. Until definitive, quantified results
are found we have to rely on existing data and the best
logic we can muster.

The rule of thumb in Western-oriented Windows software
is that 50% of all commercial business software, by volume,
is sold in the US and Canada. 15% is English language sold
outside North America (representing UK, Australia, and vari-
ous nations that have too little computer demand to get
localized versions of computer products). Another 13% is
German language product, sold primarily in Germany, Austria,
and Switzerland. French language makes up about 7%, and
most of the balance is split among (in descending order
of size) Italian, Spanish, Dutch, Swedish, and Portuguese. In
Eastern languages, Japanese is number one today followed by
Korean and then simplified Chinese.

Please note that these are rules of thumb. Some
categories can perform very differently. Also note that
they constitute a snapshot of today's volumes, not of the
potential for the future. When forecasting the potential
need for a product based on conjectures from existing data,
you should adjust your assumptions based on the knowledge
you have of the situation. For example, let's say we're
trying to predict the sales breakdown of a scanner-related
product throughout the world. I would start with a break-
down that represents overall desktop computer penetration
(similar to the one in the previous paragraph). Then I
would adjust based on other things I know about computer use
throughout the world:

A) Expensive, advanced, new, or highly specialized com-
puter products tend to have a higher concentration in
the larger markets than computer products in general
do, specifically in the US and Canada. They also have
a higher concentration in Scandinavia. In the last few
years scanners have become a mainstream peripheral, and
so I'd change my adjusted forecast little if any based
on that information.

B) France and Japan have disproportionately higher Macin-
tosh penetration than the rest of the world. South
America has disproportionately lower Macintosh penetra-
tion. If you're evaluating Mac versus Windows, these
are good adjustments to make.

C) As countries undergo economic boom, computer adoption
should rise dramatically. You might use that to adjust
your forecasts up for -- let's say -- eastern Europe,
if you feel that area should continue its economic pro-
gress.

D) The advent of the Euro should alter the functioning of
the European economy. That might increase volumes for
highly specialized, extremely new, or badly distributed
products in western Europe. I wouldn't adjust my gen-
eral forecast for scanner volumes based on that.

E) We might consider if actual physical or software
differences in the scanners available in various mark-
ets will result in significant variations in the qual-
ity of OCR's input images for these countries. It
turns out that each major scanner manufacturer sells
the same basic technology into all its market countries
and that there is no variability in marketshare that's
dramatic enough between countries to result in radi-
cally different scanning results across the broad base
of users. On the other hand, the qualities of docu-
ments vary dramatically from country to country. So I
would conclude that we'd rather focus on document types
in various markets than scanner types in those same
markets.

We could go on with this hypothetical adjustment, but
you'll have to apply your own common-sense alterations based
on the specifics of what you're forecasting. What matters
here is the methodology, and the understanding that when
you can't get the truly scientific results we all would
prefer, there's still an opportunity to move forward based
on intelligent assumptions.
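
One way to mechanize this kind of adjustment is sketched below
in Python. The baseline shares are the rules of thumb quoted
earlier, with "other" standing for the remaining balance; the
multiplier is a hypothetical number invented for illustration,
and scaling followed by renormalizing is just one plausible way
to apply it.

```python
def adjust_forecast(baseline, multipliers):
    """Scale each segment's share by its multiplier, then renormalize to 1."""
    raw = {
        segment: share * multipliers.get(segment, 1.0)
        for segment, share in baseline.items()
    }
    total = sum(raw.values())
    return {segment: value / total for segment, value in raw.items()}

# Rule-of-thumb shares from the text; "other" is the remaining balance.
baseline = {
    "US and Canada": 0.50,
    "English outside North America": 0.15,
    "German": 0.13,
    "French": 0.07,
    "other": 0.15,
}
# Hypothetical multiplier, invented purely for illustration.
multipliers = {"other": 1.2}

adjusted = adjust_forecast(baseline, multipliers)
for segment, share in adjusted.items():
    print(f"{segment}: {share:.1%}")
```

Because the shares are renormalized, raising one segment
necessarily lowers the others a little, which is worth keeping
in mind when several adjustments are stacked.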

Having spent most of the 1990s as a product manager or
senior manager for Windows and Macintosh computer companies,
I've seen first hand that market research is an essential
part of creating a successful computer product. Unfor-
tunately, this indispensable tool is underutilized and
poorly understood. I hope that in this short paper I have
given you the basics of what is a very deep subject. And I
hope that some of the practical tips that business uses to
work its market forecasts can work for you as well.

Note on the twenty-respondent example under "Statistical
validity": the claim that we can't draw any useful conclusions
from that data is not entirely true. We can conclude with
reasonable confidence that grayscale is not universally liked
less than B/W, nor is it universally liked more. In some
circumstances that's all you need to determine.