Bare Minimum Internationalization of Software

I love internationalization (or “i18n”). I think one of the key promises of the Internet is that it brings every kind of information or service to everybody anywhere in the world, regardless of location or nationality. Companies that are marketing their software strictly to a domestic audience often think they don't have to be concerned with i18n. However, while the Internet allows us to reach out and connect with the rest of the world, it also allows the rest of the world to connect to us. While your efforts to sell software may only be domestic, the people it will find its way to—and the data that will find its way to it—are global.

Moreover, the notion of i18n as something that's only a concern when dealing with other nationalities is mistaken. Many a software team has had their first brush with character sets and internationalization when they discovered that their software turns the directional quotation marks (or “smart quotes”) emitted by Microsoft Word into gibberish. Indeed, the term “internationalization” is rather a misnomer. It's not preparing your software for going abroad; it's making it ready to work in a multilingual, multicultural world.

So, if you are testing a piece of software that's not going to be multilingualized or translated (at least, not yet), what is the bare minimum i18n it still needs to do?

First and foremost, the application must not crash or malfunction when fed Unicode or multibyte data. It may sound outrageous in this day and age, but some software reacts quite angrily to non-ASCII data. If you put an application on the public Internet, you can be certain that it will be accessed from all over the planet, whether that is your intent or not. And, as alluded to above, even in the case of an application deployed only on a company's intranet or desktop software with limited geographical distribution, “international” data will find its way into your application sooner or later. It needs to be able to cope.

Second, users must be able to use Unicode or multibyte data anywhere they can enter text into the application. This includes not only straightforward data fields such as the user's name, address, and so forth, but also things like passwords, especially since giving users a larger range of potential characters to select from increases the security of their passwords. In fact, any means by which textual data can be entered into the application needs to be Unicode aware. For instance, if the software you are testing can receive email or import Microsoft Word documents, those avenues of data ingress need to be tested as well. Things to watch for here include mojibake (garbled text produced when characters are stored or decoded with the wrong encoding, for instance, characters landing in the database wrong), filtering out of multibyte or non-ASCII characters, and input that's silently ignored altogether. Autocompletion mechanisms are often problematic: when fed non-ASCII text, they may offer incorrect suggestions or none at all. With passwords in particular, it's a good idea to make sure that users can create a password that includes non-ASCII characters, can log in with that same password, and can't log in through some common failure modes, like a blank password or the password with all non-ASCII characters stripped out.
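The password checks described above can be sketched as a small Python script. This is only an illustration under stated assumptions: the helper names (hash_password, verify_password, ascii_only) are hypothetical, PBKDF2 stands in for whatever hashing scheme the application under test really uses, and NFC normalization before hashing is one reasonable choice rather than something the application is known to do.

```python
import hashlib
import hmac
import os
import unicodedata


def _normalize(password: str) -> bytes:
    # Assumption: normalize to NFC so the same visible password typed on
    # different platforms yields the same bytes before hashing.
    return unicodedata.normalize("NFC", password).encode("utf-8")


def hash_password(password: str, salt: bytes) -> bytes:
    # Hypothetical stand-in for the application's real password hashing.
    return hashlib.pbkdf2_hmac("sha256", _normalize(password), salt, 100_000)


def verify_password(candidate: str, salt: bytes, expected: bytes) -> bool:
    # Constant-time comparison of the candidate's hash with the stored hash.
    return hmac.compare_digest(hash_password(candidate, salt), expected)


def ascii_only(text: str) -> str:
    # Simulates a buggy layer that silently drops non-ASCII characters.
    return text.encode("ascii", errors="ignore").decode("ascii")


salt = os.urandom(16)
password = "pässwörd-密码"
stored = hash_password(password, salt)

# The same password must work; the common failure modes must not.
checks = {
    "same password": verify_password(password, salt, stored),
    "blank password": verify_password("", salt, stored),
    "non-ASCII stripped": verify_password(ascii_only(password), salt, stored),
}
```

In a real test, the three entries in checks would be login attempts against the running application; only the first should succeed.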

Third, users must be able to get out of the application whatever data they put into it, intact. Non-ASCII text must display correctly, remain editable, and be processed by the software without getting mangled. Mojibake is the most common issue. In addition to text that's displayed directly by your program, be sure to check output avenues like sending email or exporting documents to PDF as well. Watch out for processes that either fail silently when fed non-ASCII data or die at the point where they encounter it.
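To make the round-trip requirement concrete, here is a minimal Python sketch of how mojibake arises and how a test can detect it. The round_trip function is a hypothetical stand-in for "store the text, then read it back"; in a real test those two steps would go through the application's database, email, or export path.

```python
def round_trip(text: str, write_encoding: str, read_encoding: str) -> str:
    """Encode text as if storing it, then decode it as if reading it back."""
    stored_bytes = text.encode(write_encoding)
    return stored_bytes.decode(read_encoding, errors="replace")


# A few sample inputs: accented Latin, directional quotes, CJK, Nordic letters.
samples = ["café", "“smart quotes”", "日本語", "Ærøskøbing"]

# When writer and reader agree on the encoding, every sample survives intact.
intact = all(round_trip(s, "utf-8", "utf-8") == s for s in samples)

# When the reader assumes Latin-1 but the writer used UTF-8, the two bytes
# of "é" are decoded as two separate characters: classic mojibake.
mangled = round_trip("café", "utf-8", "latin-1")

print(intact)   # True
print(mangled)  # cafÃ©
```

Comparing what comes out against what went in, character for character, catches both mojibake and silently dropped characters.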

When it comes to web applications, elements that are updated via AJAX are frequent offenders. Browsers may reinterpret the document based on the charset declared in the Content-Type header of the AJAX response, so you need to ensure not only that the updated elements handle non-ASCII characters correctly but also that nothing else on the page goes awry when the AJAX call fires.

Fourth, if your program handles dates or times, it almost certainly needs to be time-zone aware. Fortunately, almost every programming language has either a standard or de facto standard library for dealing with date and time calculations. Thus, this issue usually boils down to advocating for the application to be made time-zone aware if it isn't, and making sure that time zones are being used consistently if it is.
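As a rough illustration of what "time-zone aware" means in practice, the following Python sketch uses only the standard datetime module. The fixed offsets and the to_utc helper are assumptions made for the example; a real application would more likely use named zones (for instance via the standard zoneinfo module) so that daylight saving rules are handled.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, for illustration only; they ignore daylight saving time.
TOKYO = timezone(timedelta(hours=9), "JST")
NEW_YORK_WINTER = timezone(timedelta(hours=-5), "EST")


def to_utc(moment: datetime) -> datetime:
    """Convert an aware datetime to UTC; refuse naive ones outright."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("naive datetime: its time zone is unknown")
    return moment.astimezone(timezone.utc)


# 09:00 on 15 January in Tokyo...
meeting_tokyo = datetime(2024, 1, 15, 9, 0, tzinfo=TOKYO)

# ...is the same instant as 19:00 on 14 January in New York (winter time).
meeting_new_york = meeting_tokyo.astimezone(NEW_YORK_WINTER)

# Storing everything in UTC keeps comparisons and arithmetic consistent.
meeting_utc = to_utc(meeting_tokyo)
```

The useful testing question is whether a timestamp entered by a user in one zone still denotes the same instant when viewed by a user in another.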

Finally, data-checking or filtering algorithms need to handle the widest possible range of inputs. Names in particular are very special things when it comes to data validation. You do people a great disrespect when you tell them that their names are invalid or unacceptable. Patrick McKenzie's excellent and well-linked “Falsehoods Programmers Believe About Names” calls out many of the erroneous assumptions commonly made. Amongst other things, names can contain punctuation (D'Orazio, Mary-Lou), spaces (van Meegeren), or syntactically significant mixed case (McLean or, again, van Meegeren).
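One way to act on this is to validate names as permissively as possible. The Python sketch below is a hypothetical example, not a recommended standard: the function name is_acceptable_name and the specific rule (reject only empty input and control characters) are assumptions chosen to show the idea.

```python
import unicodedata


def is_acceptable_name(name: str) -> bool:
    """Accept nearly anything: reject only blank input and control characters."""
    stripped = name.strip()
    if not stripped:
        return False
    # Unicode category "Cc" covers control characters such as NUL or ESC.
    return not any(unicodedata.category(ch) == "Cc" for ch in stripped)


# Names from the text above, plus two non-ASCII examples.
names = ["D'Orazio", "Mary-Lou", "van Meegeren", "McLean", "José", "李小龙"]
results = {name: is_acceptable_name(name) for name in names}
```

Note that the function never changes the case or spacing of what the user typed; it only decides whether to accept the input as given.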

Phone numbers and addresses are an interesting problem. Their formatting tends to be fairly country-specific, but despite this, even applications that are primarily targeted at one particular locale frequently need to deal with international addressing. For instance, a testing conference taking place in one country is likely to lose a considerable amount of business if its registration web application cannot handle attendees with out-of-country billing addresses. The testing for this kind of functionality should start during the requirements-gathering and design processes. Together, testers, business stakeholders, and developers need to raise and answer questions like: What kinds of addresses do we need to accept? What is the business cost of not handling international addresses? What kind of validation is acceptable? Is simply accepting free-form text for these fields an option?

I18n is not merely a technical issue, but also a business issue. Again, this is one of the areas that make a strong case for testers being involved in the software specification process from the earliest possible point. When told that the software doesn't need to handle i18n, testers need to be in a position to ask what the risk to the business is and what opportunities will be missed as a result. The cost-to-value ratio may not work out in favor of a full-blown translation effort, but less-sweeping changes, including those above, can still make your software considerably more friendly to users from other countries. Is it really wise to pass up more of the international market than absolutely necessary? The world is a big place, and you do yourself a great disservice if you let your thinking be limited by national borders.

AgileConnection is a TechWell community.