
Primer on the internationalisation and localisation of web pages

W3C UK and Ireland Office 1 December 2002

This version:
http://www.w3c.rl.ac.uk/QH/WP5/WD-int-primer-20020901.html
Editors:
Martin Prime, CCLRC, M.J.Prime@rl.ac.uk
Michael Wilson, CCLRC, M.D.Wilson@rl.ac.uk

Abstract

This primer sets out to explain the methods for internationalising and localising Web pages. W3C's work in this area is to make sure that formats and protocols are usable worldwide in all languages and in all writing systems. Commercial suppliers provide tools which apply these recommendations, many of which are referenced in this primer. The role of the primer is to guide those procuring web sites, as well as those designing and developing them, in making their web sites as accessible across the world as possible and so gaining the largest possible audience for their web pages.

Status of this Document

This primer is being produced by the UK and Ireland Office of W3C as a deliverable in the EU funded project Question How and does not conform to the W3C process for documents.

Table of Contents

  1. Introduction
    1. Global Markets
    2. Languages - HTML, XHTML, SVG and other
  2. Character sets
    1. Unicode
    2. Character sets in HTML
      1. Multiple languages on one HTML page
    3. Character sets in XML
    4. Character sets in SVG
  3. Converting between Charsets
  4. Charsets and Browsers
  5. Typing charsets
  6. XML Lang
  7. Directionality and reading order
    1. Directionality and reading order in HTML
    2. Directionality and reading order in XML
    3. Directionality and reading order in SVG
  8. Fonts, charsets and Unicode
    1. Fonts, charsets and Unicode in SVG
    2. Fonts, charsets and Unicode in XML
    3. Fonts, charsets and Unicode in HTML
  9. Summary
  10. References
  11. Acknowledgments

1. Introduction

Results of recent surveys of web pages and web usage by Global Reach and FUNDREDES show that the English language content of the Web is now down to 40% of the total web content. The remaining 60% is presented in other languages. Similarly, web users are now mostly non-native English speakers whose browsers default to the character set of another language. Extrapolating these figures suggests that the rise of non-English languages on the web will continue, particularly in the Far Eastern languages.

The World Wide Web is becoming more "world wide" every day. Hardware and software are produced for the global market. It needs to be easy to create and process information for a wide range of audiences: to publish material and exchange data in Arabic, Chinese, French, Japanese, Korean, Hebrew, or Thai. Languages, writing systems, character codes, and other local conventions should not form barriers to W3C technology. The goal is to ensure that W3C's formats and protocols are usable worldwide in all languages and in all writing systems.

W3C has successfully stressed the role of Unicode as the base of the architecture of the Web. Recommendations from W3C for data formats and protocols use ISO 10646/Unicode to identify and describe characters. In implementations, Unicode is the hub for conversion between different character encodings. Once your data is in Unicode, it can all be handled in a uniform way and displayed, searched, sorted, and manipulated without fear of data corruption. Unicode covers virtually all legacy character repertoires, including ASCII, Latin-1, JIS X 0208, etc.

However, you have to state on your web pages which character set and which language you are using. Otherwise they may not be presented correctly.

Beyond character sets you have to internationalise and localise your pages to the cultural expectations of your users. Cultural diversity is too large a topic to be covered in this brief primer. To address the issue one simple example of the use of colour on web pages will be considered.

1.1. Global Markets

The number of languages on the Web is expected to grow exponentially. In 2001 English was still the major language on the Web; by April 2002 English was used for only 40% of web pages. The growth of non-English web pages, and of users whose browsers default to a non-English language, will continue to reduce the proportion of English on the Web.

1.2. Languages - HTML, XHTML, SVG and other technologies.

The history of web languages began with the creation of the web. In 1989, Tim Berners-Lee and his associates at CERN, the European Laboratory for Particle Physics in Geneva, Switzerland (the acronym comes from its original French name), invented a series of communications protocols that would present information in documents that could be linked to other documents and stored on computers throughout the Internet. He also developed the HyperText Markup Language to create and view documents on the Web. The first Web documents were text-only and the browser used to retrieve and view these documents was a crude text reader.

The Web became widely accessible to the public in 1993 when the National Center for Supercomputing Applications (NCSA) released an early UNIX version of the Mosaic Web browser. Marc Andreessen, who was at the time a student at the University of Illinois, was one of Mosaic's creators. Mosaic used icons, pull-down menus, bit-mapped graphics and colourful links to display hypertext documents. Later in 1993, versions of Mosaic were created for the Macintosh and Windows operating systems. Because of this development, the Web exploded into the information revolution and cultural phenomenon we know it as today.

As the popularity of the web increased, users and developers wanted to implement more and more functionality for the web. They grew tired of still images and stale web pages. They wanted to have animation and movement, and content generated on-the-fly. These desires prompted the development of most web languages that are around today. The class of markup languages, which began with HTML, has grown to incorporate many different languages including XML, MathML, and others. The Common Gateway Interface (CGI) was designed to interface the web with external applications so that code could be run on a server machine. Many scripting languages were developed which would run in-line code on the client as well. The development of Java opened up a new door by enabling program code to be executed on any machine, regardless of its architecture.

Originally developers generally ignored character sets. Since one ANSI character set can handle Western European languages like English, French, German, Italian and Spanish, other languages were considered special cases or not handled at all.

Many, but not all, of the world's major writing systems can be represented within 256 characters, using individual 8-bit character sets. It is important to note that no 8-bit character set can represent all of these languages at once, or even just the languages required by the European Union.

Languages which require more than 256 characters include Chinese (Traditional and Simplified), Japanese, and Korean (Hangeul). It is a requirement, not an option, that any application which touches text in these languages correctly handles DBCS (double-byte character set) or Unicode string processing and data.
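A minimal Python sketch of why a single-byte character set falls short for these languages. The sample word and the choice of Shift_JIS as the double-byte encoding are assumptions made for illustration; they do not come from the primer.

```python
# Sample text chosen for illustration: the word "Japanese" written in kanji.
text = "\u65e5\u672c\u8a9e"  # 日本語

# An 8-bit character set such as ISO-8859-1 has no slot for these characters.
try:
    text.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# A double-byte character set (here Shift_JIS) spends two bytes per kanji.
sjis_bytes = text.encode("shift_jis")

# UTF-8 spends three bytes on each of these characters.
utf8_bytes = text.encode("utf-8")

print(fits_in_latin1, len(text), len(sjis_bytes), len(utf8_bytes))
```

Code that counts bytes instead of characters will therefore truncate or split such text in the wrong place.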

2. Character sets

The first issue to be addressed is that of character sets for each of the languages used in web pages.

2.1 Unicode

Computers store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
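The conflict can be seen in a short Python sketch. The byte value and the three legacy encodings are picked only as illustrations; any pair of 8-bit encodings that differ in their upper half would show the same effect.

```python
# One byte value, three legacy encodings, three different characters.
raw = bytes([0xE1])

as_latin1 = raw.decode("iso-8859-1")      # Western European
as_greek = raw.decode("iso-8859-7")       # Greek
as_cyrillic = raw.decode("windows-1251")  # Cyrillic

for label, ch in [("iso-8859-1", as_latin1),
                  ("iso-8859-7", as_greek),
                  ("windows-1251", as_cyrillic)]:
    print(f"{label:13} 0xE1 -> U+{ord(ch):04X}")

# The reverse also holds: one character, different numbers.
e_acute_latin1 = "\u00e9".encode("iso-8859-1")  # é as a single byte 0xE9
e_acute_cp437 = "\u00e9".encode("cp437")        # é as a single byte 0x82
```

A receiver that guesses the wrong encoding silently gets the wrong characters, which is the corruption risk described above.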

Unicode changes this as Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode.
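As a rough illustration in Python, the same Greek word can travel in several encodings and still arrive as the same sequence of Unicode code points once it is decoded. The word is taken from the Greek strapline later in this primer; the set of encodings is an assumption for the example.

```python
# The Greek word "ιστό", written with escapes so the code points are explicit.
word = "\u03b9\u03c3\u03c4\u03cc"

# The same word as it might be transmitted in three different encodings.
wire_forms = {
    "iso-8859-7": word.encode("iso-8859-7"),
    "utf-8": word.encode("utf-8"),
    "utf-16": word.encode("utf-16"),
}

# Whatever the transfer encoding, decoding yields the same code points.
decoded = {name: data.decode(name) for name, data in wire_forms.items()}
code_points = [f"U+{ord(ch):04X}" for ch in word]

for name, data in wire_forms.items():
    print(f"{name:11} {len(data):2} bytes -> {decoded[name] == word}")
```

The byte counts differ, but the decoded text is identical, provided the receiver is told which encoding was used.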

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

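Unicode was originally designed as a 16-bit character set, giving 64k possible code points, which contains all of the characters commonly used in information processing. Since version 2.0 the code space has extended beyond 16 bits to U+10FFFF (over a million code points); in UTF-16, characters above U+FFFF are represented by surrogate pairs. Approximately 1/3 of the original 64k code points are still unassigned, to allow room for adding additional characters in the future.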

Unicode is not a technology in itself. Sometimes people misunderstand Unicode and expect it to 'solve' international engineering, which it doesn't. Unicode is an agreed way to store characters, a standard supported by members of the Unicode Consortium (e.g. Microsoft).

The fundamental idea behind Unicode is to be language-independent, which helps conserve space in the character map - no single character is assumed to identify a language in itself. Just like a character "a" can be a French, German or English "a" even if they have different meanings, a particular Han ideograph might map to a character used in Chinese, Japanese and Korean. Sometimes native speakers of these languages misunderstand Unicode as not "looking" correct in Japanese for example, but that's intentional - appearance should reside in the font as an artistic issue, not the code point as an engineering issue. Although it's technically possible to ship one font which covers all Unicode characters, it would have very limited commercial use, since end-users in Asia will expect fonts dedicated and designed to look correct in their language.

2.2 Character sets in HTML

Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can (and should!) have a charset parameter, which specifies the character encoding of the document. HTTP 1.1 says that the default charset is ISO-8859-1, but because there are still too many unlabeled documents in various encodings, browsers use the reader's preferred encoding when they don't get the information, on the assumption that most readers read documents in their own language. Therefore it is important to always label Web documents explicitly.

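The line in the HTTP header typically looks like this:

Content-Type: text/html; charset=iso-8859-7

The equivalent declaration inside an HTML document uses the meta element:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">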

Any character encoding that has been registered with IANA can be used, but it may be too much to ask of a browser to understand all of them. Some people have suggested limiting the allowed encodings to just ASCII, ISO-8859-1, UTF-8 and UTF-16. (See the W3C for an indicative list of encodings supported by major browsers.)

How to make the server send out appropriate 'charset' information depends on the server. Microsoft Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document. It uses the user's preferences if no meta element is specified.
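The order of precedence described above (HTTP header first, then the meta element, then the user's preference) can be sketched in Python as follows. The function names are hypothetical, and the regular-expression scan of the first 1024 bytes is a simplification for illustration, not how any particular browser is implemented.

```python
import re

# Hypothetical helpers; only the order of precedence comes from the text above.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()
    return None

def charset_from_meta(html_bytes):
    """Scan the start of the document for a meta charset declaration."""
    head = html_bytes[:1024].decode("ascii", errors="replace")
    for tag in re.findall(r"<meta\b[^>]*>", head, re.IGNORECASE):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None

def resolve_charset(content_type, html_bytes, user_default):
    """Apply the precedence: HTTP header, then meta element, then user default."""
    return (charset_from_header(content_type)
            or charset_from_meta(html_bytes)
            or user_default)

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-7"></head><body></body></html>')

print(resolve_charset("text/html; charset=utf-8", page, "windows-1252"))
print(resolve_charset("text/html", page, "windows-1252"))
print(resolve_charset(None, b"<html></html>", "windows-1252"))
```

Only the last case falls back to the reader's own default, which is the situation an explicit declaration is meant to prevent.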

To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames.

It is very important that the character encoding of any XML or (X)HTML document is clearly labeled. This can be done in the following ways:

For HTML, use the <meta> tag. Example:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

These examples show how to include a variety of different language texts in web pages.

If there is only one language on the page, simply choose the correct charset, for example:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">

Use a Common Font. Try to use a non-proprietary standard charset and font.

The following are examples of single strapline files.


german.html
korean.html

W3C has a page of common charsets.

Charsets supported by IE5

A Unicode editor allows you to write in the font that you require and see the font displayed.

You will also need to map the keyboard to the charset.

2.2.1 Multiple languages on one HTML page

Only a single charset is permitted per file. Therefore other charsets must be converted to a single charset - see section 3 on converting charsets. The recommended charset to use is utf-8, where each character is represented by a variable number of bytes according to UCS Transformation Format 8 defined in Annex P of the amended (PDAM 1) ISO/IEC 10646-1:1993.
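The variable byte length of utf-8 can be checked with a few lines of Python. The sample characters are chosen for illustration from the scripts used in this primer's straplines.

```python
# One character each from Latin, German, Greek, Hebrew, Chinese and Korean text.
samples = [
    "W",        # U+0057 Latin capital W
    "\u00f6",   # U+00F6 o with diaeresis
    "\u039f",   # U+039F Greek capital omicron
    "\u05dc",   # U+05DC Hebrew lamed
    "\u5f15",   # U+5F15 CJK ideograph
    "\uc6f9",   # U+C6F9 Hangul syllable
]

# Number of bytes each character occupies once encoded as utf-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in widths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex()}")
```

ASCII characters keep their single-byte form, which is why an English-only page encoded as utf-8 is byte-for-byte identical to its US-ASCII version.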

In the example below all text lines have been converted to utf-8.

Language W3C strapline
English Leading the Web to its Full Potential...
Arabic لإيصال الشبكة المعلوماتية إلى أقصى إمكانياتها...
Catalan Duent la Web al seu ple potencial ...
中文/Chinese - Simplified 引领网络充分发挥其潜能
中文/Chinese - Traditional 引領網絡充分發揮其潛能
Dutch Het Web tot zijn volle potentieel ontwikkelen...
French Amener le Web vers son plein potentiel...
German Alle Möglichkeiten des Web erschließen
Greek Οδηγώντας τον παγκόσμιο ιστό στο μέγιστο των δυνατοτήτων του...
Hebrew להוביל את הרשת למיצוי הפוטנציאל שלה...
Hungarian Hogy kihasználhassuk a Web nyújtotta összes lehetőséget...
Italian Sviluppare al massimo il potenziale del Web ...
日本語/Japanese Webの可能性を最大限に導き出すために…
Korean 웹의 모든 잠재력을 이끌어 내기 위하여...
Portuguese Levando a Web em direcção ao seu potencial máximo...
Русский язык/Russian раскрывая весь потенциал Сети...
Spanish Guiando el web a su completo potencial...
Swedish Se till att webben når sin fulla potential ...

Another example, incorporating more languages in a single HTML page, is available for the phrase "I can eat glass and it doesn't hurt me."

2.3 Character sets in XML

For XML, use the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:

<?xml version="1.0" encoding="iso-8859-1"?>
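A small Python sketch, using the standard library's ElementTree parser, of how the encoding pseudo-attribute governs the way the bytes are read. The element name and sample word are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The same element serialised twice; byte 0xE9 is "é" in ISO-8859-1.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'
utf8_doc = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("utf-8")

# Correctly labelled documents yield the same text whatever the encoding.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# A document whose label does not match its bytes is rejected by the parser.
mislabelled_doc = b'<?xml version="1.0" encoding="utf-8"?><p>caf\xe9</p>'
try:
    ET.fromstring(mislabelled_doc)
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False

print(latin1_text == utf8_text, mislabelled_ok)
```

Unlike an HTML browser, an XML processor does not guess: a wrong label is a fatal error, not a display glitch.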

2.4 Character sets in SVG

Since SVG is an XML application it also uses the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:

<?xml version="1.0" encoding="iso-8859-1"?>

3. Converting between Charsets

To perform the conversion, a converter is required.

Conversion routines exist in languages such as Java and Perl. The iconv package is available from GNU, and converters are also available as an ActiveX control and in Mozilla.

Tools are available, often based on the iconv package. These include a Java application, an online tool and an MS-Windows tool.

The last tool uses the conversion tables available from Unicode, although it converts only single-byte charsets, not multi-byte East Asian ones.

New converters will appear on the web. They can be found by searching for some combination of the following terms:


Character converter
Encoding translation
Charset conversion

These converters allow you to convert from local charsets to utf-8.
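The core of such a converter is small. The Python sketch below is one possible shape, not a description of any of the tools listed above; the function name is hypothetical and the German strapline from section 2.2.1 serves as sample data.

```python
def convert_to_utf8(data: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from a local charset to utf-8, using Unicode as the hub."""
    text = data.decode(source_charset)  # local charset -> Unicode
    return text.encode("utf-8")         # Unicode -> utf-8

# The German strapline as it would be stored in an ISO-8859-1 file.
latin1_bytes = "Alle Möglichkeiten des Web erschließen".encode("iso-8859-1")
utf8_bytes = convert_to_utf8(latin1_bytes, "iso-8859-1")

print(len(latin1_bytes), len(utf8_bytes))
```

The converter must be told the source charset. Remember to update the charset declaration in the converted page as well, since the old label no longer matches the bytes.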

Further articles on Character set conversion

Charset converters are available

4. Charsets and Browsers

With the character set information declared in each document, clients can easily map these encodings to Unicode. In practice, a few encodings will be preferred which are non-proprietary international standards, most likely: ISO-8859-1 (Latin-1), US-ASCII, UTF-8, UTF-16, also the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on.

If you are producing Web pages using a proprietary tool, then check that a charset is declared and that it is a non-proprietary standard. An individual manufacturer may use a proprietary charset (e.g. charset=windows-1251 for Russian, charset=windows-874 for Thai). In this case, the only browser that can view this charset may be the one produced by that manufacturer, and users with other browsers will see impenetrable rubbish on the screen.

Figure: Microsoft charset (charset=windows-1256) on a Microsoft browser
Figure: Microsoft charset (charset=windows-1256) on a Linux browser

Example screens of a page using a Microsoft proprietary charset (charset=windows-1256). Viewed on a Microsoft browser on a Microsoft system the page is readable, but on a Netscape browser on a Linux system it is incomprehensible. The web page was available at http://news.bbc.co.uk/hi/arabic/news/

If you are producing web pages in English, you must still make the character set declarations. If you do not, then readers whose web browsers default to non-English character encodings will see your web pages as a jumble of incomprehensible strokes. Such readers are becoming the majority of web users, and without a character set declaration your material will not reach them. If you do make the declaration, their browsers will make the mapping and present the English text as you intended.

Microsoft provides a table containing information about the character sets supported by Internet Explorer 5.

5. Typing charsets

Microsoft Visual Keyboard (http://office.microsoft.com/downloads/2000/viskeyboard.aspx) is a program that supports typing in more than one language on the same computer by showing you a keyboard for another language on your screen. You might use Visual Keyboard when you change your keyboard layout from one language to another. When you change keyboard layouts, the characters you see as you type might not correspond to your keyboard. Visual Keyboard lets you see the keyboard for the language you've switched to on your screen so that you can either click the keys on your screen or see the correct keys to press to enter text.

For example, you might be working in an English version of Microsoft Word but want to type text in Greek. After you switch keyboard layouts from English to Greek, you can use Visual Keyboard to see the Greek keyboard layout on your screen. To enter £ in your document, click £ on the on-screen keyboard, or use Visual Keyboard as a map to press the keys on your keyboard that correspond to the on-screen keys.

Microsoft offers the MultiLanguage Pack: each Office 2000 application includes an executable that supports most European, East Asian, and bi-directional languages. With the MultiLanguage Pack, Office 2000 enables the creation of a single custom installation that works for every language included with the Pack. Further, Office 2000 includes more intelligent language tools and supports Unicode, making it easy for international users to share documents without having to perform language-related file conversions.

6. XML Lang (Code for the Representation of the Names of Languages. From ISO 639, revision 19)

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.

Note: [IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].

For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
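An xml:lang value applies to the content of the element that carries it, including child elements, unless a descendant overrides it. The Python sketch below reads the attribute with the standard library's ElementTree. The wrapping body element and the helper name are assumptions added so that the inheritance is visible.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = ET.fromstring(
    '<body xml:lang="en">'
    "<p>The quick brown fox jumps over the lazy dog.</p>"
    '<p xml:lang="en-GB">What colour is it?</p>'
    '<p xml:lang="en-US">What color is it?</p>'
    "</body>"
)

def effective_languages(element, inherited=None):
    """Yield (element, language) pairs, inheriting xml:lang from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)

langs = [lang for el, lang in effective_languages(doc) if el.tag == "p"]
print(langs)
```

The first paragraph has no attribute of its own and so takes "en" from its parent.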

An unofficial list of two letter country codes

7. Directionality and reading order

The English language is read from top left to bottom right on a page, but many languages have other reading orders. Each of the web page languages provides some support for alternative reading orders - e.g. right to left in languages such as Hebrew or Arabic, or top to bottom.

7.1 Directionality and reading order in HTML

Languages like Hebrew read right to left. If they are represented in the normal left to right way in HTML they read like:

להוביל את הרשת למיצוי הפוטנציאל שלה

Whereas, if they are explicitly written to be presented right to left in HTML they appear like:

להוביל את הרשת למיצוי הפוטנציאל שלה

The difference between these two is that the second example sets the dir attribute, together with the direction and text alignment in the style attribute, on the paragraph element:

<p dir=RTL style='text-align:right;direction:rtl;unicode-bidi:embed'>
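When pages are generated by a program, the value for dir has to come from somewhere. One common heuristic, sketched below in Python, takes the direction of the first strongly directional character in the text. The function name is hypothetical, and the heuristic is an assumption for illustration, not part of HTML 4.

```python
import unicodedata

def base_direction(text, default="ltr"):
    """Guess the base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew is R, Arabic is AL
            return "rtl"
        if bidi_class == "L":          # Latin, Greek, Cyrillic, CJK and others
            return "ltr"
    return default                     # no strong character, e.g. digits only

hebrew = "להוביל את הרשת למיצוי הפוטנציאל שלה"
english = "Leading the Web to its Full Potential..."

print(base_direction(hebrew), base_direction(english), base_direction("123"))
```

The result can then be written into the dir attribute of the generated paragraph element.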

Ruby markup provides annotations for Japanese, Chinese and other Asian scripts. Ruby text is a run of text that appears in the immediate vicinity of another run of East Asian text, referred to as the base. Ruby text is often seen in Japanese magazines, and is heavily used in children's reading materials. A sequence of ideographic characters (kanji) is supplemented with the simpler hiragana, which show how the word should be pronounced.

7.2. Directionality and reading order in XML

7.3. Directionality and reading order in SVG

It is also possible to use SVG to present multiple languages on the same page. See the file internat_short.svg as an example.

<tspan x="15px" dy="30" style="fill:rgb(0,0,200)">
        Duent la Web al seu ple potencial ...
</tspan>

The tspan element is used to give an x and y position for the text and a fill colour.
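A list of straplines can be turned into such tspan elements programmatically. The Python sketch below is an illustration under stated assumptions: the canvas size, the starting y position and the selection of straplines are invented for the example, and the output has not been checked against internat_short.svg.

```python
import xml.etree.ElementTree as ET

straplines = [
    "Leading the Web to its Full Potential...",
    "Duent la Web al seu ple potencial ...",
    "Amener le Web vers son plein potentiel...",
]

# The root element; width, height and starting position are assumed values.
svg = ET.Element("svg", {"xmlns": "http://www.w3.org/2000/svg",
                         "width": "600px", "height": "200px"})
text = ET.SubElement(svg, "text", {"x": "15px", "y": "30px"})

for line in straplines:
    # Each tspan restarts at x=15px and moves down 30 units from the previous line.
    tspan = ET.SubElement(text, "tspan",
                          {"x": "15px", "dy": "30", "style": "fill:rgb(0,0,200)"})
    tspan.text = line

# Serialise as utf-8 and state the encoding in the XML declaration.
body = ET.tostring(svg, encoding="unicode")
svg_bytes = ('<?xml version="1.0" encoding="utf-8"?>\n' + body).encode("utf-8")
print(svg_bytes.decode("utf-8"))
```

Because the file is written as utf-8 and labelled as such, straplines in any script can be added to the list without changing the code.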

Presenting text in the reading order of Arabic and Hebrew:

<text x="550px" y="550px" style="fill:rgb(0,200,0);
writing-mode:rl;font-size:18;font-family:Arial Unicode MS">
           لإيصال الشبكة المعلوماتية إلى أقصى إمكانياتها......
</text>

The writing-mode property is used to present text right to left (rl).

The writing orientation for Chinese and Japanese characters can also be set, using a writing-mode of top to bottom (tb).

<text x="0px" y="70px" style="fill:rgb(0,200,200); writing-mode:tb; font-size:18;
font-family:'Arial Unicode MS','MS-Gothic','LucidaSansUnicode'">
       		引领网络充分发挥其潜能
</text>

It is also possible to rotate the character angle for western characters in a Japanese format when presented vertically.

<text x="0px" y="300px" style="fill:rgb(200,0,0); writing-mode:tb;
glyph-orientation-vertical:0; font-size:18;font-family:Arial Unicode MS">
       		Webの可能性を最大限に導き出すために…
</text>

This is done using glyph-orientation-vertical. Each individual letter of the word Web is rotated to present it in the same way as the Japanese text.

8. Fonts, charsets and Unicode

If a charset statement is used it provides the mapping from the characters used to characters in a font. However, not all fonts will support full Unicode character sets. It is necessary to ensure that a font that does map to the chosen character set is available.

8.1 Fonts, charsets and Unicode in SVG

There is a bug in the Adobe plug-in: only the first character of a string is tested for font compatibility, which causes problems for languages that share fonts.

9. Summary

This primer provides some methods for making web pages readable from a selection of web browsers around the world.

11. Acknowledgments

This document has benefited from valuable contributions by many members of the W3C World Offices.