Copyright ©2001-2002 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This primer sets out to explain the methods for internationalising and localising Web pages. W3C's work in this area is to make sure that formats and protocols are usable worldwide in all languages and in all writing systems. Commercial suppliers provide tools which apply these recommendations, many of which are referenced in this primer. The role of the primer is to guide those procuring web sites, as well as those designing and developing them, in making their web sites as accessible across the world as possible, so as to gain the largest possible audience for their web pages.
This primer is being produced by the UK and Ireland Office of W3C as a deliverable in the EU funded project Question How and does not conform to the W3C process for documents.
This document is being released for review by interested parties to encourage feedback and comments. This is the current state of an ongoing work on the primer.
This is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use it as reference material or to cite as other than "work in progress".
Comments on this document are invited and should be sent to the editor m.d.wilson@w3.org.
Results of recent surveys of web pages and web usage by Global Reach and FUNDREDES show that the English language content of the Web is now down to 40% of the total web content. The remaining 60% is presented in other languages. Similarly, web users are now mostly non-native English speakers whose browsers default to the character set of another language. These figures can be extrapolated to show that the rise of non-English languages on the Web will continue, particularly for the Far Eastern languages.
The World Wide Web is becoming more "world wide" every day. Hardware and software are produced for the global market. It needs to be easy to create and process information for a wide range of audiences: to publish material and exchange data in Arabic, Chinese, French, Japanese, Korean, Hebrew, or Thai. Languages, writing systems, character codes, and other local conventions should not form barriers to W3C technology. The goal is to ensure that W3C's formats and protocols are usable worldwide in all languages and in all writing systems.
W3C has successfully stressed the role of Unicode as the base of the architecture of the Web. Recommendations from W3C for data formats and protocols use ISO 10646/Unicode to identify and describe characters. In implementations, Unicode is the hub for conversion between different character encodings. Once your data is in Unicode, it can all be handled in a uniform way and displayed, searched, sorted, and manipulated without fear of data corruption. Unicode covers virtually all legacy character repertoires, including ASCII, Latin-1, JIS X 0208, etc.
However, you have to state on your web pages which character set and which language you are using. Otherwise they may not be presented correctly.
Beyond character sets you have to internationalise and localise your pages to the cultural expectations of your users. Cultural diversity is too large a topic to be covered in this brief primer. To address the issue one simple example of the use of colour on web pages will be considered.
The expected growth of the number of languages on the Web is exponential. In 2001 English was still the major language on the Web; by April 2002 English was used for only 40% of web pages. The growth of non-English web pages, and of the corresponding default language settings in users' browsers, will continue to reduce the proportion of English on the Web.
The history of web languages began with the creation of the web. In 1989, Tim Berners-Lee and his associates at the research centre known as CERN (the French acronym for the European Laboratory for Particle Physics) in Geneva, Switzerland, invented a series of communications protocols that would present information in documents that could be linked to other documents and stored on computers throughout the Internet. He also developed the HyperText Markup Language to create and view documents on the Web. The first Web documents were text-only and the browser used to retrieve and view these documents was a crude text reader.
The Web reached a wide public in 1993, when the National Center for Supercomputing Applications (NCSA) released an early UNIX version of the Mosaic Web browser. Marc Andreessen, who was at the time a student at the University of Illinois, was one of its lead developers. Mosaic used icons, pull-down menus, bit-mapped graphics and colourful links to display hypertext documents. Later in 1993, versions of Mosaic were created for the Macintosh and Windows operating systems. Because of this development, the Web exploded into the information revolution and cultural phenomenon we know today.
As the popularity of the web increased, users and developers wanted to implement more and more functionality for the web. They grew tired of still images and stale web pages. They wanted to have animation and movement, and content generated on-the-fly. These desires prompted the development of most web languages that are around today. The class of markup languages, which on the Web began with HTML, has grown to incorporate many different languages including XML, SGML, MathML, and others. The Common Gateway Interface (CGI) was designed to interface the web with external applications, so that code could be run on a server machine. Many scripting languages were developed which would run in-line code on the client as well. The development of Java opened up a new door by enabling program code to be executed on any machine, regardless of its architecture.
Originally developers generally ignored character sets. Since one ANSI character set can handle Western European languages like English, French, German, Italian and Spanish, other languages were considered special cases or not handled at all.
Many, but not all of the world's major writing systems can be represented within 256 characters, using individual 8-bit character sets. It's important to note there isn't an 8-bit character set which can represent all of these languages at once, or even just the languages required by the European Union.
Languages which require more than 256 characters include: Chinese (Traditional and Simplified), Japanese, and Korean (Hangeul). It is a requirement, not an option, that any application which touches text in these languages needs to correctly handle DBCS or Unicode string processing and data.
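The limits of 8-bit character sets can be checked directly. The following is a minimal sketch, assuming Python 3 and its built-in codecs; the helper name `can_encode` and the two sample strings are illustrative and not part of the primer.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Sample strings chosen for illustration only.
german = "Alle Möglichkeiten des Web erschließen"
greek = "Οδηγώντας"

# Each 8-bit charset covers one sample but not the other; utf-8 covers both.
for charset in ("iso-8859-1", "iso-8859-7", "utf-8"):
    print(charset, can_encode(german, charset), can_encode(greek, charset))
```

The Latin-1 set handles the German text but not the Greek, the Greek set does the reverse, and only a Unicode encoding accepts both in the same file.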
The first issue to be addressed is that of character sets for each of the languages used in web pages.
Computers store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
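The conflict between encodings is easy to reproduce. This small sketch, assuming Python 3 and its standard codecs, decodes the same single byte under three ISO-8859 character sets; the variable names are illustrative.

```python
# One byte with the value 0xE9 (233 decimal).
raw = bytes([0xE9])

# The same number stands for a different character in each encoding.
decoded = {charset: raw.decode(charset)
           for charset in ("iso-8859-1", "iso-8859-7", "iso-8859-5")}

for charset, ch in decoded.items():
    print(f"{charset}: U+{ord(ch):04X} {ch}")
```

The byte is a Latin letter in the first encoding, a Greek letter in the second and a Cyrillic letter in the third, which is why unlabelled data is at risk of corruption when it moves between systems.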
Unicode changes this as Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.
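As an illustration of "a unique number for every character", the sketch below prints the Unicode code point and the standard character name for a few characters. It assumes Python 3 and its `unicodedata` module; the helper `describe` and the choice of sample characters are hypothetical.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character as its code point followed by its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# A Latin, a Greek, a Hebrew and a Han character.
for ch in ("A", "Σ", "א", "中"):
    print(ch, describe(ch))
```

The numbers are the same on every platform and in every program, whichever byte encoding is later used to store or transmit them.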
The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode.
Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
Unicode was originally designed as a 16-bit character set, and its first 64k code points (the Basic Multilingual Plane) contain all of the characters commonly used in information processing. Approximately 1/3 of these 64k code points are still unassigned, and the standard has since been extended beyond 16 bits (up to code point U+10FFFF), to allow room for adding additional characters in the future.
Unicode is not a technology in itself. Sometimes people misunderstand Unicode and expect it to 'solve' international engineering, which it doesn't. Unicode is an agreed-upon way to store characters, a standard supported by the members of the Unicode Consortium (e.g. Microsoft).
The fundamental idea behind Unicode is to be language-independent, which helps conserve space in the character map - no single character is assumed to identify a language in itself. Just like a character "a" can be a French, German or English "a" even if they have different meanings, a particular Han ideograph might map to a character used in Chinese, Japanese and Korean. Sometimes native speakers of these languages misunderstand Unicode as not "looking" correct in Japanese for example, but that's intentional - appearance should reside in the font as an artistic issue, not the code point as an engineering issue. Although it's technically possible to ship one font which covers all Unicode characters, it would have very limited commercial use, since end-users in Asia will expect fonts dedicated and designed to look correct in their language.
Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can (and should!) have a charset parameter, which specifies the character encoding of the document. HTTP 1.1 says that the default charset is ISO-8859-1, but because there are still too many unlabeled documents in various encodings, browsers use the reader's preferred encoding when they don't get the information, on the assumption that most readers read documents in their own language. Therefore it is important to always label Web documents explicitly.
The line in the HTTP header typically looks like this:
Content-Type: text/html; charset=iso-8859-7
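A client has to extract the charset parameter from a Content-Type value such as `text/html; charset=iso-8859-7`. The sketch below shows one way to do this, assuming Python 3 and the header parser in its standard `email` package; the function name `charset_from_content_type` is hypothetical, and the fallback to ISO-8859-1 follows the HTTP 1.1 default mentioned above.

```python
from email.message import Message


def charset_from_content_type(value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, lower-cased.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-7"))
print(charset_from_content_type("text/plain"))
```

In practice a browser would substitute the reader's preferred encoding for the default, for the reasons given in the text.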
Any character encoding that has been registered with IANA can be used, but it may be too much to ask of a browser to understand all of them. Some people have suggested limiting the allowed encodings to just ASCII, ISO-8859-1, UTF-8 and UTF-16. (See http://www.w3.org/International/O-charset-list.html for an indicative list of encodings supported by major browsers.)
How to make the server send out appropriate 'charset' information depends on the server. Microsoft Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document. It uses the user's preferences if no meta element is specified.
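The order of precedence described here (HTTP header first, then the meta element, then the user's preference) can be sketched as follows. This is a simplified illustration in Python 3, not Internet Explorer's actual algorithm; the function `pick_charset`, the regular expressions and the 1024-byte scan limit are all assumptions made for the example.

```python
import re

# Simplified patterns: real browsers parse headers and markup more carefully.
HEADER_CHARSET = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9._-]+)""",
                            re.IGNORECASE)
META_CHARSET = re.compile(rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._-]+)""",
                          re.IGNORECASE)


def pick_charset(content_type, body: bytes, user_preference: str) -> str:
    """Choose a charset: HTTP header, then meta element, then user preference."""
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # Only scan the start of the document, where the meta element should be.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return user_preference


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7"></head>'
print(pick_charset("text/html; charset=UTF-8", page, "windows-1252"))
print(pick_charset("text/html", page, "windows-1252"))
print(pick_charset(None, b"<html><body>no label</body></html>", "windows-1252"))
```

Only the last case falls through to the user's preference, which is the situation the primer warns against: an unlabelled page is at the mercy of each reader's settings.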
To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames.
It is very important that the character encoding of any XML or (X)HTML document is clearly labeled. This can be done in the following ways:
For HTML, use the <meta> tag. Example:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
These examples show how to include a variety of different language texts in web pages.
If there is only one language on the page, simply choose the correct charset, for example:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">
Use a Common Font. Try to use a non-proprietary standard charset and font.
The following are examples of single strapline files.
Here is a link to W3C page of common charsets. http://www.w3.org/International/O-charset-lang.html
Link to charsets supported by IE5 http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp
A Unicode editor allows you to write in the font that you require and see the font displayed. MS-Windows example http://www.emurasoft.com/emeditor3/index.htm
You will also need to map the keyboard to the charset.
A list of possible ones is available at: http://www.hclrss.demon.co.uk/unicode/utilities_fonts.html
Only a single charset is permitted per file. Therefore other charsets must be converted to a single charset (see section 3 on converting charsets). The recommended charset to use is utf-8, where each character is represented by a variable number of bytes according to UCS Transformation Format 8, defined in Annex P of the amended (PDAM 1) ISO/IEC 10646-1:1993.
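To make "a variable number of bytes" concrete, the following sketch counts the bytes that utf-8 uses for one character from several of the scripts in this primer. It assumes Python 3; the helper `utf8_length` and the sample characters are illustrative.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to store one character in utf-8."""
    return len(ch.encode("utf-8"))


# One sample character per script, chosen for illustration.
samples = {
    "English": "L",
    "German": "ö",
    "Greek": "Ο",
    "Hebrew": "ל",
    "Chinese": "引",
    "Korean": "웹",
}

for language, ch in samples.items():
    print(f"{language}: U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
```

ASCII characters keep their single-byte form, which is one reason utf-8 is a convenient common charset for pages that mix English with other languages.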
In the example below all text lines have been converted to utf-8.
| Language | W3C strapline |
| English | Leading the Web to its Full Potential... |
| Arabic | لإيصال الشبكة المعلوماتية إلى أقصى إمكانياتها... |
| Catalan | Duent la Web al seu ple potencial ... |
| 中文/Chinese - Simplified | 引领网络充分发挥其潜能 |
| 中文/Chinese - Traditional | 引領網絡充分發揮其潛能 |
| Dutch | Het Web tot zijn volle potentieel ontwikkelen... |
| French | Amener le Web vers son plein potentiel... |
| German | Alle Möglichkeiten des Web erschließen |
| Greek | Οδηγώντας τον παγκόσμιο ιστό στο μέγιστο των δυνατοτήτων του... |
| Hebrew | להוביל את הרשת למיצוי הפוטנציאל שלה... |
| Hungarian | Hogy kihasználhassuk a Web nyújtotta összes lehetőséget... |
| Italian | Sviluppare al massimo il potenziale del Web ... |
| 日本語/Japanese | Webの可能性を最大限に導き出すために… |
| Korean | 웹의 모든 잠재력을 이끌어 내기 위하여... |
| Portuguese | Levando a Web em direcção ao seu potencial máximo... |
| Русский язык/Russian | раскрывая весь потенциал Сети... |
| Spanish | Guiando el web a su completo potencial... |
| Swedish | Se till att webben når sin fulla potential ... |
Another example, incorporating multiple languages in one single HTML page, is available for the phrase "I can eat glass and it doesn't hurt me."
For XML, use the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:
<?xml version="1.0" encoding="iso-8859-1"?>
Since SVG is an XML application it also uses the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:
<?xml version="1.0" encoding="iso-8859-1"?>
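An XML processor reads the encoding pseudo-attribute and converts the bytes to Unicode before the application sees them. The sketch below illustrates this with the `xml.etree.ElementTree` parser from the Python 3 standard library; the tiny document is made up for the example.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "é" in iso-8859-1; the declaration tells the parser so.
document = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'

root = ET.fromstring(document)

# The parser hands back Unicode text, whatever the transfer encoding was.
print(root.tag, root.text)
```

If the declaration named the wrong encoding, the same bytes would be turned into the wrong characters or rejected, which is why the label matters.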
To perform the conversion, a converter is required.
Conversion routines exist in languages such as Java (http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html) and Perl (http://www.nihongo.org/snowhare/utilities/modules/unicode-maputf8/). The iconv package is available from GNU (http://www.gnu.org/software/libiconv/) or as an ActiveX control (http://www.chilkatsoft.com/ChilkatIConv.asp), and converters are also provided by Mozilla (http://www.mozilla.org/projects/intl/charset-converters.html).
Tools are also available, often based on the iconv package. These include a Java application (http://www.mandarintools.com/zhcode.html), on-line tools (http://members.tripod.com/~LinkLab/CCU/ and http://unicode.richard.eu.org/me/rch/ll.html) and an MS-Windows tool (http://www.fingertipsoft.com/csconv/index.html).
The last tool uses the conversion tables available from Unicode at ftp://ftp.unicode.org/Public/MAPPINGS/, although it only converts single-byte charsets, not multi-byte East Asian ones.
New ones will appear on the web, and to search for them they may be labelled using some combination of the following terms:
These converters allow you to convert from local charsets to utf-8.
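The conversion itself amounts to decoding the legacy bytes into Unicode and encoding the result as utf-8. Here is a minimal sketch, assuming Python 3 and its built-in codecs rather than any of the tools listed above; the function name `convert_to_utf8` and the Greek sample word are illustrative.

```python
def convert_to_utf8(data: bytes, source_charset: str) -> bytes:
    """Convert bytes in a known local charset to utf-8, using Unicode as the hub."""
    text = data.decode(source_charset)  # legacy bytes -> Unicode characters
    return text.encode("utf-8")         # Unicode characters -> utf-8 bytes


# A Greek word as it would be stored in an iso-8859-7 file.
legacy = "Οδηγώντας".encode("iso-8859-7")
converted = convert_to_utf8(legacy, "iso-8859-7")

print(len(legacy), "bytes in iso-8859-7 ->", len(converted), "bytes in utf-8")
```

The source charset must be known in advance: the bytes alone do not say which encoding produced them.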
Further articles on Character set conversion http://www.codeproject.com/cpp/unicode.asp http://www.microsoft.com/typography/unicode/cs.htm
Charset converter available at http://www.cyrillic.com/csconv/index.html
With the character set information declared in each document, clients can easily map these encodings to Unicode. In practice, a few encodings will be preferred which are non-proprietary international standards, most likely: ISO-8859-1 (Latin-1), US-ASCII, UTF-8, UTF-16, also the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on.
If you are producing Web pages using a proprietary tool, then check that a charset encoding is used and check that it is a non-proprietary standard. It may be that an individual manufacturer will use a proprietary charset (e.g. charset=windows-1251 for Russian, charset=windows-874 for Thai). In this case, the only browser that can view this charset is that produced by that manufacturer - users with other browsers will see impenetrable rubbish on the screen.
The example screens show a page produced using a Microsoft proprietary charset (charset=windows-1256). Viewed on a Microsoft browser on a Microsoft system it appears readable, but on a Netscape browser on a Linux system it is incomprehensible. The web page was available at http://news.bbc.co.uk/hi/arabic/news/
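The effect of reading a page with the wrong charset can be simulated in a few lines. This sketch assumes Python 3 and its built-in codecs; it uses the Russian strapline from this primer and pretends that a client without windows-1251 support falls back to iso-8859-1.

```python
original = "раскрывая весь потенциал Сети"

# The server sends the text as windows-1251 bytes.
sent = original.encode("windows-1251")

# A client that assumes iso-8859-1 turns every byte into the wrong character.
misread = sent.decode("iso-8859-1")

# A client that knows the declared charset recovers the text exactly.
correct = sent.decode("windows-1251")

print("misread:", misread)
print("correct:", correct)
```

No bytes are lost in the misread version; they are simply mapped to the wrong characters, which is what the reader sees as rubbish on the screen.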
If you are producing web pages in English, you must still make the character set declaration. If you do not, then readers whose web browsers default to non-English character encodings may see your web pages as a jumble of incomprehensible strokes. Such readers are becoming the majority of web users, and without a character set declaration you are not presenting your material to them. If you do make the declaration, their browsers will make the mapping and present the English text as you intended.
Microsoft provide a table containing information about the character sets supported by Internet Explorer 5.
Visual Keyboard (http://office.microsoft.com/downloads/2000/viskeyboard.aspx): Microsoft Visual Keyboard is a program that supports typing in more than one language on the same computer by showing you a keyboard for another language on your screen. You might use Visual Keyboard when you change your keyboard layout from one language to another. When you change keyboard layouts, the characters you see as you type might not correspond to your keyboard. Visual Keyboard lets you see the keyboard for the language you've switched to on your screen so that you can either click the keys on your screen or see the correct keys to press to enter text.
For example, you might be working in an English version of Microsoft Word but want to type text in Greek. After you switch keyboard layouts from English to Greek, you can use Visual Keyboard to see the Greek keyboard layout on your screen. To enter Σ in your document, click Σ on the on-screen keyboard, or use Visual Keyboard as a map to press the keys on your keyboard that correspond to the on-screen keys.
Microsoft offers MultiLanguage Pack: each Office 2000 application includes an executable that supports most European, East Asian, and Bi-Directional languages. With the MultiLanguage Pack Office 2000 enables the creation of a single custom installation that works for every language included with the Pack. Further, Office 2000 includes more intelligent language tools and supports Unicode, making it easy for international users to share documents—without having to perform language-related file conversions. This document covers the ways that Office 2000 helps streamline operations for multinational organizations.
In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.
Note: [IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].
For example:
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
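Software that reads xml:lang values usually splits the identifier into its language and country parts. The following is a simplified sketch in Python 3 that handles only the two forms used in this primer (a two-letter language code, optionally followed by a two-letter country code); the names `LanguageTag` and `parse_language_tag` are hypothetical, and registered IANA tags or longer subtags are not covered.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str           # ISO 639 two-letter language code, lower case
    country: Optional[str]  # ISO 3166 two-letter country code, upper case


def parse_language_tag(tag: str) -> LanguageTag:
    """Split a tag such as "en-GB" into its language and country parts."""
    parts = tag.split("-")
    language = parts[0].lower()
    # Treat the second subtag as a country code only if it has two letters.
    country = parts[1].upper() if len(parts) > 1 and len(parts[1]) == 2 else None
    return LanguageTag(language, country)


for tag in ("en", "en-GB", "en-us"):
    print(tag, "->", parse_language_tag(tag))
```

Tags are compared without regard to case, so the sketch normalises "en-us" to the same form as "en-US".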
Full list of two-letter language codes: http://www.oasis-open.org/cover/iso639a.html
Although the English language is read from top left to bottom right on a page, many languages have other reading orders. Each of the web page languages provides some support for alternative reading orders - e.g. right to left in languages such as Hebrew or Arabic, or top to bottom.
Languages like Hebrew read right to left. If they are represented in the normal left to right way in HTML they read like:
להוביל
את הרשת
למיצוי
הפוטנציאל
שלה...
Whereas, if they are explicitly written to be presented right to left in HTML they appear like:
להוביל
את הרשת
למיצוי
הפוטנציאל
שלה...
The difference between these two is the inclusion in the second example of the text alignment in the style statement on the paragraph element:
<p dir=RTL style='text-align:right;direction:rtl;unicode-bidi:embed'>
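A page generator can decide when to add the dir attribute by looking at the directional class that Unicode assigns to each character. The sketch below is a rough heuristic in Python 3 based on the first strongly directional character, not the full Unicode bidirectional algorithm; the function name `base_direction` is hypothetical.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess "rtl" or "ltr" from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew letters are R, Arabic letters are AL
            return "rtl"
        if bidi == "L":          # Latin, Greek, Cyrillic, Han and so on
            return "ltr"
    return "ltr"                 # digits or punctuation only: assume ltr


for sample in ("להוביל את הרשת", "لإيصال", "Leading the Web"):
    print(base_direction(sample), sample)
```

The result could be used to choose between dir=RTL and dir=LTR when the paragraph element is written out.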
Ruby provides markup for Japanese, Chinese and other Asian scripts. Ruby text is a run of text that appears in the immediate vicinity of another run of East Asian text, referred to as the base. Ruby text is often seen in Japanese magazines, and is heavily used in children's reading materials. A sequence of ideographic characters (kanji) is supplemented with the simpler hiragana which show how the word should be pronounced.
It is also possible to use SVG to present multiple languages on the same page. See the file internat_short.svg as an example.
<tspan x="15px" dy="30" style="fill:rgb(0,0,200)">
Duent la Web al seu ple potencial ...
</tspan>
The tspan element is used to give an x position and a vertical offset (dy) for the text, and a fill colour.
Presenting the text, reading order in Arabic and Hebrew
<text x="550px" y="550px" style="fill:rgb(0,200,0);
writing-mode:rl;font-size:18;font-family:Arial Unicode MS">
لإيصال الشبكة المعلوماتية إلى أقصى إمكانياتها...
</text>
The writing-mode property is used to present text right to left (rl).
The writing orientation for Chinese and Japanese characters can also be set, using a writing-mode of top to bottom (tb).
<text x="0px" y="70px" style="fill:rgb(0,200,200); writing-mode:tb; font-size:18;
font-family:'Arial Unicode MS','MS-Gothic','LucidaSansUnicode'">
引领网络充分发挥其潜能
</text>
It is also possible to rotate western characters when they are presented vertically in a Japanese format.
<text x="0px" y="300px" style="fill:rgb(200,0,0); writing-mode:tb;
glyph-orientation-vertical:0; font-size:18;font-family:Arial Unicode MS">
Webの可能性を最大限に導き出すために…
</text>
This is done using glyph-orientation-vertical. Each individual letter of the word Web is rotated to present it in the same way as the Japanese text.
If a charset statement is used it provides the mapping from the characters used to characters in a font. However, not all fonts will support full Unicode character sets. It is necessary to ensure that a font that does map to the chosen character set is available.
Bug in the Adobe plug-in: only the first character of a string is tested for font compatibility, which causes problems for languages that share fonts.
How to set up the main server types to send out the right information in the HTTP headers, depending on local suffixes.
A separate description of language negotiation would be necessary: how to set up the server, how to set the defaults in various browsers, and the like.
Search engines and other web agents are now becoming smarter about the language in which pages are written and how to present them to users.
What do text to speech machines do to non-English materials?
For more information on Accessibility issues see web pages http://www.w3.org/WAI/
Cultural differences in the interpretation of images, colors, symbols
To sell holidays in your country to a foreign citizen, it is often better for pictures to show people who look like the target audience rather than your own ethnic group.
Some cultures require very conservative layouts to engender confidence; others prefer a more graphical, easy-flowing modern style.
The data concepts of name, postal address, and telephone number are only a few of the most common (and many) that must be addressed in order to deploy effective global e-commerce solutions. Examples of common global data concepts, organized by globalisation dimensions:
Cultural (and Demographic)
Geographic
Economic
Regulatory
Throughout time some colors have acquired specific meanings. In Jan van Eyck's Renaissance painting, Giovanni Arnolfini and His Bride, the bride wears a green gown to symbolize fertility.
Green also symbolized fertility in Celtic myth. The Green Man was the God of Fertility. Today, green is the universal symbol of nature and freshness and the contemporary symbol for ecologically beneficial.
| Color by Geography |
| Danger, Anger, Stop | Joy, Festive Occasions | Anger, Danger | Danger, Evil | |
| Caution, Cowardice | Honor, Royalty | Grace, Nobility, Childish, Gaiety | Happiness, Prosperity | |
| Sexual Arousal, Safe, Sour, Go | Youth, Growth | Future, Youth, Energy | Fertility, Strength | |
| Purity, Virtue | Mourning, Humility | Death, Mourning | Purity, Mourning | |
| Masculinity, Calm, Authority | Strength, Power | Villainy | ||
| Death, Evil | Evil | Evil | Mystery, Evil |
Is software available in your language or script?
If it is not, then the reason is probably that developers do not have enough information. Think about it: if you were an American developer and you wanted to add Chinese support (or Malaysian or Ukrainian), where would you find the information?
It is the responsibility of Government, academia, national standards bodies and people of goodwill to ensure that there is enough information available.
This article gives a checklist for the information that is required in the particular case of XML. But Governments would do well to start projects to collect and publish all the information needed for internationalization ("i18n") of software for their nation.
Before you get to XML...
Characters
Software
There is an ISO technical report which lists these kinds of information.
Typesetting
XML
It tells you how to make your pages readable across the world.
This document has benefited from inputs from many members of the W3C World Offices who provided valuable contributions to this document.