
Primer on the internationalisation and localisation of web pages

W3C UK and Ireland Office Draft 1 September 2002

This version:
http://www.w3c.rl.ac.uk/QH/WP5/WD-int-primer-20020901.html
Editors:
Martin Prime, CLRC, M.J.Prime@rl.ac.uk
Michael Wilson, CLRC, M.D.Wilson@rl.ac.uk

Abstract

This primer sets out to explain the methods for internationalising and localising Web pages. W3C's work in this area is to make sure that formats and protocols are usable worldwide in all languages and in all writing systems. Commercial suppliers provide tools which apply these recommendations, many of which are referenced in this primer. The role of the primer is to guide those procuring web sites, as well as those designing and developing them, in making their web sites as accessible across the world as possible and gaining the largest possible audience for their web pages.

Status of this Document

This primer is being produced by the UK and Ireland Office of W3C as a deliverable in the EU funded project Question How and does not conform to the W3C process for documents.

This document is being released for review by interested parties to encourage feedback and comments. This is the current state of an ongoing work on the primer.

This is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use it as reference material or to cite as other than "work in progress".

Comments on this document are invited and should be sent to the editor m.d.wilson@w3.org.

Table of Contents

  1. Introduction
    1. Global Markets
    2. Languages - HTML, XHTML, SVG and other technologies
  2. Character sets
    1. Unicode
    2. Character sets in HTML
      1. Multiple languages on one HTML page
    3. Character sets in XML
    4. Character sets in SVG
  3. Converting between Charsets
  4. Charsets and Browsers
  5. Typing charsets
  6. XML Lang
  7. Directionality and reading order
    1. Directionality and reading order in HTML
    2. Directionality and reading order in XML
    3. Directionality and reading order in SVG
  8. Fonts, charsets and Unicode
    1. Fonts, charsets and Unicode in SVG
    2. Fonts, charsets and Unicode in XML
    3. Fonts, charsets and Unicode in HTML
  9. Internationalising Servers
    1. Language Negotiation on Servers
  10. Internationalising and localising Voice XML
  11. Search engines
  12. Accessibility
  13. Culture
    1. Internationalisation Dimensions
    2. Meaning of Color in Cultures
  14. A Checklist for Internationalising web pages
  15. Summary
  16. References
  17. Acknowledgments

1. Introduction

Results of recent surveys of web pages and web usage by Global Reach and FUNDREDES show that English-language content is now down to 40% of total web content; the remaining 60% is presented in other languages. Similarly, web users are now mostly non-native English speakers whose browsers default to the character set of another language. Extrapolating these figures suggests that the rise of non-English languages on the web will continue, particularly for Far Eastern languages.

The World Wide Web is becoming more "world wide" every day. Hardware and software are produced for the global market. It needs to be easy to create and process information for a wide range of audiences: to publish material and exchange data in Arabic, Chinese, French, Japanese, Korean, Hebrew, or Thai. Languages, writing systems, character codes, and other local conventions should not form barriers to W3C technology. The goal is to ensure that W3C's formats and protocols are usable worldwide in all languages and in all writing systems.

W3C has successfully stressed the role of Unicode as the base of the architecture of the Web. Recommendations from W3C for data formats and protocols use ISO 10646/Unicode to identify and describe characters. In implementations, Unicode is the hub for conversion between different character encodings. Once your data is in Unicode, it can all be handled in a uniform way and displayed, searched, sorted, and manipulated without fear of data corruption. Unicode covers virtually all legacy character repertoires, including ASCII, Latin-1, JIS X 0208, etc.

However, you have to state on your web pages which character set and which language you are using. Otherwise they may not be presented correctly.

Beyond character sets you have to internationalise and localise your pages to the cultural expectations of your users. Cultural diversity is too large a topic to be covered in this brief primer. To address the issue one simple example of the use of colour on web pages will be considered.

1.1. Global Markets

The number of languages used on the Web is expected to grow rapidly. In 2001 English was still the major language on the Web; by April 2002 it was used for only 40% of web pages. The growth of non-English web pages will continue to reduce the proportion of English on the Web, and with it will grow the number of users whose browsers default to a non-English language.

1.2. Languages - HTML, XHTML, SVG and other technologies.

The history of web languages began with the creation of the web. In 1989, Tim Berners-Lee and his associates at the research centre known as CERN (the French acronym for the European Laboratory for Particle Physics) in Geneva, Switzerland, invented a series of communications protocols that would present information in documents that could be linked to other documents and stored on computers throughout the Internet. He also developed the HyperText Markup Language to create and view documents on the Web. The first Web documents were text-only and the browser used to retrieve and view these documents was a crude text reader.

The Web reached a wide public audience in 1993, when the National Center for Supercomputing Applications (NCSA) released an early UNIX version of the Mosaic Web browser. Marc Andreessen, who was at the time a student at the University of Illinois, was one of the creators of Mosaic. Mosaic used icons, pull-down menus, bit-mapped graphics and colourful links to display hypertext documents. Later in 1993, versions of Mosaic were created for the Macintosh and Windows operating systems. Because of this development, the Web exploded into the information revolution and cultural phenomenon we know today.

As the popularity of the web increased, users and developers wanted to implement more and more functionality for the web. They grew tired of still images and stale web pages. They wanted to have animation and movement, and content generated on-the-fly. These desires prompted the development of most web languages that are around today. The class of markup languages used on the Web, which began with HTML, has grown to incorporate many different languages including XML, MathML, and others; HTML itself was defined as an application of the older SGML. The Common Gateway Interface (CGI) was designed to interface the web with external applications, so that code could be run on a server machine. Many scripting languages were developed which would run in-line code on the client as well. The development of Java opened up a new door by enabling program code to be executed on any machine, regardless of its architecture.

Originally developers generally ignored character sets. Since one ANSI character set can handle Western European languages like English, French, German, Italian and Spanish, other languages were considered special cases or not handled at all.

Many, but not all of the world's major writing systems can be represented within 256 characters, using individual 8-bit character sets. It's important to note there isn't an 8-bit character set which can represent all of these languages at once, or even just the languages required by the European Union.

Languages which require more than 256 characters include Chinese (Traditional and Simplified), Japanese, and Korean (Hangeul). Any application which touches text in these languages must correctly handle double-byte character set (DBCS) or Unicode string processing and data.

2. Character sets

The first issue to be addressed is that of character sets for each of the languages used in web pages.

2.1 Unicode

Computers store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
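The conflict between encodings can be illustrated with a short Python sketch. The byte value and the three encodings are chosen only for illustration; any byte above 0x7F would show a similar effect.

```python
# One byte value read under three legacy encodings gives three characters.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")      # Western European
as_greek = raw.decode("iso-8859-7")       # Greek
as_cyrillic = raw.decode("windows-1251")  # Cyrillic

# The same character can also have different numbers in different encodings.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
e_acute_latin1 = e_acute.encode("iso-8859-1")
e_acute_utf8 = e_acute.encode("utf-8")

print(as_latin1, as_greek, as_cyrillic)
print(e_acute_latin1.hex(), e_acute_utf8.hex())
```

Data that is passed along without its encoding label is therefore open to being read as the wrong characters.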

Unicode changes this as Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode.

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

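Unicode was originally designed as a 16-bit character set, and the characters commonly used in information processing sit in its first 64k code points, the Basic Multilingual Plane. Approximately 1/3 of these 64k code points are still unassigned, to allow room for adding additional characters in the future. Later versions of the standard extend the code space beyond 16 bits, up to U+10FFFF, so software should not assume that every character fits in a single 16-bit unit.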
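Later versions of the Unicode Standard assign characters beyond the first 64k code points. The following Python sketch, using two sample ideographs chosen only for illustration, compares a character inside that range with one outside it.

```python
bmp_char = "\u4e2d"       # a Han ideograph within the first 65,536 code points
supp_char = "\U00020000"  # a Han ideograph beyond U+FFFF

for ch in (bmp_char, supp_char):
    code_point = ord(ch)
    utf16_units = len(ch.encode("utf-16-be")) // 2  # number of 16-bit units
    utf8_bytes = len(ch.encode("utf-8"))
    print(hex(code_point), utf16_units, utf8_bytes)
```

Code that treats one 16-bit unit as one character will miscount the second ideograph, which needs two such units.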

Unicode is not a technology in itself. Sometimes people misunderstand Unicode and expect it to 'solve' international engineering, which it doesn't. Unicode is an agreed-upon way to store characters, a standard supported by members of the Unicode Consortium (e.g. Microsoft).

The fundamental idea behind Unicode is to be language-independent, which helps conserve space in the character map - no single character is assumed to identify a language in itself. Just like a character "a" can be a French, German or English "a" even if they have different meanings, a particular Han ideograph might map to a character used in Chinese, Japanese and Korean. Sometimes native speakers of these languages misunderstand Unicode as not "looking" correct in Japanese for example, but that's intentional - appearance should reside in the font as an artistic issue, not the code point as an engineering issue. Although it's technically possible to ship one font which covers all Unicode characters, it would have very limited commercial use, since end-users in Asia will expect fonts dedicated and designed to look correct in their language.

2.2 Character sets in HTML

Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can (and should!) have a charset parameter, which specifies the character encoding of the document. HTTP 1.1 says that the default charset is ISO-8859-1, but because there are still too many unlabeled documents in various encodings, browsers use the reader's preferred encoding when they don't get the information, on the assumption that most readers read documents in their own language. Therefore it is important to always label Web documents explicitly.

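The line in the HTTP header typically looks like this:

Content-Type: text/html; charset=iso-8859-7

The equivalent meta element inside the document looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">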

Any character encoding that has been registered with IANA can be used, but it may be too much to ask of a browser to understand all of them. Some people have suggested limiting the allowed encodings to just ASCII, ISO-8859-1, UTF-8 and UTF-16. (See http://www.w3.org/International/O-charset-list.html for an indicative list of encodings supported by major browsers.)

How to make the server send out appropriate 'charset' information depends on the server. Microsoft Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document. It uses the user's preferences if no meta element is specified.
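The order of precedence described above (HTTP header first, then the meta element, then the user's preference) can be sketched in Python. This is a minimal illustration and not how any particular browser is implemented: the function names are hypothetical, and the regular expression is a deliberately simplified stand-in for a real pre-scan of the document.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Simplified, hypothetical pattern for a charset inside a meta element.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def charset_from_meta(html_bytes):
    """Return the charset named in a meta element near the start, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type, html_bytes, user_default):
    """Apply the precedence: HTTP header, then meta element, then user default."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(html_bytes)
        or user_default
    )


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
print(pick_charset("text/html; charset=iso-8859-7", page, "windows-1252"))
print(pick_charset("text/html", page, "windows-1252"))
print(pick_charset(None, b"<html></html>", "windows-1252"))
```

Only the last call falls back to the user's preference, which is the case where a reader is most likely to see the wrong characters.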

To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames.

It is very important that the character encoding of any XML or (X)HTML document is clearly labeled. This can be done in the following ways:

For HTML, use the <meta> tag. Example:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

These examples show how to include a variety of different language texts in web pages.

If there is only one language on the page, simply choose the correct charset, for example:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">

Use a Common Font. Try to use a non-proprietary standard charset and font.

The following are examples of single strapline files.


german.html
korean.html

W3C provides a page of common charsets at http://www.w3.org/International/O-charset-lang.html

The charsets supported by IE5 are listed at http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp

A Unicode editor allows you to write in the font that you require and see it displayed. An MS-Windows example is available at http://www.emurasoft.com/emeditor3/index.htm

You will also need to map the keyboard to the charset. A list of possible utilities is available at: http://www.hclrss.demon.co.uk/unicode/utilities_fonts.html

2.2.1 Multiple languages on one HTML page

Only a single charset is permitted per file. Therefore other charsets must be converted to a single charset - see section 3 on converting charsets. The recommended charset to use is utf-8, where each character is represented by a variable number of bytes according to UCS Transformation Format 8 defined in Annex P of the amended (PDAM 1) ISO/IEC 10646-1:1993.
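The variable number of bytes used by utf-8 can be seen with a small Python sketch. The sample characters are taken from the straplines in this section; the helper name is illustrative only.

```python
samples = [
    ("English", "W"),
    ("Greek", "\u039f"),    # GREEK CAPITAL LETTER OMICRON
    ("Hebrew", "\u05dc"),   # HEBREW LETTER LAMED
    ("Chinese", "\u5f15"),  # Han ideograph
    ("Korean", "\uc6f9"),   # Hangul syllable
]


def utf8_length(text):
    """Return the number of bytes text occupies when encoded as utf-8."""
    return len(text.encode("utf-8"))


for language, ch in samples:
    print(language, utf8_length(ch))
```

ASCII characters stay at one byte each, which is why an English-only page looks the same in utf-8 as in US-ASCII.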

In the example below all text lines have been converted to utf-8.

Language | W3C strapline
English | Leading the Web to its Full Potential...
Arabic | لإيصال الشبكة المعلوماتية إلىأقصى إمكانياتها...
Catalan | Duent la Web al seu ple potencial ...
中文/Chinese - Simplified | 引领网络充分发挥其潜能
中文/Chinese - Traditional | 引領網絡充分發揮其潛能
Dutch | Het Web tot zijn volle potentieel ontwikkelen...
French | Amener le Web vers son plein potentiel...
German | Alle Möglichkeiten des Web erschließen
Greek | Οδηγώντας τον παγκόμιο ιστό στο μέγιστο των δυνατοτήτων του...
Hebrew | להוביל את הרשת למיצוי הפוטנציאל שלה...
Hungarian | Hogy kihasználhassuk a Web nyújtotta összes lehetőséget...
Italian | Sviluppare al massimo il potenziale del Web ...
日本語/Japanese | Webの可能性を最大限に導き出すために…
Korean | 웹의 모든 잠재력을 이끌어 내기 위하여...
Portuguese | Levando a Web em direcção ao seu potencial máximo...
Русский язык/Russian | раскрывая весь потенциал Сети...
Spanish | Guiando el web a su completo potencial...
Swedish | Se till att webben når sin fulla potential ...

Another example incorporating multiple languages on a single HTML page is available for the phrase "I can eat glass and it doesn't hurt me."

2.3 Character sets in XML

For XML, use the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:

<?xml version="1.0" encoding="iso-8859-1"?>
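The effect of the encoding pseudo-attribute can be checked with Python's standard XML parser. This is a sketch under stated assumptions: the element name strapline and the helper functions are invented for illustration, and the German strapline is used as sample text.

```python
import xml.etree.ElementTree as ET

text = "Alle M\u00f6glichkeiten des Web erschlie\u00dfen"  # German strapline


def build(declared_encoding, actual_encoding):
    """Return a small XML document as bytes, labelled with declared_encoding."""
    declaration = '<?xml version="1.0" encoding="%s"?>' % declared_encoding
    body = "<strapline>%s</strapline>" % text
    return (declaration + body).encode(actual_encoding)


def parse_text(document):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


correct = parse_text(build("iso-8859-1", "iso-8859-1"))
mislabelled = parse_text(build("utf-8", "iso-8859-1"))
print(correct)
print(mislabelled)
```

When the label matches the bytes the text is recovered intact; when iso-8859-1 bytes are labelled as utf-8 the parser rejects the document.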

2.4 Character sets in SVG

Since SVG is an XML application it also uses the encoding pseudo-attribute in the xml declaration at the start of a document or the text declaration at the start of an entity. Example:

<?xml version="1.0" encoding="iso-8859-1"?>

3. Converting between Charsets

To perform the conversion, a converter is required.

Conversion routines exist for several languages and platforms:

  • Java: http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html
  • Perl: http://www.nihongo.org/snowhare/utilities/modules/unicode-maputf8/
  • The iconv package, available from GNU: http://www.gnu.org/software/libiconv/
  • An ActiveX control: http://www.chilkatsoft.com/ChilkatIConv.asp
  • Mozilla: http://www.mozilla.org/projects/intl/charset-converters.html

Tools are also available, often based on the iconv package. These include:

  • Java application: http://www.mandarintools.com/zhcode.html
  • Online tool: http://members.tripod.com/~LinkLab/CCU/ http://unicode.richard.eu.org/me/rch/ll.html
  • MS-Windows tool: http://www.fingertipsoft.com/csconv/index.html

The last tool uses the conversion tables available from Unicode at ftp://ftp.unicode.org/Public/MAPPINGS/, although it only converts single-byte charsets, not multi-byte East Asian ones.

New converters will appear on the web. To search for them, try some combination of the following terms:

  • Character converter
  • Encoding translation
  • Charset conversion

These converters allow you to convert from local charsets to utf-8.
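In a language with built-in codecs the conversion itself is short. The following Python sketch is one possible approach, not a description of how any of the tools listed above work; the function name convert is illustrative.

```python
def convert(data, source_charset, target_charset="utf-8"):
    """Re-encode bytes from source_charset to target_charset via Unicode."""
    # Decoding is strict by default: bytes that are invalid in the source
    # charset raise UnicodeDecodeError instead of being silently corrupted.
    return data.decode(source_charset).encode(target_charset)


greek_text = "Οδηγώντας"
greek_legacy = greek_text.encode("iso-8859-7")  # single-byte charset
greek_utf8 = convert(greek_legacy, "iso-8859-7")

japanese_text = "Webの可能性"
japanese_legacy = japanese_text.encode("euc-jp")  # multi-byte charset
japanese_utf8 = convert(japanese_legacy, "euc-jp")

print(len(greek_legacy), len(greek_utf8))
print(len(japanese_legacy), len(japanese_utf8))
```

The source charset has to be known in advance; a converter cannot reliably guess it from the bytes alone.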

Further articles on character set conversion are available at http://www.codeproject.com/cpp/unicode.asp and http://www.microsoft.com/typography/unicode/cs.htm

A charset converter is available at http://www.cyrillic.com/csconv/index.html

4. Charsets and Browsers

With the character set information declared in each document, clients can easily map these encodings to Unicode. In practice, a few encodings will be preferred which are non-proprietary international standards, most likely: ISO-8859-1 (Latin-1), US-ASCII, UTF-8, UTF-16, also the other encodings in the ISO-8859 series, iso-2022-jp, euc-kr, and so on.

If you are producing Web pages using a proprietary tool, then check that a charset encoding is used and check that it is a non-proprietary standard. It may be that an individual manufacturer will use a proprietary charset (e.g. charset=windows-1251 for Russian, charset=windows-874 for Thai). In this case, the only browser that can view this charset may be the one produced by that manufacturer; users with other browsers will see impenetrable rubbish on the screen.

[Figure: two screenshots of the same page. Left: Microsoft charset (charset=windows-1256) on a Microsoft browser. Right: Microsoft charset (charset=windows-1256) on a Linux browser.]

The example screens show a page produced using a Microsoft proprietary charset (charset=windows-1256). Viewed on a Microsoft browser on a Microsoft system it appears readable, but on a Netscape browser on a Linux system it is incomprehensible. The web page was available at http://news.bbc.co.uk/hi/arabic/news/

If you are producing web pages in English, you must still make the character set declaration. If you do not, then readers whose web browsers default to non-English character encodings may see your web pages as a jumble of incomprehensible strokes. Such readers are becoming the majority of web users, and without a character set declaration you are not presenting your material to them. If you do make the declaration, then their browsers will make the mapping and present the English text as you intended.

Microsoft provide a table containing information about the character sets supported by Internet Explorer 5.

5. Typing charsets

Visual Keyboard (http://office.microsoft.com/downloads/2000/viskeyboard.aspx): Microsoft Visual Keyboard is a program that supports typing in more than one language on the same computer by showing you a keyboard for another language on your screen. You might use Visual Keyboard when you change your keyboard layout from one language to another. When you change keyboard layouts, the characters you see as you type might not correspond to your keyboard. Visual Keyboard lets you see the keyboard for the language you've switched to on your screen so that you can either click the keys on your screen or see the correct keys to press to enter text.

For example, you might be working in an English version of Microsoft Word but want to type text in Greek. After you switch keyboard layouts from English to Greek, you can use Visual Keyboard to see the Greek keyboard layout on your screen. To enter Σ in your document, click Σ on the on-screen keyboard, or use Visual Keyboard as a map to press the keys on your keyboard that correspond to the on-screen keys.

Microsoft offers the MultiLanguage Pack: each Office 2000 application includes an executable that supports most European, East Asian, and bi-directional languages. With the MultiLanguage Pack, Office 2000 enables the creation of a single custom installation that works for every language included with the Pack. Further, Office 2000 includes more intelligent language tools and supports Unicode, making it easy for international users to share documents without having to perform language-related file conversions.

6. XML Lang (Code for the Representation of the Names of Languages. From ISO 639, revision 19)

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.

Note: [IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].

For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>

A full list of two-letter language codes is available at http://www.oasis-open.org/cover/iso639a.html
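A program that processes XML can read xml:lang like any other attribute, and can apply the value of an ancestor to elements that do not set their own. The Python sketch below shows one way to do this with the standard library; the doc wrapper element and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(xml_text):
    """Return (tag, effective language, text) for every element in order."""
    def walk(element, inherited):
        lang = element.get(XML_LANG, inherited)  # inherit when not set
        yield element.tag, lang, (element.text or "").strip()
        for child in element:
            yield from walk(child, lang)

    return list(walk(ET.fromstring(xml_text), None))


sample = """<doc xml:lang="en">
  <p>The quick brown fox jumps over the lazy dog.</p>
  <p xml:lang="en-GB">What colour is it?</p>
  <p xml:lang="en-US">What color is it?</p>
</doc>"""

result = languages(sample)
for tag, lang, text in result:
    print(tag, lang, text)
```

The first paragraph carries no xml:lang of its own and so takes "en" from the enclosing element.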

7. Directionality and reading order

Although the English language is read from top left to bottom right on a page, many languages have other reading orders. Each of the web page languages provides some support for alternative reading orders - e.g. right to left in languages such as Hebrew or Arabic, or top to bottom.

7.1 Directionality and reading order in HTML

Languages like Hebrew read right to left. If they are represented in the normal left to right way in HTML they read like:

להוביל את הרשת למיצוי הפוטנציאל שלה...

Whereas, if they are explicitly written to be presented right to left in HTML, they appear like:

להוביל את הרשת למיצוי הפוטנציאל שלה...

The difference between these two is that the second example sets the direction and the text alignment on the paragraph element, through the dir attribute and the style statement:

<p dir=RTL style='text-align:right;direction:rtl;unicode-bidi:embed'>
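When pages are generated by a program, the value of the dir attribute can be chosen from the text itself. The Python sketch below uses the first strongly directional character, which is a common heuristic rather than a complete implementation of the Unicode bidirectional algorithm; the function names are illustrative.

```python
import html
import unicodedata


def base_direction(text):
    """Return "rtl" or "ltr" from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew letters are R, Arabic are AL
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"  # assumption: default to left to right if nothing is strong


def paragraph(text):
    """Wrap text in a p element carrying the detected direction."""
    return '<p dir="%s">%s</p>' % (base_direction(text), html.escape(text))


hebrew = "\u05dc\u05d4\u05d5\u05d1\u05d9\u05dc"  # first word of the Hebrew strapline
print(paragraph(hebrew))
print(paragraph("Leading the Web to its Full Potential..."))
```

Digits and punctuation are skipped because they have no strong direction of their own.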

Ruby markup supports Japanese, Chinese and other Asian scripts. Ruby text is a run of text that appears in the immediate vicinity of another run of East Asian text, referred to as the base. Ruby text is often seen in Japanese magazines, and is heavily used in children's reading materials. A sequence of ideographic characters (kanji) is supplemented with the simpler hiragana, which show how the word should be pronounced.

See also http://www.microsoft.com/mind/defaulttop.asp?page=/mind/1099/localize/localize.htm&nav=/mind/1099/inthisissuecolumns1099.htm

7.2. Directionality and reading order in XML

7.3. Directionality and reading order in SVG

It is also possible to use SVG to present multiple languages on the same page. See the file internat_short.svg as an example.

<tspan x="15px" dy="30" style="fill:rgb(0,0,200)">
        Duent la Web al seu ple potencial ...
</tspan>

The tspan element is used to give an x position and a relative y offset (dy) for the text, and a fill colour.

The reading order for Arabic and Hebrew text can be set as follows:

<text x="550px" y="550px" style="fill:rgb(0,200,0);
writing-mode:rl;font-size:18;font-family:Arial Unicode MS">
           لإيصالالشبكة المعلوماتية إلىأقصى إمكانياتها......
</text>

The writing-mode property is used to present text right to left (rl).

The writing orientation for Chinese and Japanese characters can also be set, using a writing-mode of top to bottom (tb):

<text x="0px" y="70px" style="fill:rgb(0,200,200); writing-mode:tb; font-size:18;
font-family:'Arial Unicode MS','MS-Gothic','LucidaSansUnicode'">
       		引领网络充分发挥其潜能
</text>

It is also possible to rotate western characters within Japanese text when it is presented vertically.

<text x="0px" y="300px" style="fill:rgb(200,0,0); writing-mode:tb;
glyph-orientation-vertical:0; font-size:18;font-family:Arial Unicode MS">
       		Webの可能性を最大限に導き出すために…
</text>

This is done using glyph-orientation-vertical. Each individual letter of the word "Web" is rotated so that it is presented in the same way as the Japanese text.

8. Fonts, charsets and Unicode

If a charset statement is used it provides the mapping from the characters used to characters in a font. However, not all fonts will support full Unicode character sets. It is necessary to ensure that a font that does map to the chosen character set is available.

8.1 Fonts, charsets and Unicode in SVG

There is a bug in the Adobe plug-in: only the first character of a string is tested for font compatibility, which causes problems for languages that share fonts.

9. Internationalising Servers

This section will describe how to set up the main server types so that they send out the right information in the HTTP headers, depending on local file suffixes.

9.1 Language Negotiation on Servers

A separate description of language negotiation is needed: how to set up the server, how to set the defaults in various browsers, and the like.

10. Internationalising and localising Voice XML

11. Search engines

Search engines and other web agents are now becoming smarter about the language in which pages are written and how to present them to users.

12. Accessibility

What do text to speech machines do to non-English materials?

For more information on Accessibility issues see web pages http://www.w3.org/WAI/

13. Culture

Cultures differ in their interpretation of images, colors and symbols.

To sell holidays in your country to foreign citizens, it is often better for pictures to show people who look like the target audience than people of your own ethnic group.

Some cultures require very conservative layouts to engender confidence; others prefer a more graphical, free-flowing modern style.

13.1 Internationalisation Dimensions

The data concepts of name, postal address, and telephone number are only a few of the most common (and many) that must be addressed in order to deploy effective global e-commerce solutions. Examples of common global data concepts, organized by globalisation dimensions:

Cultural (and Demographic)

  • Person Name (cultural, ethnic, religious)
  • Honorifics, Titles, Suffixes
  • Business Profession, Title
  • Language (international and industry standard coding, dialect)
  • Marital Status

Geographic

  • Address (international postal address structure)
  • Country (international and industry standard geography coding)
  • Time Zone (international standards, seasonal offsets)

Economic

  • Currency
  • Bank Account Numbering (structure and format)
  • Taxation Type (description and coding)

Regulatory

  • Privacy Acts (international, recommendations, commercial)
  • Restrictions regarding the solicitation, collection, storage, and use of demographic data
  • Patent, Copyright and Trademark

For a full list of common global data concepts, organized by globalisation dimensions, see: http://www.globalwebarch.com/freesite/index.html

13.2 Meaning of Color in Cultures

Throughout time some colors have acquired specific meanings. In Jan van Eyck's Renaissance painting, Giovanni Arnolfini and His Bride, the bride wears a green gown to symbolize fertility.

[Image: Giovanni Arnolfini and His Bride, by Jan Van Eyck, 1434]

Green also symbolized fertility in Celtic myth: the Green Man was the god of fertility. Today, green is the universal symbol of nature and freshness and the contemporary symbol of what is ecologically beneficial.

Color by Geography

Color | Western Europe & USA | China | Japan | Middle East
Red | Danger, Anger, Stop | Joy, Festive Occasions | Anger, Danger | Danger, Evil
Yellow | Caution, Cowardice | Honor, Royalty | Grace, Nobility, Childish, Gaiety | Happiness, Prosperity
Green | Sexual Arousal, Safe, Sour, Go | Youth, Growth | Future, Youth, Energy | Fertility, Strength
White | Purity, Virtue | Mourning, Humility | Death, Mourning | Purity, Mourning
Blue | Masculinity, Calm, Authority | Strength, Power | Villainy | -
Black | Death, Evil | Evil | Evil | Mystery, Evil


For more information see http://library.thinkquest.org/50065/index.html

14. A Checklist for Internationalising web pages

Is software available in your language or script?

If it is not, then the reason is probably that developers do not have enough information. Think about it: if you were an American developer and you wanted to add Chinese support (or Malaysian or Ukrainian), where could you find it?

It is the responsibility of Government, academia, national standards bodies and people of goodwill to ensure that there is enough information available.

This article gives a checklist for the information that is required in the particular case of XML. But Governments would do well to start projects to collect and publish all the information needed for internationalization ("i18n") of software for their nation.

Before you get to XML...

Characters

  • Standardize character repertoire and representative glyphs
  • Standardize character names
  • Standardize character collation sequence
  • Standardize other character properties (spacing, punctuation)
  • Standardize character encoding
  • Standardize names for the scripts which use these characters
  • Standardize names for the languages which use these scripts
  • Standardize names for all variant scripts and dialects
  • Submit the characters, properties and names to ISO, for inclusion into ISO 10646, ISO 639, and ISO 3166, to the Unicode Consortium, to IANA, and to the Java people.

Software

  • Create reference character set translation utilities between your character set and UTF-8
  • Create a standard keyboard layout, and sample input method specification
  • Create a reference implementation of the input method software for the major operating systems

There is an ISO technical report which lists these kinds of information.

Typesetting

  • Create reference font
  • Study and list all typesetting customs and traditions in your locale, language or script. Make this list available to ISO SC 34 (DSSSL) and W3C style-sheet Working Group.
  • Create standard element types for any unique typesetting customs.

XML

  • Create a list of translations of standard XML or SGML terminology
  • Create a FAQ relating XML issues to your language and script in particular
  • Create test files featuring your language and character encodings
  • Create reference ports for leading back-end text processing software: SP, PERL
  • Put all this on the WWW at a central location: put it all in your language and in English. Register your site with Robin Cover's SGML & XML Website.
  • Provide a forum for feedback and discussion
  • Contact leading (Western) developers to discuss what information they require
  • Create standard web pages for error messages in your language, which products can link to for instant localization

15. Summary

This primer tells you how to make your pages readable across the world.

17. Acknowledgments

This document has benefited from inputs from many members of the W3C World Offices who provided valuable contributions to this document.

-----------------------------------------------------------------