pytho****@googl*****
Fri, 20 May 2011 16:17:33 JST
3 new revisions:

Revision: 9924099de4c5
Author: Akihiro Uchida <uchid****@ike-d*****>
Date: Wed May 18 13:35:26 2011
Log: Apply upstream updates to the original English text
http://code.google.com/p/python-doc-ja/source/detail?r=9924099de4c5

Revision: 67be0327a8fe
Author: Akihiro Uchida <uchid****@ike-d*****>
Date: Fri May 20 00:14:47 2011
Log: translate howto/unicode.rst
http://code.google.com/p/python-doc-ja/source/detail?r=67be0327a8fe

Revision: e94aea17f93c
Author: Akihiro Uchida <uchid****@ike-d*****>
Date: Fri May 20 00:15:24 2011
Log: merge
http://code.google.com/p/python-doc-ja/source/detail?r=e94aea17f93c

==============================================================================
Revision: 9924099de4c5
Author: Akihiro Uchida <uchid****@ike-d*****>
Date: Wed May 18 13:35:26 2011
Log: Apply upstream updates to the original English text
http://code.google.com/p/python-doc-ja/source/detail?r=9924099de4c5

Modified: /howto/unicode.rst
=======================================
--- /howto/unicode.rst	Sat Dec  4 02:43:38 2010
+++ /howto/unicode.rst	Wed May 18 13:35:26 2011
@@ -210,11 +210,12 @@
 to reading the Unicode character tables, available at
 <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
 
-Two other good introductory articles were written by Joel Spolsky
-<http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff
-<http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make
-things clear to you, you should try reading one of these alternate articles
-before continuing.
+Another good introductory article was written by Joel Spolsky
+<http://www.joelonsoftware.com/articles/Unicode.html>.
+If this introduction didn't make things clear to you, you should try reading this
+alternate article before continuing.
+
+.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
 
 Wikipedia entries are often helpful; see the entries for "character encoding"
 <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
 <http://en.wikipedia.org/wiki/UTF-8>, for example.
@@ -471,7 +472,7 @@
 from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
 "Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
 other". See
-<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a
+<http://unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values> for a
 list of category codes.
 
 References

==============================================================================
Revision: 67be0327a8fe
Author: Akihiro Uchida <uchid****@ike-d*****>
Date: Fri May 20 00:14:47 2011
Log: translate howto/unicode.rst
http://code.google.com/p/python-doc-ja/source/detail?r=67be0327a8fe

Modified: /howto/unicode.rst
=======================================
--- /howto/unicode.rst	Wed May 18 13:35:26 2011
+++ /howto/unicode.rst	Fri May 20 00:14:47 2011

This revision retitles the page from "Unicode HOWTO (英語)" to "Unicode HOWTO"
and replaces each English section with a reST-commented copy of the original
followed by its Japanese translation; the resulting document reads:

*****************
 Unicode HOWTO
*****************

:Release: 1.02

This HOWTO discusses Python's support for Unicode, and explains various
problems that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known
by its acronym ASCII, was standardized.  ASCII defined numeric codes for
various characters, with the numeric values running from 0 to 127.  For
example, the lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters.
There was an 'e', but no 'é' or 'Í'.  This meant that languages which required
accented characters couldn't be faithfully represented in ASCII.  (Actually
the missing accents matter for English, too, which contains words such as
'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.  I
remember looking at Apple ][ BASIC programs, published in French-language
publications in the mid-1980s, that had lines like these::

    PRINT "FICHER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who
can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes
could hold values ranging from 0 to 255.  ASCII codes only went up to 127, so
some machines assigned values between 128 and 255 to accented characters.
Different machines had different codes, however, which led to problems
exchanging files.  Eventually various commonly used sets of values for the
128-255 range emerged.  Some were true standards, defined by the International
Standards Organization, and some were **de facto** conventions that were
invented by one company or another and managed to catch on.

255 characters aren't very many.  For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128-255 range because there are more than 127 such characters.
You could write files using different codes (all your Russian files in a
coding system called KOI8, all your French files in a different coding system
called Latin1), but what if you wanted to write a French document that quotes
some Russian text?  In the 1980s people began to want to solve this problem,
and the Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it
possible to represent many different characters from many different alphabets;
an initial goal was to have Unicode contain the alphabets for every single
human language.  It turns out that even 16 bits isn't enough to meet that
goal, and the modern Unicode specification uses a wider range of codes,
0-1,114,111 (0x10ffff in base-16).

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified.  I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)
Definitions
-----------

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters are
abstractions, and vary depending on the language or context you're talking
about.  For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**.  A code point is an integer value, usually denoted in base 16.  In
the standard, a code point is written using the notation U+12ca to mean the
character with value 0x12ca (4810 decimal).
The Unicode standard contains a lot of tables listing characters and their
corresponding code points::

    0061    'a'; LATIN SMALL LETTER A
    0062    'b'; LATIN SMALL LETTER B
    ...
    007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'.  U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
In informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for
example, is two diagonal strokes and a horizontal stroke, though the exact
details will depend on the font being used.  Most Python code doesn't need to
worry about glyphs; figuring out the correct glyph to display is generally the
job of a GUI toolkit or a terminal's font renderer.
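
You can move between the two views of U+12ca directly from the interpreter.
A small sketch, assuming only the standard :mod:`unicodedata` module (which
this section doesn't otherwise introduce)::

    >>> import unicodedata
    >>> unicodedata.name(unichr(0x12ca))   # official name for this code point
    'ETHIOPIC SYLLABLE WI'
    >>> ord(u'\u12ca')                     # and back from character to integer
    4810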
Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 to 0x10ffff.  This sequence needs to be
represented as a set of bytes (meaning, values from 0-255) in memory.  The
rules for translating a Unicode string into a sequence of bytes are called an
**encoding**.

The first encoding you might think of is an array of 32-bit integers.  In this
representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code
   points are less than 127, or less than 255, so a lot of space is occupied
   by zero bytes.  The above string takes 24 bytes compared to the 6 bytes
   needed for an ASCII representation.  Increased RAM usage doesn't matter too
   much (desktop computers have megabytes of RAM, and strings aren't usually
   that large), but expanding our usage of disk and network bandwidth by a
   factor of 4 is intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a
   new family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other encodings
that are more efficient and convenient.
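
To make the space cost in problem 2 concrete, the same 24-byte layout can be
built with the standard :mod:`struct` module — a sketch for illustration only,
where ``'<6I'`` packs six unsigned 32-bit integers in one particular
(little-endian) byte order of the several mentioned in problem 1::

    >>> import struct
    >>> data = struct.pack('<6I', *[ord(c) for c in u'Python'])
    >>> data
    'P\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n\x00\x00\x00'
    >>> len(data), len('Python')
    (24, 6)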
Encodings don't have to handle every possible Unicode character, and most
encodings don't.  For example, Python's default encoding is the 'ascii'
encoding.  The rules for converting a Unicode string into the ASCII encoding
are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be
   represented in this encoding.  (Python raises a :exc:`UnicodeEncodeError`
   exception in this case.)
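
Both rules are easy to check with the ``.encode()`` method that this HOWTO
introduces below (a minimal illustration; the exact exception text can vary
between Python versions)::

    >>> u'abc'.encode('ascii')        # every code point is < 128
    'abc'
    >>> u'caf\xe9'.encode('ascii')    # U+00E9 is >= 128
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
    position 3: ordinal not in range(128)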
Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode code points
0-255 are identical to the Latin-1 values, so converting to this encoding
simply requires converting code points to byte values; if a code point larger
than 255 is encountered, the string can't be encoded into Latin-1.

Encodings don't have to be simple one-to-one mappings like Latin-1.  Consider
IBM's EBCDIC, which was used on IBM mainframes.  Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were
145 through 153.  If you wanted to use EBCDIC as an encoding, you'd probably
use some sort of lookup table to perform the conversion, but this is largely
an internal detail.

UTF-8 is one of the most commonly used encodings.  UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding.  (There's also a UTF-16 encoding, but it's less frequently used than
UTF-8.)  UTF-8 uses the following rules:

1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where
   each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded
   zero bytes.  This avoids byte-ordering issues, and means UTF-8 strings can
   be processed by C functions such as ``strcpy()`` and sent through protocols
   that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of code points are turned into two
   bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.
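
You can watch the three rules at work by encoding characters with steadily
larger code points.  A quick sketch — the sample characters are arbitrary
choices, and ``u'\U00010302'`` is stored as a surrogate pair on a 16-bit
("narrow") interpreter build, though its UTF-8 form is the same either way::

    >>> for ch in (u'a', u'\xe9', u'\u20ac', u'\U00010302'):
    ...     encoded = ch.encode('utf-8')
    ...     print repr(encoded), len(encoded)
    ...
    'a' 1
    '\xc3\xa9' 2
    '\xe2\x82\xac' 3
    '\xf0\x90\x8c\x82' 4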
References
----------

The Unicode Consortium site at <http://www.unicode.org> has character charts,
a glossary, and PDF versions of the Unicode specification.  Be prepared for
some difficult reading.  <http://www.unicode.org/history/> is a chronology of
the origin and development of Unicode.

To help understand the standard, Jukka Korpela has written an introductory
guide to reading the Unicode character tables, available at
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Another good introductory article was written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html>.
If this introduction didn't make things clear to you, you should try reading
this alternate article before continuing.

.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken

Wikipedia entries are often helpful; see the entries for "character encoding"
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.


The Unicode Type
----------------

Unicode strings are expressed as instances of the :class:`unicode` type, one
of Python's repertoire of built-in types.  It derives from an abstract type
called :class:`basestring`, which is also an ancestor of the :class:`str`
type; you can therefore check if a value is a string type with
``isinstance(value, basestring)``.  Under the hood, Python represents Unicode
strings as either 16- or 32-bit integers, depending on how the Python
interpreter was compiled.
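
Both claims are easy to verify interactively.  A small sketch:
:data:`sys.maxunicode` reports 65535 on a 16-bit ("narrow") build and 1114111
on a 32-bit ("wide") build; the value shown below assumes a narrow build::

    >>> isinstance(u'abc', basestring), isinstance('abc', basestring)
    (True, True)
    >>> import sys
    >>> sys.maxunicode
    65535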
The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
errors])``.  All of its arguments should be 8-bit strings.  The first argument
is converted to Unicode using the specified encoding; if you leave off the
``encoding`` argument, the ASCII encoding is used for the conversion, so
characters greater than 127 will be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
    ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument
are 'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add
U+FFFD, 'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of
the Unicode result).  The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
    ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name.  Python 2.4
comes with roughly 100 different encodings; see the Python Library Reference
at :ref:`standard-encodings` for a list.  Some encodings have multiple names;
for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same
encoding.
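
Because the names are synonyms, it doesn't matter which one you pass.  A
one-line sanity check (the byte 0xe9 is 'é' in Latin-1)::

    >>> '\xe9'.decode('latin-1') == '\xe9'.decode('iso_8859_1') == '\xe9'.decode('8859')
    True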
One-character Unicode strings can also be created with the :func:`unichr`
built-in function, which takes integers and returns a Unicode string of length
1 that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the :class:`unicode` type have many of the same methods as the
8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit
strings.  8-bit strings will be converted to Unicode before carrying out the
operation; Python's default ASCII encoding will be used, so characters greater
than 127 will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3:
    ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with Unicode
strings without requiring any changes to the code.  (Input and output code
needs more updating for Unicode; more on this later.)
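
For instance, code that concatenates the two string types keeps working as
long as the 8-bit data really is ASCII — a short illustration of the implicit
conversion just described::

    >>> u'caf' + 'e'            # the 8-bit operand decodes as ASCII
    u'cafe'
    >>> u'caf' + '\xe9'         # non-ASCII bytes raise an exception
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0:
    ordinal not in range(128)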
Another important method is ``.encode([encoding], [errors='strict'])``, which
returns an 8-bit string version of the Unicode string, encoded in the
requested encoding.  The ``errors`` parameter is the same as the parameter of
the ``unicode()`` constructor, with one additional possibility; as well as
'strict', 'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
uses XML's character references.  The following example shows the different
results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
    position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    'ꀀabcd޴'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), len(utf8_version)
    (<type 'str'>, 9)
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True

The low-level routines for registering and accessing the available encodings
are found in the :mod:`codecs` module.  However, the encoding and decoding
functions returned by this module are usually more low-level than is
comfortable, so I'm not going to describe the :mod:`codecs` module here.  If
you need to implement a completely new encoding, you'll need to learn about
the :mod:`codecs` module interfaces, but implementing encodings is a
specialized task that also won't be covered here.  Consult the Python
documentation to learn more about this module.

The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input
and output.
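
As a preview of that section, the usual pattern looks roughly like this — a
sketch only, with a made-up file name and no error handling::

    import codecs

    # Write a Unicode string out as UTF-8 encoded bytes.
    f = codecs.open('unicode.txt', 'w', encoding='utf-8')
    f.write(u'caf\xe9 \u20ac')
    f.close()

    # Reading it back decodes the bytes to a unicode object again.
    f = codecs.open('unicode.txt', 'r', encoding='utf-8')
    print repr(f.read())      # u'caf\xe9 \u20ac'
    f.close()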
Unicode Literals in Python Source Code
--------------------------------------

In Python source code, Unicode literals are written as strings prefixed with
the 'u' or 'U' character: ``u'abcdefghijk'``.  Specific code points can be
written using the ``\u`` escape sequence, which is followed by four hex digits
giving the code point.  The ``\U`` escape sequence is similar, but expects 8
hex digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit strings,
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
arbitrary code point.  Octal escapes can go up to U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\u8000"
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small
doses, but becomes an annoyance if you're using many accented characters, as
you would in a program with messages in French or some other accent-using
language.  You can also assemble strings using the :func:`unichr` built-in
function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used.
This is done by including a special comment
as either the first or second line of the source file::

    #!/usr/bin/env python

***The diff for this file has been truncated for email.***
==============================================================================
Revision: e94aea17f93c
Author: Akihiro Uchida <uchid****@ike-d*****>
Date: Fri May 20 00:15:24 2011
Log: merge
http://code.google.com/p/python-doc-ja/source/detail?r=e94aea17f93c