forked from diveintomark/diveintopython3
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstrings.html
executable file
·464 lines (365 loc) · 46.1 KB
/
strings.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
<!DOCTYPE html>
<meta charset=utf-8>
<title>Strings - Dive Into Python 3</title>
<!--[if IE]><script src=j/html5.js></script><![endif]-->
<link rel=stylesheet href=dip3.css>
<style>
body{counter-reset:h1 4}
</style>
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
<link rel=stylesheet media=print href=print.css>
<meta name=viewport content='initial-scale=1.0'>
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input type=search name=q size=25 placeholder="powered by Google™"> <input type=submit name=sa value=Search></div></form>
<p>You are here: <a href=index.html>Home</a> <span class=u>‣</span> <a href=table-of-contents.html#strings>Dive Into Python 3</a> <span class=u>‣</span>
<p id=level>Difficulty level: <span class=u title=intermediate>♦♦♦♢♢</span>
<h1>Strings</h1>
<blockquote class=q>
<p><span class=u>❝</span> I’m telling you this ’cause you’re one of my friends.<br>
My alphabet starts where your alphabet ends! <span class=u>❞</span><br>— Dr. Seuss, On Beyond Zebra!
</blockquote>
<p id=toc>
<h2 id=boring-stuff>Some Boring Stuff You Need To Understand Before You Can Dive In</h2>
<p class=f>Few people think about it, but text is incredibly complicated. Start with the alphabet. The people of <a href=http://en.wikipedia.org/wiki/Bougainville_Province>Bougainville</a> have the smallest alphabet in the world; their <a href=http://en.wikipedia.org/wiki/Rotokas_alphabet>Rotokas alphabet</a> is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of <i class=baa>!@#$%&</i> punctuation marks.
<p>When you talk about “text,” you’re probably thinking of “characters and symbols on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.
<p>In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish.
<aside>Everything you thought you knew about strings is wrong.</aside>
<p>Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.
<p>There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding using the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the <abbr>ASCII</abbr> encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, <i class=baa>&</i>c.) English has a very simple alphabet, so it can be completely expressed in less than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.
<p>Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks, like the <code>ñ</code> character in Spanish. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with <abbr>ASCII</abbr> in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), <i class=baa>&</i>c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
<p>Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each “character” is represented by a two-byte number from 0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It’s just that the range of numbers is broader, because there are many more characters to represent.
<p>That was mostly OK in a non-networked world, where “text” was something you typed yourself and occasionally printed. There wasn’t much “plain text”. Source code was <abbr>ASCII</abbr>, and everyone else used word processors, which defined their own (non-text) formats that tracked character encoding information along with rich styling, <i class=baa>&</i>c. People read these documents with the same word processing program as the original author, so everything worked, more or less.
<p>Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the globe, being authored on one computer, transmitted through a second computer, and received and displayed by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of “plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable characters. A missing decryption key means garbled text, gibberish, or worse.
<p>Now think about trying to store multiple pieces of text in the same place, like in the same database table that holds all the email you’ve ever received. You still need to store the character encoding alongside each piece of text so you can display it properly. Think that’s hard? Try searching your email database, which means converting between multiple encodings on the fly. Doesn’t that sound fun?
<p>Now think about the possibility of multilingual documents, where characters from several languages are next to each other in the same document. (Hint: programs that tried to do this typically used escape codes to switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means Я; poof, now you’re in Mac Greek mode, so 241 means ώ.) And of course you’ll want to search <em>those</em> documents, too.
<p>Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such thing as “plain text.”
<p class=a>⁂
<h2 id=one-ring-to-rule-them-all>Unicode</h2>
<p><i>Enter <dfn>Unicode</dfn>.</i>
<p>Unicode is a system designed to represent <em>every</em> character from <em>every</em> language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. <code>U+0041</code> is always <code>'A'</code>, even if your language doesn’t have an <code>'A'</code> in it.
<p>On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character<span class=u title='interrobang!'>‽</span> That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
<p>There is a Unicode encoding that uses four bytes per character. It’s called UTF-32, because 32 bits = 4 bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and represents the character with that same number. This has some advantages, the most important being that you can find the <var>Nth</var> character of a string in constant time, because the <var>Nth</var> character starts at the <var>4×Nth</var> byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every freaking character.
<p>Even though there are a lot of Unicode characters, it turns out that most people will never use anything beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage: UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store instead of four bytes (except for the ones that don’t). And you can still easily find the <var>Nth</var> character of a string in constant time, if you assume that the string doesn’t include any astral plane characters, which is a good assumption right up until the moment that it’s not.
<p>But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store individual bytes in different ways. That means that the character <code>U+4E2D</code> could be stored in UTF-16 as either <code>4E 2D</code> or <code>2D 4E</code>, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even more possible byte orderings.) As long as your documents never leave your computer, you’re safe — different applications on the same computer will all use the same byte order. But the minute you want to transfer documents between systems, perhaps on a world wide web of some sort, you’re going to need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of knowing whether the two-byte sequence <code>4E 2D</code> means <code>U+4E2D</code> or <code>U+2D4E</code>.
<p>To solve <em>this</em> problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special non-printable character that you can include at the beginning of your document to indicate what order your bytes are in. For UTF-16, the Byte Order Mark is <code>U+FEFF</code>. If you receive a UTF-16 document that starts with the bytes <code>FF FE</code>, you know the byte ordering is one way; if it starts with <code>FE FF</code>, you know the byte ordering is reversed.
<p>Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of <abbr>ASCII</abbr> characters. If you think about it, even a Chinese web page is going to contain a lot of <abbr>ASCII</abbr> characters — all the elements and attributes surrounding the printable Chinese characters. Being able to find the <var>Nth</var> character in constant time is nice, but there’s still the nagging problem of those astral plane characters, which mean that you can’t <em>guarantee</em> that every character is exactly two bytes, so you can’t <em>really</em> find the <var>Nth</var> character in constant time unless you maintain a separate index. And boy, there sure is a lot of <abbr>ASCII</abbr> text in the world…
<p>Other people pondered these questions, and they came up with a solution:
<p class=xxxl>UTF-8
<p>UTF-8 is a <em>variable-length</em> encoding system for Unicode. That is, different characters take up a different number of bytes. For <abbr>ASCII</abbr> characters (A-Z, <i class=baa>&</i>c.) <abbr>UTF-8</abbr> uses just one byte per character. In fact, it uses the exact same bytes; the first 128 characters (0–127) in <abbr>UTF-8</abbr> are indistinguishable from <abbr>ASCII</abbr>. “Extended Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end up taking three bytes. The rarely-used “astral plane” characters take four bytes.
<p>Disadvantages: because each character can take a different number of bytes, finding the <var>Nth</var> character is an O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is bit-twiddling involved to encode characters into bytes and decode bytes into characters.
<p>Advantages: super-efficient encoding of common <abbr>ASCII</abbr> characters. No worse than UTF-16 for extended Latin characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in <abbr>UTF-8</abbr> uses the exact same stream of bytes on any computer.
<p class=a>⁂
<h2 id=divingin>Diving In</h2>
<p>In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in <abbr>UTF-8</abbr>, or a Python string encoded as CP-1252. “Is this string <abbr>UTF-8</abbr>?” is an invalid question. <abbr>UTF-8</abbr> is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>s = '深入 Python'</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>len(s)</kbd> <span class=u>②</span></a>
<samp class=pp>9</samp>
<a><samp class=p>>>> </samp><kbd class=pp>s[0]</kbd> <span class=u>③</span></a>
<samp class=pp>'深'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>s + ' 3'</kbd> <span class=u>④</span></a>
<samp class=pp>'深入 Python 3'</samp></pre>
<ol>
<li>To create a string, enclose it in quotes. Python strings can be defined with either single quotes (<code>'</code>) or double quotes (<code>"</code>).<!--"-->
<li>The built-in <code><dfn>len</dfn>()</code> function returns the length of the string, <i>i.e.</i> the number of characters. This is the same function you use to <a href=native-datatypes.html#extendinglists>find the length of a list, tuple, set, or dictionary</a>. A string is like a tuple of characters.
<li>Just like getting individual items out of a list, you can get individual characters out of a string using index notation.
<li>Just like lists, you can <dfn>concatenate</dfn> strings using the <code>+</code> operator.
</ol>
<p class=a>⁂
<h2 id=formatting-strings>Formatting Strings</h2>
<aside>Strings can be defined with either single or double quotes.</aside>
<p>Let’s take another look at <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<p class=d>[<a href=examples/humansize.py>download <code>humansize.py</code></a>]
<pre class=pp><code><a>SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'], <span class=u>①</span></a>
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
def approximate_size(size, a_kilobyte_is_1024_bytes=True):
<a> '''Convert a file size to human-readable form. <span class=u>②</span></a>
Keyword arguments:
size -- file size in bytes
a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
if False, use multiples of 1000
Returns: string
<a> ''' <span class=u>③</span></a>
if size < 0:
<a> raise ValueError('number must be non-negative') <span class=u>④</span></a>
multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
for suffix in SUFFIXES[multiple]:
size /= multiple
if size < multiple:
<a> return '{0:.1f} {1}'.format(size, suffix) <span class=u>⑤</span></a>
raise ValueError('number too large')</code></pre>
<ol>
<li><code>'KB'</code>, <code>'MB'</code>, <code>'GB'</code>… those are each strings.
<li>Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start and end the string.
<li>These three-in-a-row quotes end the docstring.
<li>There’s another string, being passed to the exception as a human-readable error message.
<li>There’s a… whoa, what the heck is that?
</ol>
<p>Python 3 supports <dfn>formatting</dfn> values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>username = 'mark'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>password = 'PapayaWhip'</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>"{0}'s password is {1}".format(username, password)</kbd> <span class=u>②</span></a>
<samp class=pp>"mark's password is PapayaWhip"</samp></pre>
<ol>
<li>No, my password is not really <kbd>PapayaWhip</kbd>.
<li>There’s a lot going on here. First, that’s a method call on a string literal. <em>Strings are objects</em>, and objects have methods. Second, the whole expression evaluates to a string. Third, <code>{0}</code> and <code>{1}</code> are <i>replacement fields</i>, which are replaced by the arguments passed to the <code><dfn>format</dfn>()</code> method.
</ol>
<h3 id=compound-field-names>Compound Field Names</h3>
<p>The previous example shows the simplest case, where the replacement fields are simply integers. Integer replacement fields are treated as positional indices into the argument list of the <code>format()</code> method. That means that <code>{0}</code> is replaced by the first argument (<var>username</var> in this case), <code>{1}</code> is replaced by the second argument (<var>password</var>), <i class=baa>&</i>c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But replacement fields are much more powerful than that.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>import humansize</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>si_suffixes = humansize.SUFFIXES[1000]</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>si_suffixes</kbd>
<samp class=pp>['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']</samp>
<a><samp class=p>>>> </samp><kbd class=pp>'1000{0[0]} = 1{0[1]}'.format(si_suffixes)</kbd> <span class=u>②</span></a>
<samp class=pp>'1000KB = 1MB'</samp>
</pre>
<ol>
<li>Rather than calling any function in the <code>humansize</code> module, you’re just grabbing one of the data structures it defines: the list of “SI” (powers-of-1000) suffixes.
<li>This looks complicated, but it’s not. <code>{0}</code> would refer to the first argument passed to the <code>format()</code> method, <var>si_suffixes</var>. But <var>si_suffixes</var> is a list. So <code>{0[0]}</code> refers to the first item of the list which is the first argument passed to the <code>format()</code> method: <code>'KB'</code>. Meanwhile, <code>{0[1]}</code> refers to the second item of the same list: <code>'MB'</code>. Everything outside the curly braces — including <code>1000</code>, the equals sign, and the spaces — is untouched. The final result is the string <code>'1000KB = 1MB'</code>.
</ol>
<aside>{0} is replaced by the 1<sup>st</sup> format() argument. {1} is replaced by the 2<sup>nd</sup>.</aside>
<p>What this example shows is that <em>format specifiers can access items and properties of data structures using (almost) Python syntax</em>. This is called <i>compound field names</i>. The following compound field names “just work”:
<ul>
<li>Passing a list, and accessing an item of the list by index (as in the previous example)
<li>Passing a dictionary, and accessing a value of the dictionary by key
<li>Passing a module, and accessing its variables and functions by name
<li>Passing a class instance, and accessing its properties and methods by name
<li><em>Any combination of the above</em>
</ul>
<p>Just to blow your mind, here’s an example that combines all of the above:
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>import humansize</kbd>
<samp class=p>>>> </samp><kbd class=pp>import sys</kbd>
<samp class=p>>>> </samp><kbd class=pp>'1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}'.format(sys)</kbd>
<samp class=pp>'1MB = 1000KB'</samp></pre>
<p>Here’s how it works:
<ul>
<li>The <code>sys</code> module holds information about the currently running Python instance. Since you just imported it, you can pass the <code>sys</code> module itself as an argument to the <code>format()</code> method. So the replacement field <code>{0}</code> refers to the <code>sys</code> module.
<li><code>sys.modules</code> is a dictionary of all the modules that have been imported in this Python instance. The keys are the module names as strings; the values are the module objects themselves. So the replacement field <code>{0.modules}</code> refers to the dictionary of imported modules.
<li><code>sys.modules['humansize']</code> is the <code>humansize</code> module which you just imported. The replacement field <code>{0.modules[humansize]}</code> refers to the <code>humansize</code> module. Note the slight difference in syntax here. In real Python code, the keys of the <code>sys.modules</code> dictionary are strings; to refer to them, you need to put quotes around the module name (<i>e.g.</i> <code>'humansize'</code>). But within a replacement field, you skip the quotes around the dictionary key name (<i>e.g.</i> <code>humansize</code>). To quote <a href=http://www.python.org/dev/peps/pep-3101/>PEP 3101: Advanced String Formatting</a>, “The rules for parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is used as a string.”
<li><code>sys.modules['humansize'].SUFFIXES</code> is the dictionary defined at the top of the <code>humansize</code> module. The replacement field <code>{0.modules[humansize].SUFFIXES}</code> refers to that dictionary.
<li><code>sys.modules['humansize'].SUFFIXES[1000]</code> is a list of <abbr>SI</abbr> suffixes: <code>['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']</code>. So the replacement field <code>{0.modules[humansize].SUFFIXES[1000]}</code> refers to that list.
<li><code>sys.modules['humansize'].SUFFIXES[1000][0]</code> is the first item of the list of <abbr>SI</abbr> suffixes: <code>'KB'</code>. Therefore, the complete replacement field <code>{0.modules[humansize].SUFFIXES[1000][0]}</code> is replaced by the two-character string <code>KB</code>.
</ul>
<h3 id=format-specifiers>Format Specifiers</h3>
<p>But wait! There’s more! Let’s take another look at that strange line of code from <code>humansize.py</code>:
<pre class='nd pp'><code>if size < multiple:
return '{0:.1f} {1}'.format(size, suffix)</code></pre>
<p><code>{1}</code> is replaced with the second argument passed to the <code>format()</code> method, which is <var>suffix</var>. But what is <code>{0:.1f}</code>? It’s two things: <code>{0}</code>, which you recognize, and <code>:.1f</code>, which you don’t. The second half (including and after the colon) defines the <i>format specifier</i>, which further refines how the replaced variable should be formatted.
<blockquote class='note compare clang'>
<p><span class=u>☞</span>Format specifiers allow you to munge the replacement text in a variety of useful ways, like the <code><dfn>printf</dfn>()</code> function in C. You can add zero- or space-padding, align strings, control decimal precision, and even convert numbers to hexadecimal.
</blockquote>
<p>Within a replacement field, a colon (<code>:</code>) marks the start of the format specifier. The format specifier “<code>.1</code>” means “round to the nearest tenth” (<i>i.e.</i> display only one digit after the decimal point). The format specifier “<code>f</code>” means “fixed-point number” (as opposed to exponential notation or some other decimal representation). Thus, given a <var>size</var> of <code>698.24</code> and <var>suffix</var> of <code>'GB'</code>, the formatted string would be <code>'698.2 GB'</code>, because <code>698.24</code> gets rounded to one decimal place, then the suffix is appended after the number.
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>'{0:.1f} {1}'.format(698.24, 'GB')</kbd>
<samp class=pp>'698.2 GB'</samp></pre>
<p>For all the gory details on format specifiers, consult the <a href=http://docs.python.org/3.1/library/string.html#format-specification-mini-language>Format Specification Mini-Language</a> in the official Python documentation.
<p class=a>⁂
<h2 id=common-string-methods>Other Common String Methods</h2>
<p>Besides formatting, strings can do a number of other useful tricks.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>s = '''Finished files are the re-</kbd> <span class=u>①</span></a>
<samp class=p>... </samp><kbd>sult of years of scientif-</kbd>
<samp class=p>... </samp><kbd>ic study combined with the</kbd>
<samp class=p>... </samp><kbd>experience of years.'''</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>s.splitlines()</kbd> <span class=u>②</span></a>
<samp class=pp>['Finished files are the re-',
'sult of years of scientif-',
'ic study combined with the',
'experience of years.']</samp>
<a><samp class=p>>>> </samp><kbd class=pp>print(s.lower())</kbd> <span class=u>③</span></a>
<samp>finished files are the re-
sult of years of scientif-
ic study combined with the
experience of years.</samp>
<a><samp class=p>>>> </samp><kbd class=pp>s.lower().count('f')</kbd> <span class=u>④</span></a>
<samp class=pp>6</samp></pre>
<ol>
<li>You can input <dfn>multiline</dfn> strings in the Python interactive shell. Once you start a multiline string with triple quotation marks, just hit <kbd>ENTER</kbd> and the interactive shell will prompt you to continue the string. Typing the closing triple quotation marks ends the string, and the next <kbd>ENTER</kbd> will execute the command (in this case, assigning the string to <var>s</var>).
<li>The <code><dfn>splitlines</dfn>()</code> method takes one multiline string and returns a list of strings, one for each line of the original. Note that the carriage returns at the end of each line are not included.
<li>The <code>lower()</code> method converts the entire string to lowercase. (Similarly, the <code>upper()</code> method converts a string to uppercase.)
<li>The <code>count()</code> method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence!
</ol>
<p>Here’s another common case. Let’s say you have a list of key-value pairs in the form <code><var>key1</var>=<var>value1</var>&<var>key2</var>=<var>value2</var></code>, and you want to split them up and make a dictionary of the form <code>{key1: value1, key2: value2}</code>.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>query = 'user=pilgrim&database=master&password=PapayaWhip'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_list = query.split('&')</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_list</kbd>
<samp class=pp>['user=pilgrim', 'database=master', 'password=PapayaWhip']</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_list_of_lists = [v.split('=', 1) for v in a_list if '=' in v]</kbd> <span class=u>②</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_list_of_lists</kbd>
<samp class=pp>[['user', 'pilgrim'], ['database', 'master'], ['password', 'PapayaWhip']]</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_dict = dict(a_list_of_lists)</kbd> <span class=u>③</span></a>
<samp class=p>>>> </samp><kbd class=pp>a_dict</kbd>
<samp class=pp>{'password': 'PapayaWhip', 'user': 'pilgrim', 'database': 'master'}</samp></pre>
<ol>
<li>The <code><dfn>split</dfn>()</code> string method has one required argument, a delimiter. The method splits a string into a list of strings based on the delimiter. Here, the delimiter is an ampersand character, but it could be anything.
<li>Now we have a list of strings, each with a key, followed by an equals sign, followed by a value. We can use a <a href=comprehensions.html#listcomprehension>list comprehension</a> to iterate over the entire list and split each string into two strings based on the first equals sign. The optional second argument to the <code>split()</code> method is the number of times you want to split. <code>1</code> means “only split once,” so the <code>split()</code> method will return a two-item list. (In theory, a value could contain an equals sign too. If you just used <code>'key=value=foo'.split('=')</code>, you would end up with a three-item list <code>['key', 'value', 'foo']</code>.)
<li>Finally, Python can turn that list-of-lists into a dictionary simply by passing it to the <code>dict()</code> function.
</ol>
<blockquote class=note>
<p><span class=u>☞</span>The previous example looks a lot like parsing query parameters in a <abbr>URL</abbr>, but real-life <abbr>URL</abbr> parsing is actually more complicated than this. If you’re dealing with <abbr>URL</abbr> query parameters, you’re better off using the <a href=http://docs.python.org/3.1/library/urllib.parse.html#urllib.parse.parse_qs><code>urllib.parse.parse_qs()</code></a> function, which handles some non-obvious edge cases.
</blockquote>
<h3 id=slicingstrings>Slicing A String</h3>
<p>Once you’ve defined a string, you can get any part of it as a new string. This is called <i>slicing</i> the string. Slicing strings works exactly the same as <a href=native-datatypes.html#slicinglists>slicing lists</a>, which makes sense, because strings are just sequences of characters.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>a_string = 'My alphabet starts where your alphabet ends.'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>a_string[3:11]</kbd> <span class=u>①</span></a>
<samp class=pp>'alphabet'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_string[3:-3]</kbd> <span class=u>②</span></a>
<samp class=pp>'alphabet starts where your alphabet en'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_string[0:2]</kbd> <span class=u>③</span></a>
<samp class=pp>'My'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_string[:18]</kbd> <span class=u>④</span></a>
<samp class=pp>'My alphabet starts'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>a_string[18:]</kbd> <span class=u>⑤</span></a>
<samp class=pp>' where your alphabet ends.'</samp></pre>
<ol>
<li>You can get a part of a string, called a “slice”, by specifying two indices. The return value is a new string containing all the characters of the string, in order, starting with the first slice index.
<li>Like slicing lists, you can use negative indices to slice strings.
<li>Strings are zero-based, so <code>a_string[0:2]</code> returns the first two items of the string, starting at <code>a_string[0]</code>, up to but not including <code>a_string[2]</code>.
<li>If the left slice index is 0, you can leave it out, and 0 is implied. So <code>a_string[:18]</code> is the same as <code>a_string[0:18]</code>, because the starting 0 is implied.
<li>Similarly, if the right slice index is the length of the string, you can leave it out. So <code>a_string[18:]</code> is the same as <code>a_string[18:44]</code>, because this string has 44 characters. There is a pleasing symmetry here. In this 44-character string, <code>a_string[:18]</code> returns the first 18 characters, and <code>a_string[18:]</code> returns everything but the first 18 characters. In fact, <code>a_string[:<var>n</var>]</code> will always return the first <var>n</var> characters, and <code>a_string[<var>n</var>:]</code> will return the rest, regardless of the length of the string.
</ol>
<p class=a>⁂
<h2 id=byte-arrays>Strings vs. Bytes</h2>
<p><dfn>Bytes</dfn> are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a <i>string</i>. An immutable sequence of numbers-between-0-and-255 is called a <i>bytes</i> object.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>by = b'abcd\x65'</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>by</kbd>
<samp class=pp>b'abcde'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>type(by)</kbd> <span class=u>②</span></a>
<samp class=pp><class 'bytes'></samp>
<a><samp class=p>>>> </samp><kbd class=pp>len(by)</kbd> <span class=u>③</span></a>
<samp class=pp>5</samp>
<a><samp class=p>>>> </samp><kbd class=pp>by += b'\xff'</kbd> <span class=u>④</span></a>
<samp class=p>>>> </samp><kbd class=pp>by</kbd>
<samp class=pp>b'abcde\xff'</samp>
<a><samp class=p>>>> </samp><kbd class=pp>len(by)</kbd> <span class=u>⑤</span></a>
<samp class=pp>6</samp>
<a><samp class=p>>>> </samp><kbd class=pp>by[0]</kbd> <span class=u>⑥</span></a>
<samp class=pp>97</samp>
<a><samp class=p>>>> </samp><kbd class=pp>by[0] = 102</kbd> <span class=u>⑦</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment</samp></pre>
<ol>
<li>To define a <code>bytes</code> object, use the <code>b''</code> “<dfn>byte</dfn> literal” syntax. Each byte within the byte literal can be an <abbr>ASCII</abbr> character or an encoded hexadecimal number from <code>\x00</code> to <code>\xff</code> (0–255).
<li>The type of a <code>bytes</code> object is <code>bytes</code>.
<li>Just like lists and strings, you can get the length of a <code>bytes</code> object with the built-in <code>len()</code> function.
<li>Just like lists and strings, you can use the <code>+</code> operator to concatenate <code>bytes</code> objects. The result is a new <code>bytes</code> object.
<li>Concatenating a 5-byte <code>bytes</code> object and a 1-byte <code>bytes</code> object gives you a 6-byte <code>bytes</code> object.
<li>Just like lists and strings, you can use index notation to get individual bytes in a <code>bytes</code> object. The items of a string are strings; the items of a <code>bytes</code> object are integers. Specifically, integers between 0–255.
<li>A <code>bytes</code> object is immutable; you can not assign individual bytes. If you need to change individual bytes, you can either use <a href=#slicingstrings>string slicing</a> and concatenation operators (which work the same as strings), or you can convert the <code>bytes</code> object into a <code>bytearray</code> object.
</ol>
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>by = b'abcd\x65'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>barr = bytearray(by)</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>barr</kbd>
<samp class=pp>bytearray(b'abcde')</samp>
<a><samp class=p>>>> </samp><kbd class=pp>len(barr)</kbd> <span class=u>②</span></a>
<samp class=pp>5</samp>
<a><samp class=p>>>> </samp><kbd class=pp>barr[0] = 102</kbd> <span class=u>③</span></a>
<samp class=p>>>> </samp><kbd class=pp>barr</kbd>
<samp class=pp>bytearray(b'fbcde')</samp></pre>
<ol>
<li>To convert a <code>bytes</code> object into a mutable <code>bytearray</code> object, use the built-in <code>bytearray()</code> function.
<li>All the methods and operations you can do on a <code>bytes</code> object, you can do on a <code>bytearray</code> object too.
<li>The one difference is that, with the <code>bytearray</code> object, you can assign individual bytes using index notation. The assigned value must be an integer between 0–255.
</ol>
<p>The one thing you <em>can never do</em> is mix bytes and strings.
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>by = b'd'</kbd>
<samp class=p>>>> </samp><kbd class=pp>s = 'abcde'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>by + s</kbd> <span class=u>①</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str</samp>
<a><samp class=p>>>> </samp><kbd class=pp>s.count(by)</kbd> <span class=u>②</span></a>
<samp class=traceback>Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly</samp>
<a><samp class=p>>>> </samp><kbd class=pp>s.count(by.decode('ascii'))</kbd> <span class=u>③</span></a>
<samp class=pp>1</samp></pre>
<ol>
<li>You can’t concatenate bytes and strings. They are two different data types.
<li>You can’t count the occurrences of bytes in a string, because there are no bytes in a string. A string is a sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after decoding this sequence of bytes in a particular character encoding”? Well then, you’ll need to say that explicitly. Python 3 won’t <dfn>implicitly</dfn> convert bytes to strings or strings to bytes.
<li>By an amazing coincidence, this line of code says “count the occurrences of the string that you would get after decoding this sequence of bytes in this particular character encoding.”
</ol>
<p>And here is the link between strings and bytes: <code>bytes</code> objects have a <code><dfn>decode</dfn>()</code> method that takes a character encoding and returns a string, and strings have an <code><dfn>encode</dfn>()</code> method that takes a character encoding and returns a <code>bytes</code> object. In the previous example, the decoding was relatively straightforward — converting a sequence of bytes in the <abbr>ASCII</abbr> encoding into a string of characters. But the same process works with any encoding that supports the characters of the string — even legacy (non-Unicode) encodings.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd class=pp>a_string = '深入 Python'</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>len(a_string)</kbd>
<samp class=pp>9</samp>
<a><samp class=p>>>> </samp><kbd class=pp>by = a_string.encode('utf-8')</kbd> <span class=u>②</span></a>
<samp class=p>>>> </samp><kbd class=pp>by</kbd>
<samp class=pp>b'\xe6\xb7\xb1\xe5\x85\xa5 Python'</samp>
<samp class=p>>>> </samp><kbd class=pp>len(by)</kbd>
<samp class=pp>13</samp>
<a><samp class=p>>>> </samp><kbd class=pp>by = a_string.encode('gb18030')</kbd> <span class=u>③</span></a>
<samp class=p>>>> </samp><kbd class=pp>by</kbd>
<samp class=pp>b'\xc9\xee\xc8\xeb Python'</samp>
<samp class=p>>>> </samp><kbd class=pp>len(by)</kbd>
<samp class=pp>11</samp>
<a><samp class=p>>>> </samp><kbd class=pp>by = a_string.encode('big5')</kbd> <span class=u>④</span></a>
<samp class=p>>>> </samp><kbd class=pp>by</kbd>
<samp class=pp>b'\xb2`\xa4J Python'</samp>
<samp class=p>>>> </samp><kbd class=pp>len(by)</kbd>
<samp class=pp>11</samp>
<a><samp class=p>>>> </samp><kbd class=pp>roundtrip = by.decode('big5')</kbd> <span class=u>⑤</span></a>
<samp class=p>>>> </samp><kbd class=pp>roundtrip</kbd>
<samp class=pp>'深入 Python'</samp>
<samp class=p>>>> </samp><kbd class=pp>a_string == roundtrip</kbd>
<samp class=pp>True</samp></pre>
<ol>
<li>This is a string. It has nine characters.
<li>This is a <code>bytes</code> object. It has 13 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in <abbr>UTF-8</abbr>.
<li>This is a <code>bytes</code> object. It has 11 bytes. It is the sequence of bytes you get when you take <var>a_string</var> and encode it in <a href=http://en.wikipedia.org/wiki/GB_18030>GB18030</a>.
<li>This is a <code>bytes</code> object. It has 11 bytes. It is an <em>entirely different sequence of bytes</em> that you get when you take <var>a_string</var> and encode it in <a href=http://en.wikipedia.org/wiki/Big5>Big5</a>.
<li>This is a string. It has nine characters. It is the sequence of characters you get when you take <var>by</var> and decode it using the Big5 encoding algorithm. It is identical to the original string.
</ol>
<p class=a>⁂
<h2 id=py-encoding>Postscript: Character Encoding Of Python Source Code</h2>
<p>Python 3 assumes that your source code — <i>i.e.</i> each <code>.py</code> file — is encoded in <abbr>UTF-8</abbr>.
<blockquote class='note compare python2'>
<p><span class=u>☞</span>In Python 2, the <dfn>default</dfn> encoding for <code>.py</code> files was <abbr>ASCII</abbr>. In Python 3, <a href=http://www.python.org/dev/peps/pep-3120/>the default encoding is <abbr>UTF-8</abbr></a>.
</blockquote>
<p>If you would like to use a different encoding within your Python code, you can put an encoding declaration on the first line of each file. This declaration defines a <code>.py</code> file to be windows-1252:
<pre class='nd pp'><code># -*- coding: windows-1252 -*-</code></pre>
<p>Technically, the character encoding override can also be on the second line, if the first line is a <abbr>UNIX</abbr>-like hash-bang command.
<pre class='nd pp'><code>#!/usr/bin/python3
# -*- coding: windows-1252 -*-</code></pre>
<p>For more information, consult <a href=http://www.python.org/dev/peps/pep-0263/><abbr>PEP</abbr> 263: Defining Python Source Code Encodings</a>.
<p class=a>⁂
<h2 id=furtherreading>Further Reading</h2>
<p>On Unicode in Python:
<ul>
<li><a href=http://docs.python.org/3.1/howto/unicode.html>Python Unicode HOWTO</a>
<li><a href=http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit>What’s New In Python 3: Text vs. Data Instead Of Unicode vs. 8-bit</a>
<li><a href=http://www.python.org/dev/peps/pep-0261/><abbr>PEP 261</abbr></a> explains how Python handles astral characters outside of the Basic Multilingual Plane (<i>i.e.</i> characters whose ordinal value is greater than 65535)
</ul>
<p>On Unicode in general:
<ul>
<li><a href=http://www.joelonsoftware.com/articles/Unicode.html>The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a>
<li><a href=http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode>On the Goodness of Unicode</a>
<li><a href=http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings>On Character Strings</a>
<li><a href=http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF>Characters vs. Bytes</a>
</ul>
<p>On character encoding in other formats:
<ul>
<li><a href=http://feedparser.org/docs/character-encoding.html>Character encoding in XML</a>
<li><a href=http://blog.whatwg.org/the-road-to-html-5-character-encoding>Character encoding in HTML</a>
</ul>
<p>On strings and string formatting:
<ul>
<li><a href=http://docs.python.org/3.1/library/string.html><code>string</code> — Common string operations</a>
<li><a href=http://docs.python.org/3.1/library/string.html#formatstrings>Format String Syntax</a>
<li><a href=http://docs.python.org/3.1/library/string.html#format-specification-mini-language>Format Specification Mini-Language</a>
<li><a href=http://www.python.org/dev/peps/pep-3101/><abbr>PEP</abbr> 3101: Advanced String Formatting</a>
</ul>
<p class=v><a href=comprehensions.html rel=prev title='back to “Comprehensions”'><span class=u>☜</span></a> <a href=regular-expressions.html rel=next title='onward to “Regular Expressions”'><span class=u>☞</span></a>
<p class=c>© 2001–11 <a href=about.html>Mark Pilgrim</a>
<script src=j/jquery.js></script>
<script src=j/prettify.js></script>
<script src=j/dip3.js></script>