Commit fd8f7c8
Grinnz authored and khwilliamson committed
perlrun: add caution that the -C flag does not validate nor produce UTF-8
1 parent 1d7f1a4 commit fd8f7c8

1 file changed: +22 -10 lines changed

pod/perlrun.pod

Lines changed: 22 additions & 10 deletions
@@ -279,19 +279,31 @@ X<-C>
 
 The B<-C> flag controls some of the Perl Unicode features.
 
+B<CAUTION:> As with the L<C<:utf8> PerlIO layer|PerlIO/:utf8>, none of
+the features enabled by this flag or the equivalent C<PERL_UNICODE>
+environment variable validate that input is valid UTF-8, nor guarantee
+to produce valid UTF-8. Instead it will assume input is provided in
+Perl's internal upgraded byte encoding, and provide output in this
+encoding, which is a superset of UTF-8 that can encode any character
+allowed in Perl strings. (On EBCDIC systems, it is a superset of
+UTF-EBCDIC instead.) This can result in broken Perl strings or output
+bytes which are not valid in UTF-8. This internal encoding will be
+referred to as C<utf8> below to differentiate it from a strict UTF-8
+encoding format.
+
 As of 5.8.1, the B<-C> can be followed either by a number or a list
 of option letters. The letters, their numeric values, and effects
 are as follows; listing the letters is equal to summing the numbers.
 
-    I     1   STDIN is assumed to be in UTF-8
-    O     2   STDOUT will be in UTF-8
-    E     4   STDERR will be in UTF-8
+    I     1   STDIN is assumed to be in utf8
+    O     2   STDOUT will be in utf8
+    E     4   STDERR will be in utf8
     S     7   I + O + E
-    i     8   UTF-8 is the default PerlIO layer for input streams
-    o    16   UTF-8 is the default PerlIO layer for output streams
+    i     8   :utf8 is the default PerlIO layer for input streams
+    o    16   :utf8 is the default PerlIO layer for output streams
     D    24   i + o
     A    32   the @ARGV elements are expected to be strings encoded
-              in UTF-8
+              in utf8
     L    64   normally the "IOEioA" are unconditional, the L makes
               them conditional on the locale environment variables
               (the LC_ALL, LC_CTYPE, and LANG, in the order of
@@ -307,22 +319,22 @@ perl.h gives W/128 as PERL_UNICODE_WIDESYSCALLS "/* for Sarathy */"
 perltodo mentions Unicode in %ENV and filenames. I guess that these will be
 options e and f (or F).
 
-For example, B<-COE> and B<-C6> will both turn on UTF-8-ness on both
+For example, B<-COE> and B<-C6> will both turn on utf8-ness on both
 STDOUT and STDERR. Repeating letters is just redundant, not cumulative
 nor toggling.
 
 The C<io> options mean that any subsequent open() (or similar I/O
 operations) in main program scope will have the C<:utf8> PerlIO layer
-implicitly applied to them, in other words, UTF-8 is expected from any
-input stream, and UTF-8 is produced to any output stream. This is just
+implicitly applied to them, in other words, utf8 is expected from any
+input stream, and utf8 is produced to any output stream. This is just
 the default set via L<C<${^OPEN}>|perlvar/${^OPEN}>,
 with explicit layers in open() and with binmode() one can
 manipulate streams as usual. This has no effect on code run in modules.
 
 B<-C> on its own (not followed by any number or option list), or the
 empty string C<""> for the L</PERL_UNICODE> environment variable, has the
 same effect as B<-CSDL>. In other words, the standard I/O handles and
-the default C<open()> layer are UTF-8-fied I<but> only if the locale
+the default C<open()> layer are utf8-fied I<but> only if the locale
 environment variables indicate a UTF-8 locale. This behaviour follows
 the I<implicit> (and problematic) UTF-8 behaviour of Perl 5.8.0.
 (See L<perl581delta/UTF-8 no longer default under UTF-8 locales>.)
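
The changed text says that listing option letters is equal to summing their numbers, and that repeating a letter is redundant rather than cumulative. A minimal sketch of that arithmetic follows. It is written in Python purely for illustration and is not how perl itself parses the switch. The helper name `c_flag_mask` and the dictionary are hypothetical, and only the letters shown in the table above are included.

```python
# Numeric values of the -C option letters, copied from the table in the diff.
# Only the letters visible in this hunk are listed (an assumption of this sketch).
C_FLAG_VALUES = {
    "I": 1, "O": 2, "E": 4, "S": 7,
    "i": 8, "o": 16, "D": 24,
    "A": 32, "L": 64,
}


def c_flag_mask(letters: str) -> int:
    """Combine -C option letters into the equivalent number (hypothetical helper)."""
    mask = 0
    for letter in letters:
        # Bitwise OR rather than addition, so a repeated letter changes nothing.
        mask |= C_FLAG_VALUES[letter]
    return mask


print(c_flag_mask("OE"))   # 6, matching the -COE / -C6 example in the text
print(c_flag_mask("OOE"))  # 6, repeating O is redundant
print(c_flag_mask("SDL"))  # 95, the letters that a bare -C is documented to imply
```

Bitwise OR is used instead of a plain sum because the composite letters (S, D) overlap with the individual ones; a sum would double-count in a list such as "SI".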
