Talk:Character (computing)

Disambig?

Note partial overlap with brief description at Character, and numerous links that point there that perhaps should point here. --Brion 21:41 Aug 19, 2002 (PDT)

I agree. The most prominent link is Character (word) and it should be mentioned in the article. Character (computing) is just an abstraction of it. Mordomo (talk) 13:03, 23 December 2007 (UTC)[reply]
Hold on there. You're replying to a 5+ year old comment, when the character disambig page looked like this and the character (computing) article looked like this. Regardless, you seem to be misunderstanding what Brion was saying. He was saying that in various other articles, there are references to computing-type "characters" that are linking the word character to the general character disambig page, whereas the links really should point to the character (computing) article instead. That may still be true; I haven't checked. Brion said there was "overlap" between the two, but I think he was just not realizing (since it wasn't tagged) that character was a disambig page.
See, character is a disambiguation page, meant only to direct readers to the various "character"-related articles on Wikipedia. Character (computing) is one such article; it describes the kind(s) of characters used in computing and other telecommunication technology. You might think of the disambig page as covering character in the general sense, and character (computing) being a specialization, not abstraction, of it.
The character (word) article appears to actually be a Wiktionary entry masquerading as a Wikipedia article / disambiguation page, itself. Its content is just a summary of several different definitions of the word character, including a character in literature and applications of the word character to what'd be more accurately described with more specific terms (e.g., grapheme). Up until now (I just changed it), it was also somewhat mischaracterized on the current version of the character disambig page. I'm going to nominate it for deletion or merging since it's serving the same purpose as character and the Wiktionary entry.
In any case, there's generally no need to link back to a disambig page from one of the disambiguated articles. So there does not need to be a mention of either character or character (word) in this article. —mjb (talk) 02:40, 24 December 2007 (UTC)[reply]

Text as a Medium

The following was added to the article by an anonymous contributor. It was removed because it strays from the main topic and contains generalizations and inaccuracies in contradiction with the rest of the article. It also contains some info that may be useful to incorporate into the article; the relationship between abstract characters, bytes, and different aspects of glyph renderings might be worthy of inclusion, to the extent that it helps the reader understand the nature of a "character" in computing and machine-based telecom. However, I feel strongly that this article should not be a full tutorial on all things "text" in computers; there is precedent for creating separate articles for these broader topics. — mjb 07:54, 26 September 2005 (UTC)[reply]

Text, as it is represented on a computer, is a linear sequence of bytes that are mapped from bytes (or collections of bytes) onto glyphs.
Sometimes these glyphs, or their numeric byte forms, represent special relationships between the glyphs, such as those used in markup languages. These relationships may be semantic or stylistic.
The linear sequence of bytes is transformed into a tree structure for the purposes of determining both forms of relationships. The linear sequence is broken up into pieces and then annotated with further, more specialized and detailed linear sequences of bytes within this tree that represent the location and style each glyph will be presented in its visual format.
After the positioning and style has been determined, the linear sequences of bytes representing each text fragment are transformed into either vector or raster representations of each character. Sometimes, the positioning, layout and form of the glyphs depend on cursory information about the extent and classification of each glyph. Kerning is one process by which glyphs are combined based on rules regarding each character in relation to other characters to achieve an aesthetically pleasing layout of those characters.
Once the layout and visualization is determined, the glyphs are transformed from a tree populated with annotations and glyphs into a two-dimensional raster image that may be broken up across pages based on higher-level rules regarding the placement of larger combined units of text (such as paragraphs and their visual relationships to each other).
The positioning of glyphs in relation to each other is also defined using an order acceptable in the language that the glyphs are part of. For example, some texts are ordered glyph:left to right, line:top to bottom, page:front to back whereas others may be ordered glyph:top to bottom, line:right to left, page:back to front. Other orderings exist.

Pronunciation of "char"

How do you read "char" in the context of a programming language (such as C/C++, Java or Pascal)?

Does it sound like the verb "char", like "car", or like "care"?

aditsu 08:48, 12 April 2006 (UTC)[reply]

The Jargon File gives all three pronunciations. Most American programmers I've spoken with use the "char" pronunciation, since that's how it's written. "Car" seems to be the rarest pronunciation -- could be because it collides with the name (taken from Lisp) of a primitive operation on linked lists. --FOo 09:09, 12 April 2006 (UTC)[reply]
I always pronounced it like "care", maybe because I'm from the West Coast. 75.15.236.62
I think most people just pronounce it how it looks, regardless of the fact that it's a short form of "character". —Preceding unsigned comment added by Indigochild777 (talkcontribs) 01:32, 12 April 2010 (UTC)[reply]

Char vs. Bit/byte

What is the difference between a char (character) and a byte/bit? I work with some programming and can't figure it out. Is a byte and a character the same length, i.e. does 255 bytes/bits equal 255 characters? 75.15.236.62

Only if your characters are formed from single octets. So this bytes=characters stuff was widely true in the 70s and 80s when ASCII and EBCDIC reigned supreme, but now, as multi-byte character sets such as Unicode are becoming common, you can no longer assume that one character equals one byte.
Atlant 12:53, 27 April 2007 (UTC)[reply]
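A minimal C sketch of this distinction (not from the discussion above; the helper name count_code_points is just illustrative, and the string literal is assumed to be stored as UTF-8): strlen counts chars, i.e. bytes, while the helper counts code points by skipping UTF-8 continuation bytes, so the two totals diverge as soon as a multi-byte character appears.

 #include <stdio.h>
 #include <string.h>

 /* Count code points in a UTF-8 string by skipping continuation bytes
  * (bytes of the form 10xxxxxx). */
 static size_t count_code_points(const char *s)
 {
     size_t n = 0;
     for (; *s; s++)
         if (((unsigned char)*s & 0xC0) != 0x80)
             n++;
     return n;
 }

 int main(void)
 {
     const char *word = "héllo";  /* 'é' occupies two bytes in UTF-8 */
     printf("%zu bytes, %zu characters\n", strlen(word), count_code_points(word));
     /* prints: 6 bytes, 5 characters */
     return 0;
 }

With plain ASCII text the two counts agree, which is why the bytes=characters assumption held for so long.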

What's this called?

I want to know what to call those symbols in computers that mean it can't understand what they mean. An example of this is: ɪ (If you have a more advanced computer, you may see it as an actual character). C Teng (talk) 00:02, 11 March 2008 (UTC)[reply]

Good question. They don't have a name that I'm aware of - they're just placeholders. Dcoetzee 00:09, 11 March 2008 (UTC)[reply]
That particular character, ɪ, which looks like a miniature capital "I", is U+026A: "lax close front unrounded vowel" ... it is a phonetic symbol and refers to what we call the short "i" sound in English. See International Phonetic Alphabet, IPA chart for English, and Help:IPA for more info. IPA characters are used in pronunciation guides in Wikipedia articles, usually via one of the IPA templates.
If you see it as a box or a diamond with a question mark in it, then it is indeed just a placeholder; it means that the font your browser tried to use to show it to you does not contain a glyph for that character, and the browser wasn't able to find a suitable glyph in another font. Some browsers are better at finding missing glyphs than others. —mjb (talk) 01:02, 11 March 2008 (UTC)[reply]
So, these placeholders don't have names? C Teng (talk) 02:23, 14 March 2008 (UTC)[reply]
If the browser substituted a diamond-with-question-mark, that symbol probably was picked by the programmer from one of the Unicode fonts that the browser is supposed to have in any case, so it probably does have a Unicode name like "black diamond with question mark inside". If what you see is a little square with four tiny digits in it, then the browser is not substituting the character, but merely using a made-up font where the glyph for each Unicode character is the character's code point written in that form. In that case the name of the character is still "lax close front unrounded vowel"; it is only its appearance in that font that is rather unconventional (and pretty useless). Hope it helps, --Jorge Stolfi (talk) 13:37, 29 June 2009 (UTC)[reply]

The Unicode definition of character

Unicode is great and, all things considered, fairly well designed. However, its concept of "character" is far from consistent. For example, while it distinguishes aleph-hebrew-letter from aleph-math-symbol, it does not distinguish between X-roman-letter and X-math-symbol. These decisions are not based on logic, but are merely attempts to accommodate the habits of programmers from each language community. Thus, it does distinguish between capital "X" and lowercase "x", because those already had different codes in ASCII and ISO-Latin-1; but it does not distinguish between boldface "X" and italic "X", or between the two common shapes of italic "a", because programmers were used to encoding these attributes separately from the characters. So, please let's not make the mistake of presenting Unicode's definition as the word of the Gods. All the best, --Jorge Stolfi (talk) 13:27, 29 June 2009 (UTC)[reply]

Thinking that some abstract distinction between aleph-hebrew and aleph-math is meaningful is a bit narrow-minded. I do not know how well Unicode handles different levels of abstraction, but a major consideration in multiplying the various possible different meanings of a glyph is whether it would be USEFUL. Note that 'logic' isn't directly involved. IOW, one question the group developing the Unicode standard must have had was: should glyph xxxx be represented as more than one abstract character, or can it be 'translated' unambiguously using just one? To some extent this requires a mapping of each 'character' between all of the 'languages' that use it. (This has to be superimposed over the code pages already (historically) present which needed to be included in the standard.) Unfortunately, Unicode failed to invent a different term for 'abstract character', thus guaranteeing its truncation by some authors to 'character', which was already quite ambiguous/multi-meaninged. I wonder if they considered it, and found that even if they invented a term, they couldn't define it definitively (so as to make it sufficiently useful in a large number of contexts). 72.172.1.40 (talk) 20:24, 25 September 2014 (UTC)[reply]

The UTF-8 Google statistics

The article claimed that "According to statistics collected by Google, UTF-8 is the most common encoding used on web pages". No question about that; however, that statistic only refers to the encoding used for web pages that are delivered by HTTP servers over the net. But I believe that servers can be configured to automatically map national character encodings to UTF-8, for the benefit of foreign browsers; is that correct? In any case, since Unicode/UTF-8 is the de facto (if not official) standard encoding for HTML files, that statistic does not say much about the adoption of Unicode/UTF-8 for other kinds of files. All the best, --Jorge Stolfi (talk) 14:19, 29 June 2009 (UTC)[reply]

"word character"[edit]

Somebody added a page describing a "word character". The link points to a description of regexps that actually describes a "whitespace character" in the same paragraph, and several other character classes nearby. IMHO this has nothing to do with the basic subject of characters, and there certainly is nothing special about "word character"; should we perhaps list every single classification used by any regexp library here? Spitzak (talk) 19:18, 5 July 2011 (UTC)[reply]

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Character (computing). Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 16:59, 19 November 2016 (UTC)[reply]

A Unicode character requires 21 bits?

I thank Spitzak for his prompt response to my request for a source for the claim that "Unicode requires at least 21 bits to store a single code point". However, I fail to find this asserted in the source provided, the Glossary of Unicode Terms. Which entry in the glossary is relevant? Using UTF-16, every code point in the BMP can be encoded in a mere 16 bits. Peter Brown (talk) 22:42, 4 May 2019 (UTC)[reply]

The last Unicode code point is U+10FFFF, which requires 21 bits. I just copied the ref from the first mention of how many characters there are from the Unicode article; I guess a better reference needs to be found and both articles fixed. Spitzak (talk) 00:38, 5 May 2019 (UTC)[reply]
The text that Spitzak recently restored includes the sentence
As there are 1,112,064 (or 1,114,112 including surrogate halves) Unicode code points, more than 20 bits are needed to identify them.
According to the official definition of a code point, it is an integer in the range 0 to 10FFFF₁₆. By "identify them", I assume Spitzak means represent them as bit strings, though of course there are many other ways of representing nonnegative integers, e.g. as decimal digits. Why, though, is there any need to represent every single one of them? The code points in 11 of the 17 Unicode planes, numbered 3 through 13, are not assigned to characters; that is well over half the code points. An encoding that assigned the integers in planes 0–2 (0₁₆–2FFFF₁₆) to characters as does Unicode but the integers 30000₁₆–5FFFF₁₆ to the characters represented in Unicode by code points in planes 14–16 would require only 19 bits for the associated numbers.
On most systems, even 19 bits is too much for a char, so the conclusion stands that no encoding encompassing all the characters represented in Unicode will fit in a char.
Peter Brown (talk) 19:14, 16 May 2019 (UTC)[reply]
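A minimal C sketch of the hypothetical 19-bit scheme described above (purely illustrative; no real encoding works this way, and the function name remap is made up): code points in planes 0–2 keep their values, code points in planes 14–16 are shifted down to 0x30000–0x5FFFF, and everything else is treated as unassigned.

 #include <stdio.h>
 #include <stdint.h>

 #define REMAP_UNASSIGNED 0xFFFFFFFFu   /* sentinel for planes 3-13 */

 /* Hypothetical remapping: every encodable result is at most 0x5FFFF,
  * which fits in 19 bits. */
 static uint32_t remap(uint32_t cp)
 {
     if (cp <= 0x2FFFFu)                     /* planes 0-2: unchanged */
         return cp;
     if (cp >= 0xE0000u && cp <= 0x10FFFFu)  /* planes 14-16: shift down */
         return cp - 0xE0000u + 0x30000u;
     return REMAP_UNASSIGNED;                /* planes 3-13: not encoded */
 }

 int main(void)
 {
     printf("U+0041   -> 0x%05X\n", (unsigned)remap(0x41));      /* 0x00041 */
     printf("U+E0001  -> 0x%05X\n", (unsigned)remap(0xE0001));   /* 0x30001 */
     printf("U+10FFFF -> 0x%05X\n", (unsigned)remap(0x10FFFF));  /* 0x5FFFF */
     return 0;
 }

Even so, 19 bits is more than a char holds on most systems, so the conclusion above is unchanged.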
Really I can't believe how much grief this trivial bit of information takes. First, it is a basic law of the universe that it takes 21 bits to have at least 0x110000 different values. And even if you only use assigned values (meaning the translation from stored value to Unicode code point would be a giant lookup table?) you still need 18 bits. The actual information being communicated here is "A Unicode code point will not fit in a char on most systems". It just seemed nice to provide an approximate number showing just how much it "won't fit" and possibly to show that 16 bits is not sufficient either. If this really bothers you I guess we can just reduce the whole mess to "A Unicode code point will not fit in a char on most systems". Spitzak (talk) 20:14, 16 May 2019 (UTC)[reply]
You're making a category mistake, as does the article. Unicode code points, by the definition cited above and in the article, are integers, abstract mathematical entities. "Fitting" is a spatial concept, not applicable to such entities. An integer may be defined as an equivalence class of ordered pairs of natural numbers—see this discussion. What on earth is it for an equivalence class to fit or not to fit in a char?
Also, whether or not it's nice to provide approximations, they definitely should be labeled as such. (Such labeling, though, is likely to be criticized as {{vague}}).
Peter Brown (talk) 21:37, 16 May 2019 (UTC)[reply]
Not sure what word you propose instead of "fit", but there is no approximation here at all. N bits can be set to 2^N unique patterns and thus store 2^N different values. If you have M values, due to the pigeonhole principle you cannot store all of them as a unique pattern in N bits unless 2^N >= M. If you can think of a better word than "fit" for "can store all possible values as a unique representation in the bits" then please update it, but I am really finding it hard to think of a word that is better than "fit". As log₂(0x110000) = 20.087462841250343, the smallest value of N where 2^N is greater is 21. Spitzak (talk) 23:10, 16 May 2019 (UTC)[reply]
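For what it's worth, the pigeonhole arithmetic can be checked mechanically; a minimal C sketch (variable names are just illustrative) finds the smallest N with 2^N ≥ 0x110000:

 #include <stdio.h>

 int main(void)
 {
     unsigned long values = 0x110000;  /* code points U+0000..U+10FFFF */
     int bits = 0;
     while ((1UL << bits) < values)    /* smallest N with 2^N >= values */
         bits++;
     printf("%d bits needed for %lu values\n", bits, values);  /* prints 21 */
     return 0;
 }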
A fair challenge! How about
Over 270,000 Unicode code points have been assigned as of March, 2019, some to visible characters and some to invisible ones like the horizontal tab and characters needed to support bi-directional text. If each of these many code points is to be represented with a unique string of on and off bits, it is necessary that at least some of the strings consist of no fewer than 19 bits, too many to fit in a char.
The source for the number 270,000 can be a Babelstone web page. My text makes reference to bit strings, not code points, as fitting. All you really need to establish in a section on chars is that they are inadequate to handle all the characters that most systems want to be able to handle, which includes the ones to which code points have been assigned.
I don't know where you get the number 18 in one of your contributions above. 2¹⁸ is only 262,144, not enough for the assigned code points.
Peter Brown (talk) 01:34, 17 May 2019 (UTC)[reply]
I thought there were only about 150,000 assigned points, that was why I said 18. I feel like there is no reason for this complexity, and you are still talking about a variable-length code when you say "at least some of the strings consist of no fewer than 19 bits". Fixed-sized chars mean they all have the same number of bits, what number is stored inside them is irrelevant (except it can't be larger than that many bits).Spitzak (talk) 23:29, 17 May 2019 (UTC)[reply]
We do have to recognize, like it or not, that some encodings are variable length. In drafting the sentence you object to, I initially wrote "it is necessary that the strings consist of 19 bits" but then I thought, "Oh yeah, strings can be shorter in a variable-length encoding" and inserted "at least some of". I didn't think that you would object; after all, if all strings consist of at least 19 bits, as in UTF-32, then some strings consist of at least 19 bits. Suppose I changed the wording to "the strings — or at least some of them — consist ...". Would that meet with your approval? Or do you think it necessary to refer explicitly to fixed- and variable-length encoding schemes?
I'm not sure what you see as too complex. In the first of the two sentences, it would simplify matters if I dropped the text after "March, 2019," but I thought it worthwhile to forestall the quite reasonable assumption that a "character" is something visible, a "graphic" in Unicode terminology. Or I could replace all the text from "assigned" to the period with the phrase "to characters". Shall I do so? And do you see unnecessary complexity in the second sentence?
Peter Brown (talk) 02:32, 18 May 2019 (UTC)[reply]
My reading of the paragraph is to talk about the fact that a fixed-sized encoding would require 21 bits, chars have been stated in the previous paragraph as usually having less than 21 bits, and then say "therefore a variable-length encoding is used". This means the first phrase is specifically about not-variable-length encoding! Your wording "some Unicode code points require 21 bits" means some Unicode code points need a different size, which is the definition of a variable-length encoding! But this is the first phrase in a sentence that then says "instead a variable-length encoding is used". So the first phrase MUST talk about NOT variable-length. Really this should be trivial; I don't understand what your problem here is. Spitzak (talk) 19:39, 18 May 2019 (UTC)[reply]
I thought it clearer to use language that applied equally to fixed- and variable-length encodings, but it is certainly possible to differentiate the cases. How about:
Over 270,000 Unicode code points have been assigned so far, some to visible characters and some to invisible ones like the horizontal tab and characters needed to support bi-directional text. If each of these many code points is to be represented with a unique string of on and off bits and the bit strings are all to be the same length, then that length has to be at least 19 bits. To conserve storage, it is common to allow the strings to vary in length, as in the encoding UTF-8. Some must still have a length of at least 19 bits, however. Therefore, if a system is to represent all the characters to which code points have been assigned, some of the representations need to consist of 19 bits or more, too many to fit in a single char on most systems, so more than one char is needed in these cases.
My thought is that this text could replace the first two sentences of the second paragraph in the char section. Perhaps the third and last sentence should be a paragraph by itself.
OK? Peter Brown (talk) 21:00, 18 May 2019 (UTC)[reply]
Sorry I did not object, I guess I should have. I thought the existing text was not too horrible, but you edited again. This was supposed to be a trivial fact: you cannot store Unicode code points in 8 bits. Excessive detail about exactly how badly it does not fit in 8 bits is unnecessary. Also NOBODY is going to attempt an encoding of only assigned code points, so there is no reason to go there. If this really bothers you I think the whole thing can be deleted; just say "Unicode is usually stored using a variable length encoding..." as the start of the paragraph. There is absolutely no reason for a huge bunch of useless text for a trivial fact. Spitzak (talk) 22:51, 22 May 2019 (UTC)[reply]
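For reference, a minimal sketch of the variable-length idea discussed in this thread, assuming UTF-8 and omitting surrogate and range validation (utf8_encode is an illustrative name, not from any particular library): the encoder emits between one and four chars per code point, so a code point that needs more than 8 bits simply occupies more than one char.

 #include <stdio.h>
 #include <stdint.h>

 /* Simplified UTF-8 encoder: writes 1-4 chars for one code point and
  * returns how many were written.  No validation of surrogates or range. */
 static int utf8_encode(uint32_t cp, unsigned char out[4])
 {
     if (cp <= 0x7F) {
         out[0] = (unsigned char)cp;
         return 1;
     } else if (cp <= 0x7FF) {
         out[0] = (unsigned char)(0xC0 | (cp >> 6));
         out[1] = (unsigned char)(0x80 | (cp & 0x3F));
         return 2;
     } else if (cp <= 0xFFFF) {
         out[0] = (unsigned char)(0xE0 | (cp >> 12));
         out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
         out[2] = (unsigned char)(0x80 | (cp & 0x3F));
         return 3;
     } else {
         out[0] = (unsigned char)(0xF0 | (cp >> 18));
         out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
         out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
         out[3] = (unsigned char)(0x80 | (cp & 0x3F));
         return 4;
     }
 }

 int main(void)
 {
     unsigned char buf[4];
     printf("U+0041   -> %d char(s)\n", utf8_encode(0x41, buf));      /* 1 */
     printf("U+026A   -> %d char(s)\n", utf8_encode(0x26A, buf));     /* 2 */
     printf("U+10FFFF -> %d char(s)\n", utf8_encode(0x10FFFF, buf));  /* 4 */
     return 0;
 }

How many chars are needed depends only on the code point's magnitude, which is the point the proposed wording is trying to convey.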