What Is the GSM Alphabet?
The GSM alphabet, or GSM-7, refers to a character encoding standard that packs the most common letters and symbols in various languages into seven bits each for GSM networks.
SMS text messages are sent in payloads of 140 8-bit octets at a time. Since 140 octets hold 1,120 bits and a septet (7 bits) represents each GSM character, a GSM-7 encoded message can carry up to 160 characters per text. The ESC (escape code) character signals that the septet that follows belongs to the “basic character extension” set rather than the basic set.
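The 160-character figure follows from how septets are packed back to back into octets. Below is a minimal Python sketch of that packing, assuming the input is already a list of 7-bit GSM code values; the function name `pack_septets` is made up for illustration and does not come from any library.

```python
def pack_septets(septets):
    """Pack 7-bit values into octets, filling each octet from its low bits up."""
    bits = 0    # bit accumulator holding septets not yet written out
    nbits = 0   # number of valid bits currently in the accumulator
    out = bytearray()
    for s in septets:
        bits |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        # Flush the leftover bits, padded with zeros.
        out.append(bits & 0xFF)
    return bytes(out)


# Lowercase Latin letters have the same values in GSM-7 as in ASCII,
# so ord() is enough for this small demonstration.
packed = pack_septets([ord(c) for c in "hello"])
print(packed.hex())                      # 5 septets fit in 5 octets
print(len(pack_septets([0x41] * 160)))   # 160 septets fill 140 octets
```

Five septets still need five octets, but the saving adds up: 160 septets occupy exactly 140 octets, with no bits left over.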
Standard SMS Length Using GSM Characters
A standard SMS message can carry up to 160 GSM characters. These characters need to be part of a 7-bit alphabet, which GSM 03.38 defines. This alphabet consists of most of the printable ASCII characters and a few accented ones. Examples include the e with a grave accent (è) and the n with a tilde (ñ).
Characters beyond this set have to be sent as Unicode, which limits the SMS text length to 70 characters. This is because Unicode text is encoded with 16 bits (two octets) per character, so only 70 characters fit in the 140-octet payload.
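When the number of characters passes the limit of 160 or 70 characters, the message splits into smaller segments. Each segment gives up part of its payload to a header that lets the receiving phone put the parts back together, so a GSM segment carries 153 characters rather than 160. With standard SMS (GSM SMS), the first segment contains characters 1 to 153, the second characters 154 to 306, and the third characters 307 to 459. This is known as SMS concatenation.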
Unicode messages (international SMS) allow for seven parts of up to 67 characters each: the first consists of characters 1 to 67, the second of characters 68 to 134, the third runs until 201, the fourth until 268, and so on.
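The segment boundaries can be reproduced with a little arithmetic. The sketch below assumes the common 6-octet concatenation header, which leaves room for 153 GSM-7 characters or 67 Unicode characters per part. The function names are hypothetical, and the code does not enforce any limit on the number of parts.

```python
import math


def segment_count(length, single_limit, part_limit):
    """Return how many SMS segments a message of `length` units needs."""
    if length <= single_limit:
        return 1  # fits in one message, no concatenation header needed
    return math.ceil(length / part_limit)


def gsm7_segments(n_chars):
    # 160 characters alone, 153 per part once the message is split
    return segment_count(n_chars, 160, 153)


def ucs2_segments(n_chars):
    # 70 characters alone, 67 per part once the message is split
    return segment_count(n_chars, 70, 67)


for n in (160, 161, 306, 307):
    print(n, "GSM characters ->", gsm7_segments(n), "segment(s)")
for n in (70, 71, 134, 135):
    print(n, "Unicode characters ->", ucs2_segments(n), "segment(s)")
```

Note that for GSM-7 text the length should be counted in septets, so characters from the extension set count twice.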
GSM-7 Encoding Characteristics
GSM-7 is the default alphabet that GSM handsets and cellular networks support. A 7-bit code offers only 128 code points, though, so languages that need symbols outside this set require something more. Implementing local language support involves using shift tables or switching to UCS-2 encoding (16-bit).
As for characters like the circumflex accent (^) and the square brackets, an escape code becomes necessary. This means that in GSM-7 encoded text, it takes two septets to encode each extension symbol: the escape prefix, followed by the symbol's code in the extended GSM character set. Each of these symbols therefore counts as two characters toward the message limit.
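To see the effect on message length, the following Python sketch counts septets rather than characters. It assumes the text has already been checked to contain only GSM-7 characters; `GSM7_EXTENSION` lists the extension symbols of the default GSM 03.38 extension table, and the function name is made up for this example.

```python
# Symbols that live in the extension table and need the escape prefix:
# form feed, ^ { } \ [ ~ ] | and the euro sign.
GSM7_EXTENSION = set("\f^{}\\[~]|€")


def gsm7_septet_count(text):
    """Count the septets needed for text made up of GSM-7 characters only."""
    return sum(2 if ch in GSM7_EXTENSION else 1 for ch in text)


print(gsm7_septet_count("hello"))   # basic characters only
print(gsm7_septet_count("[ok]"))    # each bracket costs two septets
```

A message that looks like 160 characters on screen can therefore spill into a second segment if it contains brackets, braces, or a euro sign.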
UCS-2 Encoding
UCS-2 (Universal Coded Character Set) encoding enables a wider range of characters and languages. These include the most common Latin as well as Eastern characters. But this comes at the expense of greater space, since every character takes 16 bits. UCS-2 is also limited to Basic Multilingual Plane (BMP) characters.
Because many modern programming environments don't offer dedicated encoders and decoders for UCS-2, certain mobile phones use UTF-16 (Unicode Transformation Format) instead. iPhones are one example of these devices. This works because, for characters in the BMP, the UTF-16 and UCS-2 encodings are identical.
When it comes to encoding characters beyond the BMP, like emojis, UTF-16 utilizes surrogate pairs: two 16-bit code units that together stand for one character. A decoder that reads these characters strictly as UCS-2 ends up with two valid yet unmapped code points.
Most GSM mobile phones don't select UCS-2 encoding up front. They use 7-bit encoding by default. When users enter a character that's not available in the GSM-7 character set, the entire message gets reencoded in UCS-2. In this case, the character limit goes down from 160 to 70.
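A short Python sketch makes the surrogate mechanism visible. It uses the standard `utf-16-be` codec and splits the result into 16-bit code units; the helper names are illustrative only.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a code unit falls in the range reserved for surrogates."""
    return 0xD800 <= unit <= 0xDFFF


for sample in ("A", "ñ", "\U0001F600"):   # U+1F600 is an emoji outside the BMP
    units = utf16_code_units(sample)
    print(repr(sample), [hex(u) for u in units],
          "surrogate pair" if any(is_surrogate(u) for u in units) else "BMP")
```

The BMP characters come out as a single code unit each, identical to their UCS-2 form, while the emoji takes two code units and so uses up two of the 70 character slots in a Unicode SMS.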
GSM Alphabet Vs. Unicode Character Set
GSM and Unicode are the two character encoding types that an SMS message can use. GSM, short for Global System for Mobile Communications, defines an alphabet of letters and symbols that each fit in seven bits, so nearly all of them count as just one character in a text. The GSM alphabet is geared primarily toward English and contains Latin characters, digits, and some special characters.
In terms of mobile communication, GSM is the technology that came before UMTS (Universal Mobile Telecommunications System).
Unicode covers the characters of many different languages and came into existence later than GSM characters. Its characters generally count as a single character as well, but each one takes more bits to encode. While standard SMS is limited to a mostly Latin set of characters, Unicode SMS can communicate in local and regional languages.
All SMS messages, GSM or Unicode, have the same amount of data available. The only difference here is that Unicode uses more data for each character: 16 bits instead of 7. This means fewer Unicode characters can fit in a text message, which explains the 70-character limit.
As both Unicode and GSM are encoding standards, the SMS text will either be entirely Unicode or entirely GSM. If you use one character that GSM doesn't support, the whole message will change to Unicode.
An emoji is one example of a character that only Unicode supports. Using even one such character switches all the other characters in the text to Unicode.
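The all-or-nothing rule can be sketched in a few lines of Python. The character sets below follow the default GSM 03.38 basic and extension tables; the function name `choose_encoding` is an assumption made for this example, and real phones or SMS gateways may apply extra rules, such as shift tables or character substitution.

```python
# GSM 03.38 default basic character set, in code order (128 entries).
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Symbols reachable through the escape prefix.
GSM7_EXTENSION = "\f^{}\\[~]|€"
GSM7_CHARS = set(GSM7_BASIC) | set(GSM7_EXTENSION)


def choose_encoding(text):
    """Return "GSM-7" if every character is in the GSM alphabet, else "UCS-2"."""
    if all(ch in GSM7_CHARS for ch in text):
        return "GSM-7"
    return "UCS-2"


print(choose_encoding("Hello, world!"))
print(choose_encoding("Price: 5€ [ok]"))
print(choose_encoding("Hello \U0001F600"))   # one emoji switches the whole text
```

The check is made per message, not per character: a single unsupported character is enough to move the entire text to UCS-2 and its 70-character limit.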
National Language Shift Tables
Shift tables enable access to characters in SMS that are relevant to other languages. The User Data Header of an SMS message allows for selecting shift tables. While a locking shift table specifies the table for the entire text message, a single shift table is for individual characters.
With a shift table, the SMS text can still utilize 7-bit encoding for the characters, with a different character set swapped in to properly show accented or language-specific characters. Such a message can carry up to 155 characters encoded in 136 octets. The User Data Header uses 4 octets out of 140 to define the usage of the shift table and language code.
With locking and single shift tables together, up to 152 characters are possible. These are encoded in 133 octets (7 octets of User Data Header out of 140).
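These limits follow from the space left after the header. As a rough check, the Python sketch below computes how many whole septets fit in the octets that remain; the function name is made up for illustration, and the header sizes are the ones quoted above.

```python
SMS_PAYLOAD_OCTETS = 140


def septet_capacity(udh_octets):
    """Whole 7-bit characters that fit after a User Data Header of this size."""
    remaining_bits = (SMS_PAYLOAD_OCTETS - udh_octets) * 8
    return remaining_bits // 7


print(septet_capacity(0))   # no header
print(septet_capacity(4))   # one shift table selected
print(septet_capacity(7))   # locking and single shift tables together
```

With a 4-octet header, 136 octets remain, which is 1,088 bits or 155 whole septets. With a 7-octet header, 133 octets remain, which is 1,064 bits or exactly 152 septets.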
Shift tables support Spanish, Portuguese, Turkish, Urdu, Hindi, Bengali (and Assamese), Punjabi, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam.