Character sets in terminals and editors

With the wrong setup, the computer system can misinterpret the characters in a text. The character set UTF-8 was designed to solve this problem.

About character sets

The character set used by a program decides how the bytes of a text are interpreted as characters. It is therefore important that text is read in the same character set as it was written in. Unicode, normally encoded as UTF-8, is a comprehensive standard and prevents most problems connected to character sets.
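As a small illustration of why text must be read in the same character set as it was written in, the sketch below takes the UTF-8 bytes for the letters æøå and deliberately misreads them as ISO-8859-1. It assumes that the iconv tool is installed; the variable names are made up for this example.

```shell
# The letters æ, ø, å as UTF-8 bytes. Octal escapes keep this script
# independent of the encoding the script file itself is saved in.
utf8_text=$(printf '\303\246\303\270\303\245')

# Each of the three letters takes two bytes in UTF-8.
byte_count=$(printf '%s' "$utf8_text" | wc -c | tr -d ' ')

# Misread the UTF-8 bytes as ISO-8859-1 and convert the result to UTF-8.
misread=$(printf '%s' "$utf8_text" | iconv -f ISO-8859-1 -t UTF-8)

echo "bytes: $byte_count"
echo "misread: $misread"
```

On a UTF-8 terminal this reports 6 bytes and shows the misread text as Ã¦Ã¸Ã¥, which is the typical symptom of UTF-8 text being read as ISO-8859-1.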

A locale is a set of settings that decides which character set is used and how numbers, currency, and dates are shown.

Character sets in a terminal emulator

The first thing you have to do is configure your terminal emulator to use UTF-8. Emulators like uxterm and urxvt only support UTF-8. In the Gnome Terminal, you might have to use the menu Terminal - Set Character Set - UTF-8 if your system isn't already set to this. In PuTTY, you have to log in, choose Translation under Category, and select UTF-8 in the list of Character sets on received data.

Character sets on the login.stud and login.ansatt NTNU servers

Here we use the server login.stud.ntnu.no as an example. The same settings will work on login.ansatt.ntnu.no. This server has the English UTF-8 locale (en_US.UTF-8) as its default, so the setup should be fairly painless.

You can check which locale is in use at any given moment with the command locale:

lynx:~$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=nb_NO.UTF-8
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
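If you only want to know the character set and not the whole list, the standard command locale charmap prints the character map of the current locale. The following is a minimal sketch built on that command; the helper name is_utf8_charmap is hypothetical and exists only for this example.

```shell
# Hypothetical helper: succeeds if the given character map name is UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# `locale charmap` prints the character map of the current locale.
current_charmap=$(locale charmap 2>/dev/null)

if is_utf8_charmap "$current_charmap"; then
    echo "This shell uses UTF-8."
else
    echo "This shell uses '${current_charmap:-unknown}', not UTF-8."
fi
```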

If you want to change locale, it could be a good idea to first check what's available:

lynx:~$ locale -a
C
en_DK
en_DK.utf8
en_GB
en_GB.iso885915
en_GB.utf8
en_US
en_US.iso885915
en_US.utf8
nb_NO
nb_NO.utf8
nn_NO
nn_NO.utf8
POSIX

To change the locale, you can use these commands:

lynx:~$ export LANG=en_DK.UTF-8
lynx:~$ export LC_ALL=en_DK.UTF-8

You can then see the changes by running locale:

lynx:~$ locale
LANG=en_DK.UTF-8
LC_CTYPE="en_DK.UTF-8"
LC_NUMERIC="en_DK.UTF-8"
LC_TIME="en_DK.UTF-8"
LC_COLLATE="en_DK.UTF-8"
LC_MONETARY="en_DK.UTF-8"
LC_MESSAGES="en_DK.UTF-8"
LC_PAPER="en_DK.UTF-8"
LC_NAME="en_DK.UTF-8"
LC_ADDRESS="en_DK.UTF-8"
LC_TELEPHONE="en_DK.UTF-8"
LC_MEASUREMENT="en_DK.UTF-8"
LC_IDENTIFICATION="en_DK.UTF-8"
LC_ALL=en_DK.UTF-8

It is a good idea to put the export commands in the file .bashrc in your home directory, so you don't have to type them every time you log in.

Note that we use the locale en_DK.UTF-8. This is English as used in Denmark(!): it gives programs in English, but Nordic conventions for time, currency, and similar things that are specific to the country or region you are in. If you want the programs in Norwegian instead of English, use nb_NO.UTF-8 for Bokmål and nn_NO.UTF-8 for Nynorsk.
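The .bashrc suggestion above could be sketched as follows. This is one possible approach, not a required setup: the two function names are hypothetical, and the sketch assumes that the locale command is available. It only exports the locale if locale -a actually lists it, so a missing locale does not leave the shell in a broken state.

```shell
# Hypothetical helpers for ~/.bashrc.

# Lowercase the name and drop hyphens, so that "en_DK.UTF-8" matches
# the "en_DK.utf8" spelling printed by `locale -a`.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Export LANG and LC_ALL only if the requested locale is installed.
set_locale_if_available() {
    local wanted available
    wanted=$(normalize_locale "$1")
    for available in $(locale -a 2>/dev/null); do
        if [ "$(normalize_locale "$available")" = "$wanted" ]; then
            export LANG="$1"
            export LC_ALL="$1"
            return 0
        fi
    done
    return 1
}

# Try the locale used in this article; do nothing if it is missing.
set_locale_if_available en_DK.UTF-8 || true
```

After logging in again, running locale should show the new values if the locale was found.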

Character sets in text editors

UTF-8 in Vim

Vim keeps text internally in a single encoding (the encoding option) and converts between character sets when it reads and writes files. You can tell Vim to always use UTF-8 (unless it detects a different character set in the file) by adding the following to the file .vimrc in your home directory:

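" Encoding Vim uses internally for buffers and registers
set encoding=utf-8
" Encoding used for the terminal (keyboard input and display)
set termencoding=utf-8
" Default encoding used when writing files
set fileencoding=utf-8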

UTF-8 in Emacs

Emacs on lynx has support for UTF-8. If Emacs on your own computer does not, you most likely have to install Mule (in Debian, Ubuntu and friends: sudo aptitude install mule-ucs). When this is done, you can add the following to your .emacs (or .xemacs/custom.el):

(prefer-coding-system 'mule-utf-8)
(setq locale-coding-system 'mule-utf-8)
(set-terminal-coding-system 'mule-utf-8)
(set-keyboard-coding-system 'mule-utf-8)
(set-selection-coding-system 'mule-utf-8)

UTF-8 in irssi

The IRC client irssi has good support for different character sets. You can tell irssi that your terminal is in UTF-8 by doing the following:

/set term_charset UTF-8

If you have a relatively new version of irssi (version 0.8.10 and up), it can use a different character set in each channel. You can set the default character set:

/set recode_out_default_charset UTF-8

(or latin1 instead of UTF-8, if you want)

You can then set a character set for a single channel while that channel's window is active:

/recode add UTF-8

Use "/help recode" for more advanced use.

UTF-8 in Mutt

The email client Mutt also has support for UTF-8. In the file .muttrc in your home directory, include the lines:

set charset=utf-8
set send_charset=utf-8

If you use Vim to write emails from Mutt, you might want to include the following in .vimrc:

au BufNewFile,BufRead mutt*    set tw=77 ai nocindent fileencoding=utf-8

This sets the character set, the text width at which lines are wrapped (tw=77), and the indentation behavior to values that suit emails.

UTF-8 in Pine

Pine does not support UTF-8. 

If you really want to use Pine in a UTF-8 environment, you have to use a program such as luit to translate all characters going in and out of Pine, so that Pine still thinks it runs in an ISO-8859-1 environment:

bruker@lynx:~$ luit -encoding 'ISO-8859-1' pine

This solution rarely works exactly as intended, and in some cases will not work at all. A better solution is to find an email client that supports UTF-8.

UTF-8 in screen

The program screen is commonly used to keep programs running, and it will usually base itself on the locale in your environment. It can, however, be forced to use UTF-8 with the switch -U when you connect to a running screen session:

bruker@lynx:~$ screen -rU

If your screen session was started in the C locale or an ISO-8859-1 locale, you might have to restart it to get new windows to behave as intended.

Contact

Orakel Support Services can help if you encounter difficulties. If you are an NTNU employee, consult your local IT Support.