Grokking Go Runes
Something I didn't pay much attention to during my first project with Go was the idea of a rune. Now that I've got more time to play with the language, I thought I'd do some digging and see what came up. It turns out that knowing why runes exist and how to deal with them is quite important.
A rune deals with text. So does a string, you might think, but a string is a sequence of bytes and not necessarily of characters. For instance, "Hello World" contains 11 characters. Because they all fall within the ASCII range, UTF-8 encodes each of them as a single byte, exactly as ASCII does, so in this particular case 11 characters equals 11 bytes.
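A quick way to check that claim is to compare the byte length of the string with its number of code points. This is only a sketch: it assumes the standard library's unicode/utf8 package, and the helper name counts is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the number of bytes and the
// number of code points (runes) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts("Hello World")
	fmt.Printf("%d bytes, %d runes\n", b, r) // 11 bytes, 11 runes
}
```

For a pure ASCII string the two numbers always agree, which is why the difference is easy to overlook.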
Go source code is UTF-8, which means your string literals can use characters that ASCII cannot represent. It also means that a string might contain more bytes than it does characters. For instance, "今日は" is hello in Japanese (or at least one form of it). The string contains three characters, but remember that characters do not necessarily equal bytes, so let's find out what the length actually is:
hello := "今日は"
fmt.Println(len(hello))
The answer is 9. Surprised? The built-in len reports the number of bytes in a string, not the number of characters. If you only ever deal with English, maybe ASCII is all you've ever needed, but UTF-8 covers far more. The cost is that characters outside the ASCII range need more than one byte each (up to four). We can see which bytes make up each character of our string with the following snippet:
for _, c := range hello {
	fmt.Printf("%c, % X\n", c, []byte(string(c)))
}
今, E4 BB 8A
日, E6 97 A5
は, E3 81 AF
Here we loop through each rune (that is, each Unicode code point) in the text. In this string every pass consumes 3 bytes, which we print alongside the displayable character. It's important to realise that the range construct decodes one whole code point per iteration, regardless of how many bytes it occupies. If we instead try to extract, say, the first character from the string via an index, we don't get the result you might expect:
fmt.Printf("%c\n", hello[0])
ä
What happened? Well, instead of trying to print the character, how about the byte?
fmt.Printf("%X\n", hello[0])
E4
Here we only got the first byte of the first character. Indexing a string yields a single byte, and the %c verb treats that byte's value, E4, as the code point U+00E4, which is ä. Not what we wanted, because our character is encoded as three bytes. However, if we convert the string to a rune slice first:
helloRunes := []rune("今日は")
fmt.Printf("%c", helloRunes[0])
fmt.Printf("%c", helloRunes[1])
fmt.Printf("%c", helloRunes[2])
then we get the right result:
今日は
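Converting to []rune copies the whole string, so when only the first character is needed it may be cheaper to decode it in place. The sketch below assumes the standard library's unicode/utf8 package, which the snippets above don't use; firstRune is a hypothetical wrapper added purely for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical helper that decodes the first code point of s
// and reports how many bytes it occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	hello := "今日は"
	r, size := firstRune(hello)
	fmt.Printf("%c uses %d bytes\n", r, size)           // 今 uses 3 bytes
	fmt.Println(utf8.RuneCountInString(hello), "runes") // 3 runes
}
```

For an empty string utf8.DecodeRuneInString returns utf8.RuneError with a size of 0, so real code would probably want to check the size before using the rune.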
If we want to know where a particular rune starts within a string, we can use IndexRune from the strings package, which reports a byte offset:
idx := strings.IndexRune(hello, '日')
fmt.Printf("index of %c starts at %d\n", '日', idx)
index of 日 starts at 3
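Note that 3 is a position in bytes, not in characters: 日 is the second character, but it starts after the three bytes of 今. If a position counted in runes is what you need, one possible approach is to count the code points that precede the byte offset. This is a sketch under that assumption; runeIndex is a hypothetical helper, not something the strings package provides.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeIndex is a hypothetical helper: it returns the position of r in s
// counted in runes rather than bytes, or -1 if r is not present.
func runeIndex(s string, r rune) int {
	byteIdx := strings.IndexRune(s, r)
	if byteIdx < 0 {
		return -1
	}
	// Count the code points that come before the byte offset.
	return utf8.RuneCountInString(s[:byteIdx])
}

func main() {
	hello := "今日は"
	fmt.Println(strings.IndexRune(hello, '日')) // 3 (byte offset)
	fmt.Println(runeIndex(hello, '日'))         // 1 (rune offset)
}
```

The rune offset lines up with an index into []rune(hello), while the byte offset lines up with an index into the string itself.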
So now we know why runes exist and when we might want to use them. It's worth keeping in mind, especially once you start dealing with slices and indexes.
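To make the point about slices concrete, here is a minimal sketch of what can go wrong. Slicing a string works on byte offsets, so a cut can land in the middle of a character, whereas slicing a rune slice cannot. The helper firstN is hypothetical, and utf8.ValidString comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstN is a hypothetical helper that returns the first n runes of s,
// or all of s if it holds fewer than n runes.
func firstN(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	hello := "今日は"
	cut := hello[:4]                   // byte slice: ends one byte into the second character
	fmt.Println(utf8.ValidString(cut)) // false
	fmt.Println(firstN(hello, 2))      // 今日
}
```

The conversion to []rune allocates, so for long strings it may be worth walking the string with range and slicing at the byte offsets it reports instead.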