C Strings

First and most important of all regarding C strings is this: there is no such thing as C strings.

OK, now that that’s out of the way we can move on. The way to work with strings in C is to treat them as simple sequences of characters: a string in C is stored as an array of chars. To know where the string ends, it must be terminated by a special marker, which is the character with the code 0 (written as ‘\0’ in C source). This is why they’re called zero-terminated strings (we’ll see later that there are alternatives to this). Let’s see an example:
[code lang="c"]char myString[100];

myString[0] = 'h';
myString[1] = 'e';
myString[2] = 'l';
myString[3] = 'l';
myString[4] = 'o';
myString[5] = '\0'; /* don't forget the marker */[/code]
If you want to do the memory allocation stuff by hand, you would just change the declaration for myString to this:
[code lang="c"]#include <stdlib.h> /* for malloc and free */

char *myString = malloc(100 * sizeof(char));
/* check for NULL before use; call free(myString) when done */[/code]
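To make the role of the terminator a bit more concrete, here is a minimal sketch that builds the same string in heap memory and then finds its length by walking the array until it hits the marker. The names my_length and make_hello are made up for this illustration; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Count the characters that come before the terminating '\0'. */
size_t my_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Allocate a buffer on the heap and fill in "hello" by hand.
   Returns NULL if malloc fails; the caller must free() the result. */
char *make_hello(void)
{
    char *s = malloc(100 * sizeof(char));

    if (s == NULL) {
        return NULL;
    }
    s[0] = 'h';
    s[1] = 'e';
    s[2] = 'l';
    s[3] = 'l';
    s[4] = 'o';
    s[5] = '\0'; /* without this, my_length would run past the data */
    return s;
}
```

Calling my_length on the result of make_hello gives 5: the buffer holds 100 chars, but the string ends at the marker.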
Just creating a string and putting some characters inside isn’t such a big deal. For strings to be useful you have to be able to do all sorts of things with them: copy them around, split them up (get sub-strings from a bigger string), put them back together (concatenate two strings into a single one), and search for things inside them. Fortunately, the creators of the C standard library thought of us and provided functions that do all of these things and more. All you have to do to use them is:
[code lang="c"]#include <string.h>[/code]
We’ll look in detail at some of these functions and, in the spirit of learning by doing, we’ll also try to provide our own implementations for them.
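As a first taste, here is a rough sketch of the four operations mentioned above (copy, concatenate, search, sub-string) built on standard string.h calls. The wrapper names copy_string, append_string, find_offset and sub_string are hypothetical helpers invented for this example; only strncpy, strncat, strlen, strstr and memcpy come from the standard library. Every wrapper takes the destination size, on the assumption that you would rather truncate or fail than write past the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, truncating if needed, and always terminate dst. */
void copy_string(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}

/* Append src to the string already in dst, truncating if needed. */
void append_string(char *dst, size_t dst_size, const char *src)
{
    size_t used;

    assert(dst != NULL && src != NULL);
    used = strlen(dst);
    assert(used < dst_size);
    strncat(dst, src, dst_size - used - 1);
}

/* Return the index of the first occurrence of needle, or -1 if absent. */
long find_offset(const char *haystack, const char *needle)
{
    const char *hit;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit == NULL ? -1L : (long)(hit - haystack);
}

/* Copy len characters of src, starting at index start, into dst.
   Returns 0 on success, -1 if the range or the destination is too small. */
int sub_string(char *dst, size_t dst_size, const char *src,
               size_t start, size_t len)
{
    size_t src_len;

    assert(dst != NULL && src != NULL);
    src_len = strlen(src);
    if (start > src_len || len > src_len - start || len >= dst_size) {
        return -1;
    }
    memcpy(dst, src + start, len);
    dst[len] = '\0';
    return 0;
}
```

With a 16-char buffer, copying "hello" and appending ", world" leaves "hello, world" in it; find_offset then reports "world" at index 7, and sub_string with start 7 and length 5 extracts it.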


As with so many other things, the newline (or line break or end-of-line or EOL or whatever you call it) is something we couldn’t agree on from the beginning, so we ended up with a lot of different flavors of the same thing.

The idea is simple: the newline character or group of characters says that the very next character after it should appear on a new line, immediately following the current line. The problem is that the character(s) that represent a newline vary widely across operating systems and even across different versions of the same system.

The most common forms use one or two characters to encode a newline and among these the best known version is the ASCII one (or ones, as different systems based on ASCII use different versions).
These ASCII flavors use one or both of these two characters:

  • CR (carriage return, 0x0D, usually expressed as ‘\r’)
  • LF (line feed, 0x0A, usually expressed as ‘\n’ in programming languages)

Examples of systems that use these are:

  • CR – classic Mac OS (up to version 9)
  • LF – Unix, GNU/Linux, FreeBSD, Mac OS X
  • CR+LF – Windows
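To tell these flavors apart in practice, one option is to look at the raw bytes of a file and classify the first line break you find. The sketch below does that; the enum, the ASCII_CR and ASCII_LF macros and the function first_newline are names invented for this example, and it assumes the data is ASCII (or an ASCII-compatible encoding such as UTF-8). It compares against the numeric codes 0x0D and 0x0A on purpose, since those are what is actually stored in the file.

```c
#include <assert.h>
#include <stddef.h>

#define ASCII_CR 0x0D
#define ASCII_LF 0x0A

enum newline_style { NL_NONE, NL_CR, NL_LF, NL_CRLF };

/* Report the style of the first line break found in buf[0..len). */
enum newline_style first_newline(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == ASCII_LF) {
            return NL_LF;
        }
        if (buf[i] == ASCII_CR) {
            /* CR followed directly by LF is the two-character form. */
            if (i + 1 < len && buf[i + 1] == ASCII_LF) {
                return NL_CRLF;
            }
            return NL_CR;
        }
    }
    return NL_NONE; /* no line break in the buffer */
}
```

A buffer read from a Windows text file would be expected to come back as NL_CRLF, one from a Unix-like system as NL_LF.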

If you’re using Unicode, there are also Unicode versions of these:

  • CR – U+000D
  • LF – U+000A
  • CR+LF – U+000D U+000A

OK, so why should we care about all these different notations for the same thing? If we’re developing for a single platform, we probably don’t need to care much. But with the Internet turning into one big computer, the situations where you develop for one system and can be absolutely sure you will not interact with anybody else are becoming more and more rare.

So why don’t we care if we’re developing for a single platform? Because the good people who worked on the C standard thought of this. C provides two escape sequences that represent the two codes from above. These are ‘\n’ (newline) and ‘\r’ (carriage return). What is probably unexpected about these is that they’re not required to conform to the ASCII values. The only things required by the standard are:

  • each of these has a unique value that fits inside a char, but the actual value is implementation defined;
  • when writing to a file opened in text mode, the newline character (‘\n’) is transformed transparently to the system’s character (or character group) for newline.

What this last point means is that if you take the same piece of code that writes to a text file, separating lines by ‘\n’, and compile and run it on Windows and on Linux, for example, the two output files will be different. On Windows you will get CR+LF and on Linux just LF separating the lines.
This implies that if you’re not careful when reading such files, and write code that depends on the actual character values of the newline, you will run into trouble when moving files from one system to another.
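Here is a small sketch of both sides of the problem. The first two functions write a couple of lines through a text-mode stream and then read the raw bytes back in binary mode, so you can see what actually landed on disk. The third strips a trailing line break whichever of the three flavors it is, which is one simple way to avoid depending on a single convention. All three names (write_two_lines, read_raw, strip_eol) are made up for this example, and strip_eol assumes ASCII-compatible data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write two lines through a text-mode stream; the library translates
   each '\n' to the system's own line break. Returns 0 on success. */
int write_two_lines(const char *path)
{
    FILE *f = fopen(path, "w"); /* "w" opens in text mode */

    if (f == NULL) {
        return -1;
    }
    if (fputs("one\ntwo\n", f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read up to cap raw bytes in binary mode (no translation).
   Returns the number of bytes read, or -1 if the file can't be opened. */
long read_raw(const char *path, unsigned char *buf, size_t cap)
{
    FILE *f = fopen(path, "rb");
    size_t n;

    if (f == NULL) {
        return -1;
    }
    n = fread(buf, 1, cap, f);
    fclose(f);
    return (long)n;
}

/* Remove one trailing LF, CR or CR+LF from a string of length len.
   Returns the new length. */
size_t strip_eol(char *line, size_t len)
{
    assert(line != NULL);
    if (len > 0 && line[len - 1] == 0x0A) {
        len--;
    }
    if (len > 0 && line[len - 1] == 0x0D) {
        len--;
    }
    line[len] = '\0';
    return len;
}
```

On a system that uses LF, read_raw should report 8 bytes for the file written by write_two_lines; on Windows you would expect 10, because each ‘\n’ became CR+LF on the way out.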