C Strings

First and most important of all regarding C strings is this: there is no such thing as C strings.

OK, now that that’s out of the way we can move on. The way to work with strings in C is to treat them as simple sequences of characters: a string in C is stored as an array of char. To know where the string ends, it must be terminated by a special marker, the character with the code 0 (commonly written as '\0'). This is why they’re called zero-terminated C strings (we’ll see later that there are alternatives to this). Let’s see an example:
[code lang=”c”]char myString[100];

myString[0] = 'h';
myString[1] = 'e';
myString[2] = 'l';
myString[3] = 'l';
myString[4] = 'o';
myString[5] = '\0'; /* don't forget the marker */[/code]
If you want to do the memory allocation stuff by hand, you would just change the declaration for myString to this:
[code lang=”c”]/* malloc() is declared in <stdlib.h> */
char *myString = NULL;
myString = malloc(100 * sizeof(char)); /* returns NULL on failure; free() it when done */[/code]
Just creating a string and putting some characters inside isn’t such a big deal. For strings to be useful you have to be able to do all sorts of things with them: copying them around, splitting them up (getting sub-strings from a bigger string), putting them back together (concatenating two strings into a single one), and searching for things inside them. Fortunately, the creators of the C standard library thought of us and provided functions that do all of these things and more. All you have to do to use them is:
[code lang=”c”]#include <string.h>[/code]
We’ll look in detail at some of these functions and, in the spirit of learning by doing, we’ll also try to provide our own implementations for them.

strlen() – string length

The first thing we want to know about a string is its length: how many characters does it contain? The function that does this for us is:
[code lang=”c”]#include <string.h>

size_t strlen(const char *s);[/code]
strlen returns the length of the string, not including the terminating null character ('\0'). For the string we defined earlier, for example,
[code lang=”c”]strlen(myString);[/code]
would return 5 (go ahead, count the characters if you don’t believe me).
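It may help to see how the length relates to the storage a string occupies. The sketch below assumes an array initialised from the literal "hello"; the helper names hello_storage_size and hello_length are made up for illustration. The array holds six bytes, because the compiler adds the terminator, while strlen reports five.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: bytes occupied by an array initialised from "hello". */
size_t hello_storage_size(void)
{
    char greeting[] = "hello";  /* the compiler appends the '\0' for us */
    return sizeof greeting;     /* 5 characters + 1 terminator */
}

/* Hypothetical helper: length of the same string as strlen sees it. */
size_t hello_length(void)
{
    char greeting[] = "hello";
    return strlen(greeting);    /* stops at, and does not count, the '\0' */
}
```

The difference of one is exactly the terminator, which is why buffers must always be sized as length plus one.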

Let’s see how we would go about implementing this ourselves. Considering how we store the string, the only way of getting the length is by going through it character by character until we find the terminating null character. Something like this:
[code lang=”c”]size_t my_strlen_v1(const char *s)
{
    size_t length = 0;            /* same type as the return value */
    const char *position = s;

    while(*position != '\0')
    {
        ++length;
        ++position;
    }

    return length;
}[/code]
So what happens here? The prototype is the same as for the standard strlen() (except for the name, of course). We declare a variable that will hold our length and a temporary pointer to char that we’ll use to walk through the string. The whole job is done by the while statement: it increments the length and the position in the string until the value at the current position is the terminating null character ('\0'). When it reaches this character, the while finishes and we return the length.
You can write a small test program to check if it actually works. Compiling and running the code below should display 5.
[code lang=”c”]#include <stdio.h>

size_t my_strlen_v1(const char *);

int main(void)
{
    char myString[] = "hello";

    /* %zu is the printf conversion for size_t */
    printf("%zu\n", my_strlen_v1(myString));

    return 0;
}

size_t my_strlen_v1(const char *s)
{
    size_t length = 0;
    const char *position = s;

    while(*position != '\0')
    {
        ++length;
        ++position;
    }

    return length;
}[/code]
This is all cool and nice, but it’s a lot of code for such a small task. Let’s see if we can make it shorter and more concise:
[code lang=”c”]size_t my_strlen_v2(const char *s)
{
    size_t length = 0;

    while(*s++ != 0 && ++length);

    return length;
}[/code]
In version 2 of our own strlen function we keep the length variable but we trash the pointer we used to walk the string. The whole job is again done by the while statement, which now looks a little bit weird so let’s take it apart and see how it works.
The condition is made up of two sub-conditions: *s++ != 0 and ++length. For the first one you have to know the operator precedence rules and how the post-increment operator (++) works: *s++ is parsed as *(s++), which yields the character at the current address and moves the s pointer to the next character as a side effect. That character is then compared with the null terminator. The second ‘condition’ just increments the length by 1 (its value is the new value of length, which is never zero once it has been incremented). Because && short-circuits, if the first sub-expression is false the whole expression is false and the second sub-expression doesn’t get evaluated, so the while loop stops at the terminating null without counting it.
The last thing to notice here is the ‘;’ after the while: the loop doesn’t contain any statements, everything is done through the conditional expression.
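Another common way to write this, shown here only as a sketch, is to drop the counter entirely and subtract pointers: walk a second pointer to the terminator and take its distance from the start. The name my_strlen_v3 is ours, not part of the standard library.

```c
#include <stddef.h>

size_t my_strlen_v3(const char *s)
{
    const char *end = s;

    /* move end forward until it points at the '\0' */
    while (*end != '\0')
    {
        ++end;
    }

    /* the number of characters walked is the distance between the pointers */
    return (size_t)(end - s);
}
```

Whether this is clearer than the counting versions is a matter of taste; all three return the same value.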

strcmp – string compare

The next thing we want to do with C strings is compare them and see if they are the same or which of the two strings is larger or smaller. For this we use the strcmp function which looks like this:
[code lang=”c”]#include <string.h>

int strcmp(const char *s1, const char *s2);[/code]
It has two parameters, the two strings to compare, and returns an int with these possible values:

  • 0 – the two strings are identical
  • <0 – the first string is less than the second
  • >0 – the first string is greater than the second

Note that the definition says less than zero and greater than zero, not -1 and 1. Even if your particular implementation returns -1 and 1, this is not a requirement; you might find an implementation that works differently, so it’s important to code with this in mind. We’ll see later, when we implement our own version of strcmp, why not having this restriction is important.
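In practice this means testing only the sign of the result. A minimal sketch, with two helper names (strings_equal and comes_before) invented for this example:

```c
#include <string.h>

/* Hypothetical helper: true when the two strings have the same contents. */
int strings_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Hypothetical helper: true when a sorts before b. */
int comes_before(const char *a, const char *b)
{
    return strcmp(a, b) < 0;   /* portable; "== -1" would not be */
}
```

Code written this way keeps working no matter which negative or positive value a given library happens to return.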

But first, let’s see an example of how we would use this function:
[code lang=”c”]#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *firstString = "abcd";
    const char *secondString = "bcde";
    const char *thirdString = "pqrs";
    const char *fourthString = "ab";
    const char *fifthString = "Abcd";

    int retValue = strcmp(firstString, firstString);
    printf("strcmp(\"%s\", \"%s\") = %d\n", firstString, firstString, retValue);

    retValue = strcmp(firstString, secondString);
    printf("strcmp(\"%s\", \"%s\") = %d\n", firstString, secondString, retValue);

    retValue = strcmp(firstString, thirdString);
    printf("strcmp(\"%s\", \"%s\") = %d\n", firstString, thirdString, retValue);

    retValue = strcmp(firstString, fourthString);
    printf("strcmp(\"%s\", \"%s\") = %d\n", firstString, fourthString, retValue);

    retValue = strcmp(firstString, fifthString);
    printf("strcmp(\"%s\", \"%s\") = %d\n", firstString, fifthString, retValue);

    return 0;
}[/code]
The output I get by compiling and running the above code is this:
[code lang=”bash”]strcmp("abcd", "abcd") = 0
strcmp("abcd", "bcde") = -1
strcmp("abcd", "pqrs") = -1
strcmp("abcd", "ab") = 1
strcmp("abcd", "Abcd") = 1[/code]
First note that my implementation is one of those returning -1 and 1. Then note that the comparison is done based on the character codes (ASCII here; strcmp compares the characters as unsigned char values). The second comparison, for example, says that “bcde” is bigger than “abcd” because the code of ‘b’ is bigger than the code of ‘a’. Also look at the last comparison: “abcd” is bigger than “Abcd” because the code for ‘a’ (97) is bigger than the code for ‘A’ (65). The last thing to note is that a string is bigger than any of its proper prefixes: see the fourth comparison, which shows that “abcd” is bigger than “ab”.

Now it’s time to build our own version of strcmp. The main idea we’ll follow is to walk the strings character by character and compare the corresponding characters to see if they’re equal or which one is bigger. We’ll start with two versions of the function, one which returns -1 and 1 and one that follows the general rule of returning negative or positive values. The first and longest one in code looks like this:
[code lang=”c”]int my_strcmp_v1(const char *s1, const char *s2)
{
    while(*s1 != '\0' && *s2 != '\0')
    {
        // if the current character in s1 is smaller than the current
        // character in s2, s1 is smaller than s2 so we return -1
        if(*s1 < *s2)
        {
            return -1;
        }

        // if the current character in s1 is bigger than the current
        // character in s2, s1 is bigger than s2 so we return 1
        if(*s1 > *s2)
        {
            return 1;
        }

        // the current characters are identical, move to the next
        ++s1;
        ++s2;
    }

    // if we made it to this point it means s1, s2 or both are at the end
    // so we check in which of these situations we are
    if(*s1 == '\0' && *s2 == '\0')
    {
        // both are at the end, this means the strings are identical
        return 0;
    }
    else if(*s1 == '\0')
    {
        // s1 is at the end, s2 is not, this means s1 is smaller than s2
        return -1;
    }
    else
    {
        // the only other situation is s2 is at the end and s1 is not
        // this means s1 is bigger than s2
        return 1;
    }
}[/code]
The comments should be detailed enough to explain what happens: basically we walk both strings at the same time and at each step we compare the current characters and take appropriate actions. If you take the first example in this section and replace all calls to strcmp() with calls to my_strcmp_v1() you will get the exact same results as before.

Now let’s see a version that takes advantage of the fact that we’re not obligated to return -1 or 1 if the strings don’t match:
[code lang=”c”]int my_strcmp_v2(const char *s1, const char *s2)
{
    while(*s1 != '\0' && *s2 != '\0' && *s1 == *s2)
    {
        ++s1;
        ++s2;
    }

    return (*s1 - *s2);
}[/code]
Here we advance in both strings at once as long as the current characters are equal and neither is the terminating null character. The while statement finishes when one of these conditions fails, at which point we return the difference of the current character codes. But is this correct? Let’s see: when the while statement finishes we’re in one of the following cases:

  1. *s1 == '\0' and *s2 != '\0'
  2. *s1 != '\0' and *s2 == '\0'
  3. *s1 == '\0' and *s2 == '\0'
  4. neither is '\0', but *s1 != *s2

In the first case we’ll return (0 - *s2), which is a negative number (character codes in the ASCII range are positive) signifying that s1 is smaller than s2, which is correct.

In the second case we’ll return (*s1 - 0), which is a positive number signifying that s1 is bigger than s2, which again is correct.

In the third case both strings are at the end and we know all characters so far have been equal (because of the while condition). This means the strings are equal and we state this correctly by returning (*s1 - *s2) = (0 - 0) = 0.

In the fourth case we stopped at the first position where the strings differ, so (*s1 - *s2) is negative when the character in s1 has the smaller code and positive when it has the bigger one, which is exactly the ordering we want.

You can test this second version of the function by replacing all calls to strcmp() in the first example with calls to my_strcmp_v2(). By compiling and running the resulting code you will get an output similar to this:
[code lang=”bash”]my_strcmp_v2("abcd", "abcd") = 0
my_strcmp_v2("abcd", "bcde") = -1
my_strcmp_v2("abcd", "pqrs") = -15
my_strcmp_v2("abcd", "ab") = 99
my_strcmp_v2("abcd", "Abcd") = 32[/code]
As you can see, the exact values are different than before but the sign is the same. So by not having to return just 3 values (-1, 0, 1) we were able to make our function much more concise.
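If some caller really does need -1, 0 or 1, the difference can be collapsed to its sign afterwards. The sketch below is one way to do that; cmp_difference and cmp_sign are hypothetical names, and cmp_difference follows the same idea as my_strcmp_v2 except that it compares the characters as unsigned char, as the standard strcmp does.

```c
/* Difference-based comparison, repeated here so the sketch is self-contained. */
int cmp_difference(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s2 != '\0' && *s1 == *s2)
    {
        ++s1;
        ++s2;
    }

    /* compare as unsigned char so characters above 127 order correctly */
    return (unsigned char)*s1 - (unsigned char)*s2;
}

/* Hypothetical wrapper: maps any negative result to -1 and any positive to 1. */
int cmp_sign(const char *s1, const char *s2)
{
    int diff = cmp_difference(s1, s2);

    return (diff > 0) - (diff < 0);
}
```

With the five test strings used earlier, cmp_sign gives the same -1, 0 and 1 pattern as my_strcmp_v1.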

strcpy – string copy

So far the functions we’ve seen did something with the strings received as parameters without modifying them in any way. The first action we do that modifies the strings is to copy one string to another. For this we use the strcpy function that looks like this:
[code lang=”c”]#include <string.h>

char *strcpy(char *destination, const char *source);[/code]
What it does is copy the string pointed to by source, including the '\0' terminator, into the array pointed to by destination. It returns a pointer to the destination string.

There are two important things to be careful about when using this function. The first is that the two strings, source and destination, must not overlap (the behaviour is undefined if they do). The second and most important is to always make sure that the destination array is big enough to hold the source string and its terminating '\0'. strcpy cannot and does not make any checks regarding the lengths of the strings or the sizes of the memory allocated for them.
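Since strcpy won’t check the size for us, the caller has to. One possible way to do that is sketched below; copy_if_fits is a made-up helper, and it assumes the caller knows the full size of the destination array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dest only if it fits.
   dest_size is the full size of the destination array in bytes.
   Returns 1 on success, 0 if src plus its '\0' would not fit. */
int copy_if_fits(char *dest, size_t dest_size, const char *src)
{
    if (strlen(src) + 1 > dest_size)
    {
        return 0;           /* too long: leave dest untouched */
    }

    strcpy(dest, src);      /* safe now: we know there is room */
    return 1;
}
```

For an array declared as char buf[6], copy_if_fits(buf, sizeof buf, "hello") succeeds, while a six-character source is rejected because there would be no room left for the terminator.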

That said, let’s see strcpy in action:
[code lang=”c”]#include <stdio.h>
#include <string.h>

int main(void)
{
    char srcString[] = "source string";
    char destString1[30] = "";
    char destString2[30] = "original string";

    printf("\nOriginal strings:\n");
    printf("srcString = %s\n", srcString);
    printf("destString1 = %s\n", destString1);
    printf("destString2 = %s\n", destString2);

    strcpy(destString1, srcString);
    strcpy(destString2, srcString);

    printf("\nStrings after copy:\n");
    printf("srcString = %s\n", srcString);
    printf("destString1 = %s\n", destString1);
    printf("destString2 = %s\n", destString2);

    return 0;
}[/code]
If you compile and run the above sample you will get an output similar to this one:
[code lang=”bash”]Original strings:
srcString = source string
destString1 =
destString2 = original string

Strings after copy:
srcString = source string
destString1 = source string
destString2 = source string[/code]
As you can see, the source string was copied into the destination, replacing what was originally there, if anything. That’s all there is to it.

Let’s see if we can implement this ourselves. All we have to do is walk the source string until we get to the terminating '\0' and copy each character into the corresponding position in the destination string:
[code lang=”c”]char *my_strcpy_v1(char *dest, const char *src)
{
    char *original = dest;

    while(*src != '\0')
    {
        *dest = *src;
        ++dest;
        ++src;
    }

    *dest = '\0';
    return original;
}[/code]
Right at the start of our function we make a backup copy of the destination pointer, because we’ll modify the original later and we still need the starting address to return it. Then, in the while statement, we copy all characters from the source string into the destination string and increment both source and destination pointers (move to the next character). After the while finishes we add the terminating '\0' to the destination string (the loop stops as soon as it sees the terminator, so it was not copied inside the while). And that’s it, we’re done: we return the original pointer to destination we saved earlier.

You can test it by replacing all calls to strcpy() in the first sample with calls to our own implementation; the output should be identical to the original.

If you want, you can achieve the same result in a shorter and more concise way (as usual, this was only version 1 of the function). Go ahead, try to do this yourselves. It’s about time you got to do some actual work.
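Once you have made your own attempt, here is one possible shorter form to compare it with. It is only a sketch, and the name my_strcpy_v2 is ours: the assignment itself becomes the loop condition, so the terminator is copied before the loop notices it and stops.

```c
char *my_strcpy_v2(char *dest, const char *src)
{
    char *original = dest;

    /* copy one character, advance both pointers, and stop
       right after the '\0' itself has been copied */
    while ((*dest++ = *src++) != '\0')
        ;

    return original;
}
```

As with version 2 of strlen, the loop body is empty and all the work happens in the condition.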

strcat – string concatenation

The last thing we want to do with C strings is put two of them together – concatenate two strings into a single, bigger string. For this task we can use the strcat function:
[code lang=”c”]#include <string.h>

char *strcat(char *destination, const char *source);[/code]
The prototype looks the same as strcpy. The difference between these two is that strcat doesn’t overwrite what was already in the destination string; it appends source at the end. The position where it starts to append is the one with the terminating '\0'. This terminator is replaced by the first character of source, then the rest of the characters of source are appended up to and including the source’s '\0' terminator.

The two important things we mentioned for strcpy apply also to strcat: the two strings must not overlap and the destination must be big enough to accommodate the new, bigger string.
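For strcat the size check has to account for what is already in the destination. A minimal sketch, using a made-up helper called append_if_fits and assuming the caller passes the full size of the destination array:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src to dest only if the result fits.
   dest must already hold a valid, terminated string.
   Returns 1 on success, 0 if there is not enough room. */
int append_if_fits(char *dest, size_t dest_size, const char *src)
{
    size_t used = strlen(dest);               /* characters already there */

    if (used + strlen(src) + 1 > dest_size)   /* +1 for the final '\0' */
    {
        return 0;
    }

    strcat(dest, src);
    return 1;
}
```

With char buf[12] = "hello", appending " world" fills the array exactly (11 characters plus the terminator), and any further append is refused.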

As usual, we’ll start with an example and then we’ll build our own implementation for strcat.
[code lang=”c”]#include <stdio.h>
#include <string.h>

int main(void)
{
    char srcString[] = "source string";
    char destString1[60] = "";
    char destString2[60] = "original string";

    printf("\nOriginal strings:\n");
    printf("srcString = %s\n", srcString);
    printf("destString1 = %s\n", destString1);
    printf("destString2 = %s\n", destString2);

    strcat(destString1, srcString);
    strcat(destString2, srcString);

    printf("\nStrings after copy:\n");
    printf("srcString = %s\n", srcString);
    printf("destString1 = %s\n", destString1);
    printf("destString2 = %s\n", destString2);

    return 0;
}[/code]
You most likely noticed this is the same example as the one for strcpy. The only two differences are: we replaced the calls to strcpy with calls to strcat (of course), and we made the destination strings bigger to make sure there is enough room for the new and longer strings they’ll have to hold. If you compile and run this code you will get an output like this:
[code lang=”bash”]Original strings:
srcString = source string
destString1 =
destString2 = original string

Strings after copy:
srcString = source string
destString1 = source string
destString2 = original stringsource string[/code]
As expected, the source string was added at the end of what was already in the destination string.

Now let’s see how we would implement this:
[code lang=”c”]char *my_strcat(char *dest, const char *src)
{
    char *original = dest;

    while(*dest != '\0')
    {
        ++dest;
    }

    my_strcpy_v1(dest, src);

    return original;
}[/code]
As when we implemented strcpy, we make a copy of the original destination pointer. Then we move through the destination string until we get to the terminating '\0' character. Once we are at this position in the destination, we reuse what we have done already and call our implementation of strcpy. Remember that when we do this, dest points to the end of the destination string, so we don’t overwrite its contents; we copy the source right after them. You can check whether we did a good job by replacing the calls to the standard strcat() with calls to our version in the sample above. The output you get should be identical to the original.
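Because strcat returns the destination pointer, calls can be nested to glue several pieces together. The sketch below uses the standard strcat and a made-up helper, join_with_space; it assumes the caller’s buffer is large enough for both words, the space and the terminator.

```c
#include <string.h>

/* Hypothetical helper: builds "first second" in buf and returns buf.
   Assumes buf has room for both words, one space and the '\0'. */
char *join_with_space(char *buf, const char *first, const char *second)
{
    buf[0] = '\0';   /* start from an empty string so strcat has a terminator to find */

    /* each call returns buf, which becomes the destination of the next call */
    return strcat(strcat(strcat(buf, first), " "), second);
}
```

Note that every call walks the destination from the start to find its end again, so this style is convenient rather than fast.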

This was just a short overview of some of the functions for C string manipulation, the most used ones. There will be a new episode talking about what can go wrong when using these and ways to avoid common problems.
