C strings – part II

In the last post, the one talking about C strings, I promised a sequel about common problems when doing seemingly innocent operations with them, so here it is.
The list is by no means complete. It is actually very short, just 3 or 4 points. If you care to add more, feel free.

Overlapping strings

When talking about operations with C strings, particularly strcat and strcpy, I mentioned these should not be used with overlapping strings. In case the reasons were not evident enough, we’ll look at some examples. Keep in mind how these functions work, that is the fact that they rely only on the terminating '' character for determining where a string ends.
Now let’s see a small piece of code:
[code lang=”c”]char str1[30] = { 0 };
char *str2 = NULL;

/* store a string in str1 */
strcpy(str1, "abcdef");

/* str2 will point in the middle of str1 */
str2 = str1 + 3;

printf("noriginal strings:n");
printf("str1 = %sn", str1);
printf("str2 = %sn", str2);

/* copy str2 into str1 */
strcpy(str1, str2);

printf("nafter copy:n");
printf("str1 = %sn", str1);
printf("str2 = %sn", str2);[/code]
Here we have two strings: one has some characters and the other points inside the first one (so the strings overlap). If you compile and run the above code you will get an output like the following:
[code lang=”bash”]original strings:
str1 = abcdef
str2 = def

after copy:
str1 = def
str2 =[/code]
The first string, str1, seems to get the correct value after the copy operation, that is the value that str2 had. But at the same time it looks like str2 gets modified, which shouldn’t happen because strcpy() shouldn’t modify it’s second parameter. Right?
Something even more interesting happens if in the example above you replace the call to strcpy() with a call to strcat(), which will give you the following result:
[code lang=”bash”]original strings:
str1 = abcdef
str2 = def

after copy:
str1 = abcdefdefd
str2 = defdefd[/code]
In this case both strings are completely wrong after the concatenation.

The last string to try is this: after you have replaced the strcpy() with strcat(), make str2 start at the second character of str1 instead of the third. This would make that line look like:
[code lang=”c”]str2 = str1 + 2;[/code]
Compile and run this code. What happened? Did it blow up in your face? Of course it did.

All this behavior may look strange but it has a very good reason and that’s the '' terminator. In all the above cases you end up overwriting or moving the '' to a place that messes up one or both of the strings. To see exactly what happens take a piece of paper, a pencil and start doing the steps that strcat() would do for example on those two strings (you already know how strcat() works).
In conclusion, always always always be careful not to call these functions on overlapping strings.

How much space is enough?

This is probably the worst issue of C strings and the one that generates the most problems: you never know how much memory you need to allocate for a string. Because of this you often see or do things like:
[code lang=”c”]char myString[1000];[/code]
and hope it will be enough for what you’ll put inside. And you’ll often be wrong.
Every time you do something like this you run the risk that at some point you or somebody else will go over that limit (unintentionally or, worse, intentionally).
So what can we do to avoid this problem? Well, it turns out there are a few tricks to help us. But even following these “tricks” or rules, the risk is still there and it’s huge.
The first thing you can do is avoid functions like strcpy() or strcat() and use their length-limited counterparts instead:
[code lang=”c”]char *strncpy(char *destination, const char *source, size_t length);
char *strncat(char *destination, const char *source, size_t length);[/code]
These look exactly like the normal versions except they have an extra parameter length. This tells the functions not to copy or concatenate more that length characters from the source into the destination. This helps you avoid writing more characters than the destination fits (if you compute length properly), but creates another problem: if length is smaller than the actual length of source, the copying will stop abruptly when it reaches the limit. This means the terminating '' will not be copied, leaving you with a string that’s not properly zero-terminated. If you forget to test for this case and add the '' yourself, you’re doomed.

The second thing you can do is use dynamically allocated strings and re-allocate each time you need a bigger size:
[code lang=”c”]char *str = NULL;

str = (char *)malloc((strlen("string 1") + 1) * sizeof(char));
strcpy(str, "string 1");

str = (char *)realloc(str, strlen(str) + strlen("string 2") + 1);
strcat(str, "string 2");[/code]
This has two big problems:

  • you have to be careful about the sizes you allocate or reallocate (you still have to be careful about sizes, so you’re probably not gaining much in this area);
  • allocations are slow.

How fast can you run?

We’ll leave aside all the clever “algorithms” we may write in working with strings and talk about two things that are easily and often overlooked and that might have a huge impact on speed in certain situations.
The first of them, already mentioned above, is the fact that allocations are slow. Look again at the last example involving malloc(). Now imagine you’re not dealing with 2 strings but 200, or 2000, or 200000…00 and think about how many allocations and re-allocations you will do.
But why would this be so slow? All it’s doing is giving you a pointer to some memory location, it’s not like it’s building that whole memory chip on the spot, right? Well, it turns out it’s not quite that simple. On a very basic level, malloc works like this: it’s managing a list of memory segments of different sizes that it has available, which is called a free chain. When you ask it for a new memory zone for one of your strings (or for whatever other reasons), it iterates over this free chain searching for a segment big enough for your needs. When it finds the proper segment, it (possibly) splits it in two: a part that it marks as allocated that gets returned to you and the rest (if the block was bigger than what you asked for) is reinserted in the free chain. Because of this splitting, after a while the whole list will be made up of small blocks and allocation requests will start to cause malloc to not find a big enough segment. In this case it starts to play around with the free chain and compact it, creating big blocks from smaller ones where possible. This is somewhat similar with what a garbage collector does, if you want. This is why sometimes you will get the impression that an allocation request takes much longer that usually to complete.

The second thing that’s easily overlooked comes from our old and misguided friend, the '' terminator. Let’s see an example:
[code lang=”c”]char str[1000…]; /* a gazillion char’s or whatever, don’t worry about this */

strcpy(str, "string 1");
strcat(str, "string 2");
strcat(str, "string 3");[/code]
Remember how strcpy() and strcat() work and think about what this does. It first goes over "string 1" character by character and copies them to str. Then it goes through str character by character (so it iterates again over "string 1") until in gets to the end and then goes through "string 2" character by character and copies each one to str. Then, again, it goes through str (which now holds the first two strings together) until it gets to the end and then goes through "string 3" and copies each character to str. Got the main idea? Good, now think you’re doing this for 100 strings, not just 3. How many times will it have to walk through each of those 100 strings in total? How about if you do this for 1000 or 1000000 strings?

Leave a Reply

Your email address will not be published. Required fields are marked *