C# Traps and Traps – part 2

Theory vs practice.

One of the first conclusions from learning about memory, data structures and collections was that we should always choose the collection according to our usage. The next step was to learn about all the collections so we could make an informed decision while writing our code.

An array is a collection that takes uninterrupted space in memory – and thanks to that, if we know where the first element of an array is in the memory, we can compute where the n-th element is – we just have to add n multiplied by the size of the element to the first element address and as a result, we know the address of the n-th element. The problem with an array is that when we need to expand the array (add new elements) we have to find a new place in the memory that will fit the array (including new elements), copy the memory to the new address, change the pointer to the memory and then free the old memory. The details differ between implementations but it is generally a lot of work.

Linked list, on the other hand, is great if we don’t need to have fast access to the n-th element, but we would like to add a lot of elements to the collection. A linked list does not take uninterrupted memory block, but instead, every element of the collection has knowledge of where the next (and sometimes previous) element is. This way, knowing the first element, you can get the address to next one, from there the next one and so on. When you need to access an n-th element of the Linked List it is difficult, because we need to visit n elements to finally find n-th. But it is super easy to add new elements to this collection, the only thing we need to do is to modify the previous element of the collection so that it points to the new element and makes the new element point to next element of the array. No need to copy memory, we just change pointers. Neat.

 

Knowing all this, let’s take a measurement of how much time will it take to add 100 million elements to the Array and the List. Since we need to measure the time of execution, let’s do it properly. In C# there is class Stopwatch from System.Diagnostic namespace – to do that for us. Please note that we call our code twice, once before the execution and then during measurement. The first one is for Just-In-Time compiler (JIT) to do its job. If we don’t do that, our measurement will heavily depend on how fast this is compiled and not executed – and it is not something we are interested now.

private static void MeasureAction(Action action)
{
    action(); // we call it once beforehand so that JIT can complile it before the measurement
    var stopwatch = Stopwatch.StartNew();
    action();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

The code we will measure is as follows: first, we create enumerable testSet with a hundred million elements, from zero counting up. Next, we create an Array with capacity of 1, and size of 0 (it can store one element but does not store anything). Finally, in the foreach loop, we resize the Array if needed (resize will double the capacity of an Array), and then add next element to the Array. Nothing fancy yet.

int testSize = 1000 * 1000 * 100; // 100 M
IEnumerable<int> testSet = Enumerable.Range(0, testSize);

Console.WriteLine($"Adding {testSize} elements to array");
Program.MeasureAction(() =>
{
     var capacity = 1;
     var size = 0;
     var arrayToMeasure = new int[capacity];

     foreach (var testItem in testSet)
     {
         if (size == arrayToMeasure.Length)
         {
             Program.ResizeArray(ref capacity, ref arrayToMeasure);
         }
         arrayToMeasure[size++] = testItem;
     }
 });

private static void ResizeArray(ref int capacity, ref int[] arrayToMeasure)
{
    capacity *= 2;
    var newItems = new int[capacity];
    Array.Copy(arrayToMeasure, 0, newItems, 0, arrayToMeasure.Length);
    arrayToMeasure = newItems;
}

We will measure the List exactly the same – create a List with capacity 1 that does not store anything and use foreach loop to populate it with elements from our enumerable. It is way easier to write because we will just use Add method from List class and we don’t need to do anything more.

Console.WriteLine($"Adding {testSize} elements to list");
Program.MeasureAction(() =>
{
    var listToMeasure = new List<int>(1);

    foreach (var testItem in testSet)
    {
        listToMeasure.Add(testItem);
    }
});

After running the example (remember to run it without debugging and in release mode), the time measure showed that adding hundred million elements to the Array took 959 milliseconds, where doing the same to the List took 1023 milliseconds. Not as expected – and here is why.
I have been living my life thinking that System.Collections.Generic.List is a list, but in fact, the collection that drives the List is an array. Initially, I felt like I have been lied to. I remember being angry when I learned the truth – why is the name so misleading, how better my programs could have been if I knew earlier. However, the List class was my most used collection so far, it is very convenient and (I thought) fast. We need to measure few more things before we can draw a conclusion.
In the same namespace as List, there is LinkedList class (no mistakes now, this is the collection for adding new elements fast!). The code for measuring is almost identical to List.

Console.WriteLine($"Adding {testSize} elements to list");
Program.MeasureAction(() =>
{
    var listToMeasure = new LinkedList<int>(1);

    foreach (var testItem in testSet)
    {
        listToMeasure.AddLast(testItem);
    }
});

After running the code, my application did not finish on its own, I had to terminate it after a few minutes. I have reduced the number of iteration 100 times (now we are at one million elements), and the result was as follow: both the Array and the List finished in 11 milliseconds, where LinkedList took 160 milliseconds. Why is that? In a LinkedList, every time we add something the memory allocation needs to be made. It is rather fast, but it adds up. In an Array, we allocate memory only when we run out of not assigned elements, then we not only have to allocate a new Array, we have to copy its content to a new place in the memory. This costs us a lot of time – this, however, is a rather rare event, because as we said earlier, we double the capacity. In the presented example of adding hundred million elements, it does happen only 27 times. You will not be surprised that the Resize method is taken from Microsoft’s List implementation – I just stripped down few lines so that mine would be faster than original (few important lines in the normal scenario).

For me, the conclusion is that I have been using the List class for wrong reasons. Now I am using it for the good reasons.

Boxing and Unboxing

In C# they are two main kinds of types – values and references. There is a number of differences between those two, but the type system is build to enable representation of any type as an Object. The Object is a class (so a reference type) from which every reference type inherits. Because value types do not inherit from Object there is a special mechanism to enable such conversion. It is called boxing and unboxing.
Boxing is an implicit operation where a value type is converted to Object or to an interface that the value type implements. It is done by allocating reference type on the heap and copying the value type into that new object. Unboxing is always explicit. First, there is a check if the instance is indeed a boxed value of given type, then the value of the instance is copied into value type variable.
To show what problems can be encountered while working with boxing and unboxing, consider following interface and a class that implements the interface.

internal interface IPoint
{
    void SetValue(int x, int y);
}

internal struct Point : IPoint
{
    public int X;
    public int Y;

    public Point(int x, int y)
    {
        this.X = x;
        this.Y = y;
    }

    public void SetValue(int x, int y)
    {
        this.X = x;
        this.Y = y;
    }

    public override string ToString()
    {
        return string.Format("({0}, {1})", X.ToString(), Y.ToString());
    }
}

Nothing suspicious here – however, please note three things. Point is a structure, it implements an IPoint interface, and ToString is overriding a method from Object class. The ToString method will only serve us as identification of what Point are we talking about, but the rest is really important for the presentation.
We start by creating the Point structure and assign it to variable p. If we pass it into Console.WriteLine it will then execute ToString method and produce the nice output: (1, 1).

Point p = new Point(1, 1);
Console.WriteLine(p);

Let’s verify that the SetValue method works properly. It does, because after running the following code we received expected output: (2, 2).

p.SetValue(2, 2);
Console.WriteLine(p);

Now we move forward to investigate boxing. We will assign p to new Object o, and verify that it prints proper output – it does: (2, 2).

Object o = p;
Console.WriteLine(o);

Now, since we have Object o that has boxed value type Point inside, let’s cast it to Point and change value inside. Please note, that while casting we will, in fact, unbox the o because it will return to an initial type Point.

((Point)o).SetValue(3, 3);
Console.WriteLine(o);

After running the code, we notice that we did not modify the o, the console will print (2, 2) again. That is unfortunate, but there is one simple fix that we can do. Because Point implements IPoint interface, we can cast Object o to IPoint and then try to modify o.

((IPoint)o).SetValue(4, 4);
Console.WriteLine(o);

After running the code we can confirm that casting to an interface does work (output is (4, 4)) – please note, that there is no unboxing here – as both Object o and interface IPoint are a reference type. If we try to take Point p and change it into IPoint interface we won’t be able to change the value of p.

((IPoint)p).SetValue(5, 5);
Console.WriteLine(p);

After running the code, console printed, as we expect, (2, 2) – because changing the reference type o did not change the value of p. The key to understanding what is wrong here is in the description of what boxing and unboxing do – it makes a deep copy. Every time we use either boxing or unboxing mechanism, new “thing” (Object or structure) is created – we then modify this new entity with the SetValue method, but it is then never used. I think the best way to fully understand this issue is to show how to fix it.

Point temporaryPoint = ((Point)o);
temporaryPoint.SetValue(6, 6);
Console.WriteLine(temporaryPoint);

Here we use an intermediate variable of type Point, we then modify it and display it. It is working as expected.

Leave a Reply

Your email address will not be published. Required fields are marked *