Optimizing Speed

Pages: 123

then surely it should be possible for experienced programmers to optimize the STL.

What makes you think they haven't?

What I think is that if any programmers have spent that much time optimizing the STL they have kept it propriety and have not shared it with the development communty, at least not as GPL.

exiledAussie (131)

There are some normal optimizations that haven't (I think) been mentioned yet:

1. Initialize variables rather than assigning them
2. Use pre-increments rather than post-increments
3. Do not call functions inside loops... including regarding the limit size. ie use:

1
2

int l=X.size();
for (int n=0;n<l;++n)

4. Use iteration rather than recursion. (OK, that was mentioned!)
5. Unroll loops. (Though I have found this to be hit and miss, there is much it depends on.)
6. Use the best container for the job. This does not always mean the container that has the best big O time: If you have only small collections, vectors often outperform more complex containers even when inserting and erasing.
7. Calculate using ints rather than doubles if you can, and use division sparingly.
8. If you are using mainly multipication and division, use log values.

exiledAussie (131)

Oh: And if you are using matrices, you may want to check out uBlas from boost. I can't guarantee it'll beat your code, but I suspect it will if you are dealing with large matrices.

Browni3141 (578)

What about using dynamic data structures. How can I improve speed when using these. I'm making sure to delete nodes after I'm done with them.

I don't think anyone has actually said how to get a profiler. Could you point me to one?

Any optimizations you know of would be helpful as I have loops that may iterate several billion times. Maybe even up to a trillion.

closed account (z05DSL3A)

I don't think anyone has actually said how to get a profiler. Could you point me to one?

http://en.wikipedia.org/wiki/List_of_performance_analysis_tools

closed account (Lv0f92yv)

Loops iterating billions of times... I suppose the first order of business would be finding out what operations seem to be taking the most time while inside those loops (as people have said previously by using a profiler), and from there figuring out alternative ways to accomplish the same thing. That's the hard part - but you may find that some of the things you're doing inside those loops is unnecessary. Perhaps if you post the code here, we can give you an idea of things that you don't need to be doing over and over.

Edit:

There are also bitwise operators that are really efficient, and when used in the right place, can accomplish tasks other operators accomplish, but with less time consumption. One would be using exclusive or to set a value to 0:

value ^= value; sets the value to 0.

I am not solid enough yet with bitwise ops to say for sure when it is better to use them in place of 'normal' operators, but I thought I'd put that on the table.

Last edited on

rapidcoder (1010)

2. Use pre-increments rather than post-increments

Nooo! This depends on the certain situation. Sometimes they are the same, sometimes postincrements can be faster, sometimes they can be slower. Really, no rule of thumb here.

5. Unroll loops

Onlly if you are sure, your compiler haven't already done it, and your loop is short and simple.
Otherwise it is just making your code less readable.

Another hint from me:
9. Avoid dynamic memory allocation / deallocation. new and delete are slow (>100 CPU cycles per call). The memory reference locality may also get worse with time, so this is not the only cost. Pool objects or use specialized allocation from contiguous memory fragment, if you need to allocate lots of objects. This techique is especially useful in games - all objects required to render a frame are allocated from the contiguous region and at the end they are freed by one single delete.

Last edited on

closed account (Lv0f92yv)

new and delete are slow

By contiguous region, could you elaborate on how this would be done? I have been wondering about this myself - this gets me wondering if it would be more time efficient in large collections to store references to all the new'd (allocated) objects and just go through the stored references at program end and delete everything at once, rather than deleting things when they go out of scope right away.

Galik (2254)

GCC comes with its own profiler. Just compile and link with -pg flag and run gprof on the resulting output file.

Browni3141 (578)

rapidcoder said:

Avoid dynamic memory allocation / deallocation. new and delete are slow (>100 CPU cycles per call).

I'm not sure it would be possible for me to create a program with the desired intentions without dynamic data structures. If you're right then I'm probably using several trillion CPU cycles just by using new and delete. I'm not sure how to get around this though. I can't run my program without deleting nodes or I will surely run out of memory.

Galik (2254)

You can overload the new operator and get very fast memory allocation by handing out sections from a block.

Browni3141 (578)

Galik, can you elaborate? I'm fairly new to C++; I've only been learning for 2-3 months.

I've got another quick question.
This is the node structure I'm using for my data structure. I want to be able to set the size of deck[] to size, but my compiler won't let me have a variable sized array as part of a structure. How can I fix this?

short size;

struct node
    {
        short deck[97];
        short index;
        node *next;
        node *previous;
    };

Edit: One more thing. I noticed that my program's CPU usage never breaks 50%. Why is this, and how can I use the computer's full resources?

Last edited on

Galik (2254)

Use std::vector<short> rather than an array: http://www.cplusplus.com/reference/stl/vector/

you can add functions to your struct that override the new and delete operators:

struct node
 {
    std::vector<short> deck;
    short index;
    node* next;
    node* prev;

    // the compiler will now call these functions 
    // when you creat or delete objects of type node
    void* operator new(size_t s);
    void operator delete(void* p, size_t s);
};

You can arrange it so that your version of new hands out individual chunks of memory taken from a larger lump. That way you only need to call the system memory allocator when you run out of individual pieces. But its not particularly trivial.

Last edited on

closed account (Lv0f92yv)

Sounds like a really interesting idea, I'd be interested to see a more complete implementation, and how you specify where the new memory comes from.

Galik (2254)

I have never used it myself, but Stroustrup talks about it in his book The C++ Programming Language. You can tailor the method to fit the particular situation. One very fast method is called the "arena allocator". It requests a large chunk of memory from the system at the beginning and then hands out successive addresses from within that chunk with each call to new. The delete does absolutely nothing. Only when all of the allocated objects are no longer needed is the entire chunk of memory deleted. This is only applicable in certain situations of course. Perhaps when you have a single object that manages many little objects so it has knowledge of the life-cycles of all of them.

closed account (Lv0f92yv)

Very interesting - I need to get that book.

That was a good explanation, I understand how that works now. Maybe I will try doing something like that.

jimc (62)

> I noticed that my program's CPU usage never breaks 50%.

Are you running a single threaded app on a dual core processor (one processor left idle)?

Stroustrup's book is a bit confusing, at least the version of it that I have is. The index in my version is horrible also.

Allocators can be difficult to write and get right. It can make a huge difference in performance though, if you are going to be allocating and deleting a lot of small objects.

It is best to avoid such structures since having individual nodes that can be anywhere in memory causes a lot of performance loss due to lack of locality. It can be improved by using a hand written allocator, since these grab larger chunks, all of the individual nodes are allocated within that chunk and are more likely to be cached. There is of course the overhead for the actual allocation also.

In modern processors, memory access time can kill performance. With linked list or tree nodes, the next node could be anywhere, probably not in the current cache line in the processor, possible not in the open pages in the ram which causes a huge increase in latency. For programs with a large memory footprint, it could even be in a page that has been written out to disk. It is best to avoid such pointer chasing if you can; every level of indirection means another memory access to get to the data you want. Predictable access patterns can get prefetched, and the most predictable pattern is linear access.

Almost any high performance implementation is not going to use a linked list (as you appear to be using), unless you are only accessing at the front or back of the list. A binary tree takes up the same amount of space per node, and allows for much faster search. A simple binary tree is very easy to write, but a balanced tree is significantly more complicated. To get better than a binary tree you have to go to a hash table, which is not available in the standard library. STL maps and sets are binary trees, not hash tables.

Browni3141 (578)

Are you running a single threaded app on a dual core processor (one processor left idle)?

Yes, I didn't know before, but I do have a dual core processor. Do I have to use multi-threading in order to use both CPU cores?

The structure I'm using isn't a standard linked list. It really functions more like a stack. The way I'm implementing it is also similar to a tree.

wtf (670)

I read this on a different site. That if you unsync cout and cin with stdout and stdin it will result in performance increase.

To do that:

std::ios::sync_with_stdio(false);

Problem with that is after I have called std::ios::sync_with_stdio(false); I am unable to subsequently call std::ios::sync_with_stdio(true);. Rather, such a call will fail. Does anyone know why I am unable to re-sync c++ streams with c stdio?

I suppose if you were not to mix c and c++ code it wouldn't be a problem, but having the c & c++ input/output operations in sync is what keeps

#include <iostream>
#include <cstdio>

using namespace std;

char* World(){char * temp = new char[6]; temp = "World!";  delete [] temp; return temp; }
string H = "Hello ";
char * W =  World();

int main()
{
    cout << H;
    printf("%s ", W);
    delete [] W;
    return 0;
}

from outputting as World! Hello

Last edited on

Pages: 123