Threads and mutexes

Okay, so I have a blit (the process of copying a bitmap to another) function. It works fine, but there's something I don't like about it.
See, in order to speed things up I have it run on n threads (n being the number of CPUs in the system), n-1 threads being created inside the call and then killed once they're done. That works fine, but I have two concerns:
1. I'm not sure what the time overhead of creating a thread is. To mitigate this, I don't run it threaded if the surface to blit is smaller than 10000 px², but I'd still like to remove any possible overhead.
2. On Linux, and presumably other Unices as well, creating this many threads uses up so many IDs that, once the program finishes, new processes typically get PIDs in the tens of thousands.
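The first concern can be measured directly. Here is a minimal sketch, assuming a C++11 compiler and using std::thread rather than whatever thread API the blit function actually uses; averageThreadSpawnMicroseconds and its iterations parameter are made-up names for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average cost, in microseconds, of creating and joining one thread that
// does no work. Hypothetical helper, not part of the original blit code.
double averageThreadSpawnMicroseconds(unsigned iterations){
    assert(iterations>0);
    typedef std::chrono::steady_clock clock;
    clock::time_point t0=clock::now();
    for (unsigned a=0;a<iterations;a++){
        std::thread t([]{});   // empty body: only creation and teardown are timed
        t.join();
    }
    clock::time_point t1=clock::now();
    std::chrono::duration<double,std::micro> total=t1-t0;
    return total.count()/iterations;
}
```

Calling it with a few thousand iterations gives a per-thread figure that can be compared against the time a small blit takes, which is probably a better basis for the size threshold than a guess.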

So, what I did once was create n threads and leave them there. When a function had to do something threaded, it sent a function pointer and a void * to a global object. The object then set a couple of things and unlocked a mutex that one of the threads was waiting on.
This didn't work very well on Windows. I noticed that even the smallest blits (back then there was no lower surface limit) were taking 10 (or maybe 16, I can't remember) ms.

So the question is: what's a good way to have a thread waiting (that is, not using CPU time) for a signal from a different thread and respond to it as close to instantaneously as possible?
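One common answer to this kind of question is a condition variable: the waiting thread is blocked by the kernel and uses no CPU until it is notified. What follows is only a sketch under assumptions, written with the C++11 standard library instead of SDL or Win32; the Worker class and its members are hypothetical names, and the fp typedef just mirrors the "function pointer and a void *" described above:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

typedef void (*fp)(void *);

// One persistent worker thread that sleeps until a job is handed to it.
class Worker{
    std::mutex m;
    std::condition_variable cv;
    fp function;
    void *parameter;
    bool busy,quit;
    std::thread thread;   // declared last so the members above exist before it starts
    void run(){
        std::unique_lock<std::mutex> lock(m);
        while (1){
            // Blocks without spinning until there is a job or a quit request.
            cv.wait(lock,[this]{ return function || quit; });
            if (!function)   // quit was requested and no job is pending
                break;
            fp f=function;
            void *p=parameter;
            function=0;
            lock.unlock();   // run the job without holding the lock
            f(p);
            lock.lock();
            busy=0;
            cv.notify_all(); // wake anyone blocked in wait() or call()
        }
    }
public:
    Worker():function(0),parameter(0),busy(0),quit(0),thread(&Worker::run,this){}
    ~Worker(){
        {
            std::lock_guard<std::mutex> lock(m);
            quit=1;
        }
        cv.notify_all();
        thread.join();
    }
    // Hands a job to the worker; blocks first if the previous job is still running.
    void call(fp f,void *p){
        assert(f);
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock,[this]{ return !busy; });
        function=f;
        parameter=p;
        busy=1;
        cv.notify_all();
    }
    // Blocks until the job most recently passed to call() has finished.
    void wait(){
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock,[this]{ return !busy; });
    }
};
```

Because the job pointer and the busy flag are only touched while the mutex is held, a notification sent before the worker starts waiting is not lost. How quickly the worker wakes up still depends on the OS scheduler, so the latency would have to be measured on each platform.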
I have done my share of threading, but I have never been this picky before. :D LOL!

Because I haven't been this picky, I cannot really answer your question. But, I am bound to ask if you have tested other synchronization objects, like the good old event object. Mutexes are system-wide global objects, while events are per-process objects. Getting or releasing a mutex from the atomic table is probably more time and resource consuming than setting or resetting an event.

I'm sure you are better at coding than I am, so I won't suggest how to implement any of this, but drop a line if you need more info or something.
In my experience, going over 30 threads or so slows your app. In Unix and Windows, the kernel supported locks (semaphores and the like) are known to the scheduler and are extremely efficient (i.e. no CPU spinning). That's assuming you're using kernel threads rather than user threads.
In my experience, going over 30 threads or so slows your app.
Did I forget to mention the threads are destroyed when they finish? There are never more than n threads doing any work, in my current code.

I was using whatever mutexes SDL provides with whatever threads SDL provides, and I was wasting almost 10 ms during each call to my function waiting for the mutex to realize it could lock. Sure, not having a thread use CPU while it does nothing is important, but I'm not going to decuple the average call time for it.

Well, I just finished throwing this together. It works, but I'm not entirely comfortable with it. I'm doing some nasty things.
It does have a very small overhead. For 2 cores, the average time overhead is around half a millisecond.
#include <iostream>
#include <vector>
#include <windows.h>

typedef unsigned long ulong;

DWORD WINAPI runningThread(void *);

typedef void (*fp)(void *);

struct Thread{
    HANDLE handle;
    ulong index;
    volatile fp function;
    void *parameter;
    volatile bool destroy;
    Thread():initialized(0){}
    ~Thread();
    void init(ulong index);
    void call(fp f,void *p);
    void wait();
private:
    bool initialized;
};

struct ThreadManager{
    std::vector<Thread> threads;
    ulong freeThreads;
    ThreadManager(ulong cpu_count);
    ulong call(fp f,void *p);
    void wait(ulong index);
    void waitAll();
};

DWORD WINAPI runningThread(void *p){
    Thread *t=(Thread *)p;
    while (1){
        while (!t->function);
        if (t->destroy)
            break;
        t->function(t->parameter);
        t->function=0;
    }
    return 0;
}

void Thread::init(ulong index){
    this->initialized=1;
    this->index=index;
    this->handle=CreateThread(0,0,(LPTHREAD_START_ROUTINE)runningThread,this,CREATE_SUSPENDED,0);
    this->function=0;
    this->parameter=0;
    this->destroy=0;
}

Thread::~Thread(){
    if (!this->initialized)
        return;
    this->destroy=1;
    this->function=(fp)1;
    ResumeThread(this->handle);
    WaitForSingleObject(this->handle,INFINITE);
    CloseHandle(this->handle);
}

void Thread::call(fp f,void *p){
    this->function=f;
    this->parameter=p;
    ResumeThread(this->handle);
}

void Thread::wait(){
    while (this->function);
    SuspendThread(this->handle);
}

ThreadManager::ThreadManager(ulong cpu_count){
    this->threads.resize(cpu_count-1);
    for (ulong a=0;a<this->threads.size();a++)
        this->threads[a].init(a);
    this->freeThreads=this->threads.size();
}

ulong ThreadManager::call(fp f,void *p){
    long ret=-1;
    if (!this->freeThreads){
        while (ret<0){
            for (ulong a=0;a<this->threads.size() && ret<0;a++)
                if (!this->threads[a].function)
                    ret=a;
        }
    }else{
        for (ulong a=0;a<this->threads.size() && ret<0;a++)
            if (!this->threads[a].function)
                ret=a;
        this->freeThreads--;
    }
    this->threads[ret].call(f,p);
    return ret;
}

void ThreadManager::wait(ulong index){
    this->threads[index].wait();
    this->freeThreads++;
}

void ThreadManager::waitAll(){
    for (ulong a=0;a<this->threads.size();a++)
        this->wait(a);
}
In wait(), you should call WaitForSingleObject or WaitForMultipleObjects.

waitAll() should call WaitForMultipleObjects. You can wait for thread handles as they're synchronisation objects. It depends what you want to trigger on.

I wouldn't use SuspendThread.

See: http://msdn.microsoft.com/en-us/library/ms686967%28VS.85%29.aspx
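The "wait for all of them at once" advice boils down to blocking a single time until every worker has reported in. A portable sketch of that idea, using a C++11 mutex and condition variable instead of the Win32 call (Latch and runAndWaitAll are made-up names, not part of any code in this thread):

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Countdown latch: wait() blocks until countDown() has been called `count` times.
class Latch{
    std::mutex m;
    std::condition_variable cv;
    unsigned remaining;
public:
    explicit Latch(unsigned count):remaining(count){}
    void countDown(){
        std::lock_guard<std::mutex> lock(m);
        assert(remaining>0);
        if (!--remaining)
            cv.notify_all();
    }
    void wait(){
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock,[this]{ return remaining==0; });
    }
};

// Hypothetical demo: starts n threads, blocks once until all of them have
// reported in, and returns how many had finished at that point.
unsigned runAndWaitAll(unsigned n){
    Latch latch(n);
    std::atomic<unsigned> finished(0);
    std::vector<std::thread> threads;
    for (unsigned a=0;a<n;a++)
        threads.emplace_back([&]{ finished++; latch.countDown(); });
    latch.wait();                    // a single blocking wait, no polling
    unsigned seen=finished.load();
    for (std::size_t a=0;a<threads.size();a++)
        threads[a].join();
    return seen;
}
```

The latch tracks job completion rather than thread exit, so it also fits persistent worker threads that never terminate between calls.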


In my opinion, I would create one thread per core, not "minus one". But that's just me because most of the time I have a GUI that the OS manages to keep responsive enough (Vista and 7; XP not so good). I guess you are being considerate with the rest of the PC and running applications.

So, what you do in your main thread is create a ThreadManager and use that one to hold Thread instances. Cool there.

I then suppose that your main thread calls ThreadManager::call() to start some work. This in turn calls Thread::call() with the provided function pointer and data pointer. Thread::call() resumes the associated thread. At this point I start to see where the nasty is. Your worker thread is in an infinite loop (runningThread) that will spike the processor core until ThreadManager::wait() is called by the main thread whenever it senses the work has been done.

Besides that, to properly terminate the thread you need to provide a function pointer AND set destroy to nonzero, and you have to do them in reverse order or you will generate a GPF for trying to call a bogus function pointer.

Man, you went to a lot of trouble with this code. Properly used I guess it works just fine, but boy, if a newbie gets his/her hands on this it will be messy. :-S

I would not have done it like that, but again, 99.99% sure I wouldn't have timed the thread activation time.

Instead of Thread::destroy and a bogus function pointer, I would have had a HANDLE Thread::go and the bool Thread::destroy; Thread::go would have been initialized with CreateEvent() (auto reset), and runningThread would have had a WaitForSingleObject(t->go, INFINITE). Once the waiting is done, I would have processed the function, if there is one, and I would have exited the thread if destroy was set to true. This way I still have two control variables for exiting, but it is safer because I don't have to lie about a function pointer.

But again, I don't know about the performance for that one.

Oh, and my gain with CreateEvent() in auto reset mode is that I don't need a wait() function call. The thread will automatically pause. In this very aspect, my code would have beaten yours. :-)
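For readers who aren't on Windows, the behavior of an auto-reset event can be approximated with a mutex, a condition variable and a flag. This is a sketch under that assumption, not the Win32 implementation; AutoResetEvent and waitFor are hypothetical names:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Approximation of an auto-reset event: set() releases one waiter, and the
// waiter that gets released clears the signal on its way out.
class AutoResetEvent{
    std::mutex m;
    std::condition_variable cv;
    bool signaled;
public:
    AutoResetEvent():signaled(0){}
    void set(){
        {
            std::lock_guard<std::mutex> lock(m);
            signaled=1;
        }
        cv.notify_one();
    }
    // Blocks without using CPU until the event is signaled, then resets it.
    void wait(){
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock,[this]{ return signaled; });
        signaled=0;
    }
    // Same as wait(), but gives up after the timeout. Returns false on timeout.
    bool waitFor(unsigned milliseconds){
        std::unique_lock<std::mutex> lock(m);
        if (!cv.wait_for(lock,std::chrono::milliseconds(milliseconds),[this]{ return signaled; }))
            return 0;
        signaled=0;
        return 1;
    }
};
```

The point of the auto-reset behavior is that checking the signal and clearing it happen in one step under the lock, so there is no window in which a second set() could be wiped out by a separate reset call.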
In my opinion, I would create one thread per core, not "minus one". [...] I guess you are being considerate with the rest of the PC and running applications.
n-1 because the main thread also runs an instance of the function. Otherwise, I'm creating one more thread and there's another thread just waiting for the others to finish.
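The n-1 arrangement can be sketched as follows: the surface is cut into n horizontal stripes, n-1 of them go to worker threads and the calling thread copies the last one itself. This is an illustration only, assuming a plain row-major surface of unsigned pixels and std::thread; copyRows and stripedCopy are made-up names and the real blit function surely does more than a straight copy:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Copies rows [y0,y1) of a surface that is w pixels wide.
static void copyRows(const unsigned *src,unsigned *dst,std::size_t w,std::size_t y0,std::size_t y1){
    std::copy(src+y0*w,src+y1*w,dst+y0*w);
}

// Splits a w*h copy into n stripes: n-1 on new threads, one on the caller.
void stripedCopy(const std::vector<unsigned> &src,std::vector<unsigned> &dst,
                 std::size_t w,std::size_t h,unsigned n){
    assert(n>0 && src.size()==w*h && dst.size()==w*h);
    std::vector<std::thread> workers;
    for (unsigned a=0;a+1<n;a++)
        workers.emplace_back(copyRows,src.data(),dst.data(),w,h*a/n,h*(a+1)/n);
    // The calling thread does the last stripe instead of idling.
    copyRows(src.data(),dst.data(),w,h*(n-1)/n,h);
    for (std::size_t a=0;a<workers.size();a++)
        workers[a].join();
}
```

With n equal to the core count, no thread sits idle waiting for the others, which is the reason given above for not creating an n-th worker.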

Here's what my main() looks like:
void thread(void *p){
    ulong b=0;
    for (ulong a=0;a<1000000;a++)
        b+=a;
    *(ulong *)p=b;
}

#define N 10000

int main(){
    ulong cpu_count=1;
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        cpu_count=si.dwNumberOfProcessors;
    }
    ThreadManager tm(cpu_count);
    ulong t0,t1;
    t0=GetTickCount();
    ulong d;
    for (ulong a=0;a<N;a++){
        tm.call(thread,&d); //Note: this line assumes the system has only 2
                            //cores. In reality, there should be a for here.
        thread(&d);
    }
    tm.waitAll();
    t1=GetTickCount();
    double t2,t3;
    t2=double(t1-t0)/double(N);
    t0=GetTickCount();
    for (ulong a=0;a<N;a++)
        thread(&d);
    t1=GetTickCount();
    t3=double(t1-t0)/double(N);
    std::cout <<t2<<" ms/call"<<std::endl;
    std::cout <<t3<<" ms/call"<<std::endl;
    std::cout <<"Threading overhead: "<<t2-t3<<" ms."<<std::endl;
    return 0;
}

I measured it again and the overhead is ~50 us. >:-)

At this point I start to see where the nasty is. Your worker thread is in an infinite loop (runningThread) that will spike the processor core until ThreadManager::wait() is called by the main thread whenever it senses the work has been done.
The nasty part is that I'm writing to unlocked shared variables. It works, but is it guaranteed to work? Beats me.
There's no problem with the tight loop. The way I'm going to use it is a function will first occupy all threads and then wait for all threads. Using it in any other way could lead to unexpected behavior. This is only acceptable because I wrote it for a very specific purpose.
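On the "is it guaranteed to work?" question: as far as the C++11 memory model goes, volatile alone gives no ordering guarantee between threads, whereas std::atomic with release/acquire ordering does. Here is a sketch of the same spin-style handoff made well-defined that way. It assumes C++11, and Slot, spinWorker, post and waitIdle are hypothetical names, not the code above:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

typedef void (*fp)(void *);

// Shared slot between one caller and one worker.
struct Slot{
    fp function;                 // plain data, published through `pending`
    void *parameter;
    std::atomic<bool> pending;   // true while a job is waiting or running
    std::atomic<bool> destroy;
    Slot():function(0),parameter(0),pending(false),destroy(false){}
};

// Worker loop: spins (politely) until a job is posted or destroy is set.
void spinWorker(Slot *s){
    while (1){
        while (!s->pending.load(std::memory_order_acquire)){
            if (s->destroy.load(std::memory_order_acquire))
                return;
            std::this_thread::yield();
        }
        // The acquire load above makes the writes done in post() visible here.
        s->function(s->parameter);
        s->pending.store(false,std::memory_order_release);
    }
}

// Caller side: write the job first, then publish it with a release store.
void post(Slot *s,fp f,void *p){
    assert(f && !s->pending.load(std::memory_order_acquire));
    s->function=f;
    s->parameter=p;
    s->pending.store(true,std::memory_order_release);
}

// Caller side: spin until the worker reports that the job is done.
void waitIdle(Slot *s){
    while (s->pending.load(std::memory_order_acquire))
        std::this_thread::yield();
}
```

This keeps the low latency of the tight loop, at the cost of still burning CPU while waiting, so it only makes sense for the "occupy all threads, then wait for all threads" usage described above.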

you have to do them in reverse order or you will generate a GPF for trying to call a bogus function pointer.
It only looks that way. Notice that I'm doing the assignment while the thread is suspended. I just noticed, though, that I'm not making sure the thread is actually suspended.

Properly used I guess it works just fine, but boy, if a newbie gets his/her hands in this it will be messy.
No code used internally (e.g. this doesn't apply to libraries) by any program needs to account for the programmer being stupid. Just make sure you document the proper usage and be on your way.

I'm changing my code to events and see how well it performs. If it doesn't perform much worse, I'll probably switch to it permanently.
One good thing about using events (which is probably why I didn't notice your unlocked access to shared data right away) is that, properly used, they implicitly synchronize data access.

you have to do them in reverse order or you will generate a GPF for trying to call a bogus function pointer.
It only looks that way. Notice that I'm doing the assignment while the thread is suspended. I just noticed, though, that I'm not making sure the thread is actually suspended.


I think that it has to be in reverse order if you destroy the Thread struct without calling wait() first. Maybe you have this covered in your main thread's code, maybe not. It doesn't show and that is why I say it has to be done in the exact order.
Why are you trying to accomplish this time reduction, by the way, and if you don't mind my asking? Beat Photoshop??? :-P
This is what I propose to be the runningThread() function:

DWORD WINAPI runningThread(void *p){
    Thread *t=(Thread *)p;
    DWORD dwTimesUsed = 0;
    while (1){
        WaitForSingleObject(t->go, INFINITE);
        if (t->function)
        {
            t->function(t->parameter);
            t->function=0;
            dwTimesUsed++;
        }
        if (t->destroy)
            break;
    }
    return dwTimesUsed;
}


I added a counter to see how many times the thread is used. It can be picked up by retrieving the thread's exit code, but if you prefer, you can add it to the Thread structure. That is, of course, if you find value in it. It could become interesting data that can help you decide if your approach is useful/used.
There's a text output function in the program. It puts characters on the screen one at a time. Each character usually has a drop shadow (just the same character in black with an offset), and each character has to be blitted to the real screen and to a surface that will hold the contents of the current text screen. That's four calls to the blit function per character. If each call takes 10 ms, displaying 1000 characters (more or less a single screen full of text) would take 40 seconds. My fastest code can do the same almost instantly, which means a ridiculous amount of that time was being spent locking mutexes. My guess is that this is how SDL implements them:
while (is_locked)
    Sleep(1);

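With the default timer resolution on Windows, Sleep() is unable to wait much less than 10 ms (the tick is typically somewhere between 10 and 16 ms).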

These times seem insignificant, but when you need to do an operation a lot of times in a very short amount of time, you'll really feel the difference.
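The sleep granularity is easy to check on any given machine. Here is a small sketch, assuming C++11 and using std::this_thread::sleep_for as a stand-in for Sleep(1); averageOneMillisecondSleep is a made-up name:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average real duration, in milliseconds, of a requested 1 ms sleep.
// Hypothetical helper for checking the platform's sleep granularity.
double averageOneMillisecondSleep(unsigned iterations){
    assert(iterations>0);
    typedef std::chrono::steady_clock clock;
    clock::time_point t0=clock::now();
    for (unsigned a=0;a<iterations;a++)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    std::chrono::duration<double,std::milli> total=clock::now()-t0;
    return total.count()/iterations;
}
```

If the result comes out far above 1, a polling loop built on a 1 ms sleep will add roughly that much latency to every lock, which would be consistent with the 10 ms per call seen with the SDL mutexes.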


Here's what the function looks like right now (I'm not done yet and I'm still trying out events, so it may not be right):
DWORD WINAPI runningThread(void *p){
	Thread *t=(Thread *)p;
	while (1){
#if VERSION==0
		while (!t->function);
#elif VERSION==1
		WaitForSingleObject(t->startCallEvent,INFINITE);
#endif
		if (t->destroy)
			break;
		t->function(t->parameter);
		t->function=0;
#if VERSION==1
		ResetEvent(t->startCallEvent);
		SetEvent(t->callEndedEvent);
#endif
	}
	return 0;
}
I see what you mean. Sounds like you need it fast. Question: Can't you blit all on-screen characters at once? Or better yet: Use DrawText() over a memory device context with the appropriate background. First DrawText() in black and with the offset, then DrawText() in the color of the text and without offsetting it. Then blit the result to screen. I don't need code or an elaborate answer; it is more of a question to make you think of that possibility, if you haven't already.

As for your current version: If you create the startCallEvent event as an auto-reset event, then you can suppress your call to ResetEvent() because WaitForSingleObject() will reset the event at the time it returns. Unless you need the event signaled for something else (it doesn't appear to be the case), auto reset is the way to go.
Can't you blit all on-screen characters at once?
No, because each character has to be displayed individually, and because that's not how the rendering engine works. Plus there's a lot of code involving line-wrapping and a whole bunch of other complex stuff. Bottom line: I can't change how text is displayed.

Alright, I changed startCallEvent to autoresetting.
The following has an overhead identical to the last version (30-100 us), so I guess I'll keep it:
#include <iostream>
#include <vector>
#include <windows.h>

#define VERSION 1

typedef unsigned long ulong;

DWORD WINAPI runningThread(void *);

typedef void (*fp)(void *);

struct Thread{
	HANDLE handle;
#if VERSION==1
	HANDLE startCallEvent,
		callEndedEvent;
#endif
	ulong index;
	volatile fp function;
	void *parameter;
	volatile bool destroy;
	Thread():initialized(0){}
	~Thread();
	void init(ulong index);
	void call(fp f,void *p);
#if VERSION==0
	void wait();
#elif VERSION==1
	void wait(ulong timeout=INFINITE);
#endif
private:
	bool initialized;
};

struct ThreadManager{
	std::vector<Thread> threads;
	ulong freeThreads;
	ThreadManager(ulong cpu_count);
	ulong call(fp f,void *p);
	void wait(ulong index);
	void waitAll();
};

void Thread::init(ulong index){
	this->initialized=1;
	this->index=index;
#if VERSION==0
	this->handle=CreateThread(0,0,(LPTHREAD_START_ROUTINE)runningThread,this,CREATE_SUSPENDED,0);
#elif VERSION==1
	this->startCallEvent=CreateEvent(0,0,0,0);
	this->callEndedEvent=CreateEvent(0,1,1,0);
	this->handle=CreateThread(0,0,(LPTHREAD_START_ROUTINE)runningThread,this,0,0);
#endif
	this->function=0;
	this->parameter=0;
	this->destroy=0;
}

Thread::~Thread(){
	if (!this->initialized)
		return;
	this->destroy=1;
	this->function=(fp)1;
#if VERSION==0
	ResumeThread(this->handle);
	WaitForSingleObject(this->handle,INFINITE);
#elif VERSION==1
	this->wait();
	SetEvent(this->startCallEvent);
	WaitForSingleObject(this->handle,INFINITE);
	CloseHandle(this->startCallEvent);
	CloseHandle(this->callEndedEvent);
#endif
	CloseHandle(this->handle);
}

void Thread::call(fp f,void *p){
	this->function=f;
	this->parameter=p;
#if VERSION==0
	ResumeThread(this->handle);
#elif VERSION==1
	SetEvent(this->startCallEvent);
#endif
}

#if VERSION==0
void Thread::wait(){
#elif VERSION==1
void Thread::wait(ulong timeout){
#endif
#if VERSION==0
	while (this->function);
	SuspendThread(this->handle);
#elif VERSION==1
	WaitForSingleObject(this->callEndedEvent,timeout);
#endif
}

ThreadManager::ThreadManager(ulong cpu_count){
	this->threads.resize(cpu_count-1);
	for (ulong a=0;a<this->threads.size();a++)
		this->threads[a].init(a);
	this->freeThreads=this->threads.size();
}

ulong ThreadManager::call(fp f,void *p){
	long ret=-1;
	if (!this->freeThreads){
		while (ret<0){
			for (ulong a=0;a<this->threads.size() && ret<0;a++)
				if (!this->threads[a].function)
					ret=a;
		}
	}else{
		for (ulong a=0;a<this->threads.size() && ret<0;a++)
			if (!this->threads[a].function)
				ret=a;
		this->freeThreads--;
	}
	this->threads[ret].call(f,p);
	return ret;
}

void ThreadManager::wait(ulong index){
	this->threads[index].wait();
	this->freeThreads++;
}

void ThreadManager::waitAll(){
	for (ulong a=0;a<this->threads.size();a++)
		this->wait(a);
}

DWORD WINAPI runningThread(void *p){
	Thread *t=(Thread *)p;
	while (1){
#if VERSION==0
		while (!t->function);
#elif VERSION==1
		WaitForSingleObject(t->startCallEvent,INFINITE);
		ResetEvent(t->callEndedEvent);
#endif
		if (t->destroy)
			break;
		t->function(t->parameter);
		t->function=0;
#if VERSION==1
		SetEvent(t->callEndedEvent);
#endif
	}
	return 0;
}

void thread(void *p){
	ulong b=0;
	for (ulong a=0;a<1000000;a++)
		b+=a;
	*(ulong *)p=b;
}

#define N 10000

int main(){
	ulong cpu_count=1;
	{
		SYSTEM_INFO si;
		GetSystemInfo(&si);
		cpu_count=si.dwNumberOfProcessors;
	}
	ThreadManager tm(cpu_count);
	ulong t0,t1;
	t0=GetTickCount();
	ulong d;
	for (ulong a=0;a<N;a++){
		tm.call(thread,&d);
		thread(&d);
	}
	tm.waitAll();
	t1=GetTickCount();
	double t2,t3;
	t2=double(t1-t0)/double(N);
	t0=GetTickCount();
	for (ulong a=0;a<N;a++)
		thread(&d);
	t1=GetTickCount();
	t3=double(t1-t0)/double(N);
	std::cout <<t2<<" ms/call"<<std::endl;
	std::cout <<t3<<" ms/call"<<std::endl;
	std::cout <<"Threading overhead: "<<t2-t3<<" ms."<<std::endl;
	return 0;
}


Now I only need to translate it to UNIX. Any relevant functions are welcome.
Sure. All you have to do is.... er' look! Megan Fox!! I'm out.
Oh I love threading...

But who the hell is Megan Fox?xD...
This is as good as it gets on Windows. A mutex is for syncing different processes, not only threads, so it's much slower to create and lock. Also, remove version 0, because that's just fail...
lol don't even know who megan fox is... dunno why i read this thread though, made absolutely no sense to me xD although i learned the term blit and got to see a 193 line code that makes no sense to me :P seems pretty cool though, i'd compile it and try it out but i dont feel like downloading a compiler on my desktop.
There! Surprisingly enough, VC++ comes with semaphore.h, but I did have to get pthread.h.
#include <iostream>
#include <vector>
#include <ctime>

typedef unsigned long ulong;

typedef void (*fp)(void *);

//#define SYS_WINDOWS
//#define SYS_UNIX

#ifdef SYS_WINDOWS
#include <windows.h>
#elif defined(SYS_UNIX)
#include <pthread.h>
#include <semaphore.h>
#endif

class Event{
	bool initialized;
#ifdef SYS_WINDOWS
	HANDLE event;
#elif defined(SYS_UNIX)
	sem_t sem;
#endif
public:
	Event():initialized(0){}
	void init();
	~Event();
	void set();
	void reset();
	void wait();
};

class Thread{
	bool initialized;
#ifdef SYS_WINDOWS
	HANDLE thread;
#elif defined(SYS_UNIX)
	pthread_t thread;
#endif
	ulong index;
	volatile bool destroy;
	void *parameter;
public:
	Event startCallEvent,
		callEndedEvent;
	volatile fp function;
	Thread():initialized(0){}
	~Thread();
	void init(ulong index);
	void call(fp f,void *p);
	void wait();
#ifdef SYS_WINDOWS
	static DWORD WINAPI runningThread(void *);
#elif defined(SYS_UNIX)
	static void *runningThread(void *);
#endif
};

class ThreadManager{
	std::vector<Thread> threads;
	ulong freeThreads;
public:
	ThreadManager(ulong cpu_count);
	ulong call(fp f,void *p);
	void wait(ulong index);
	void waitAll();
};

void Event::init(){
#ifdef SYS_WINDOWS
	this->event=CreateEvent(0,0,0,0);
#elif defined(SYS_UNIX)
	sem_init(&this->sem,0,0);
#endif
	this->initialized=1;
}

Event::~Event(){
	if (!this->initialized)
		return;
#ifdef SYS_WINDOWS
	CloseHandle(this->event);
#elif defined(SYS_UNIX)
	sem_destroy(&this->sem);
#endif
}

void Event::set(){
#ifdef SYS_WINDOWS
	SetEvent(this->event);
#elif defined(SYS_UNIX)
	sem_post(&this->sem);
#endif
}

void Event::reset(){
#ifdef SYS_WINDOWS
	ResetEvent(this->event);
#endif
}

void Event::wait(){
#ifdef SYS_WINDOWS
	WaitForSingleObject(this->event,INFINITE);
#elif defined(SYS_UNIX)
	sem_wait(&this->sem);
#endif
}

void Thread::init(ulong index){
	this->initialized=1;
	this->index=index;
	this->startCallEvent.init();
	this->callEndedEvent.init();
#ifdef SYS_WINDOWS
	this->thread=CreateThread(0,0,(LPTHREAD_START_ROUTINE)runningThread,this,0,0);
#elif defined(SYS_UNIX)
	pthread_create(&this->thread,0,runningThread,this);
#endif
	this->function=0;
	this->parameter=0;
	this->destroy=0;
}

Thread::~Thread(){
	if (!this->initialized)
		return;
	this->wait();
	this->destroy=1;
	this->startCallEvent.set();
#ifdef SYS_WINDOWS
	WaitForSingleObject(this->thread,INFINITE);
	CloseHandle(this->thread);
#elif defined(SYS_UNIX)
	pthread_join(this->thread,0);
#endif
}

void Thread::call(fp f,void *p){
	this->function=f;
	this->parameter=p;
	this->startCallEvent.set();
}

void Thread::wait(){
	if (!this->function)
		return;
	this->callEndedEvent.wait();
}

ThreadManager::ThreadManager(ulong cpu_count):threads(cpu_count-1){
	for (ulong a=0;a<this->threads.size();a++)
		this->threads[a].init(a);
	this->freeThreads=this->threads.size();
}

ulong ThreadManager::call(fp f,void *p){
	long ret=-1;
	if (!this->freeThreads){
		while (ret<0){
			for (ulong a=0;a<this->threads.size() && ret<0;a++)
				if (!this->threads[a].function)
					ret=a;
		}
	}else{
		for (ulong a=0;a<this->threads.size() && ret<0;a++)
			if (!this->threads[a].function)
				ret=a;
		this->freeThreads--;
	}
	this->threads[ret].call(f,p);
	return ret;
}

void ThreadManager::wait(ulong index){
	this->threads[index].wait();
	this->freeThreads++;
}

void ThreadManager::waitAll(){
	for (ulong a=0;a<this->threads.size();a++)
		this->wait(a);
}

#ifdef SYS_WINDOWS
DWORD WINAPI 
#elif defined(SYS_UNIX)
void *
#endif
Thread::runningThread(void *p){
	Thread *t=(Thread *)p;
	while (1){
		t->startCallEvent.wait();
		if (t->destroy)
			break;
		t->function(t->parameter);
		t->function=0;
		t->callEndedEvent.set();
	}
	return 0;
}

void thread(void *p){
	static ulong d=0;
	ulong b=0;
	for (ulong a=0;a<1000000;a++)
		b+=a;
	*(ulong *)p=b;
	//std::cout <<++d<<std::endl;
}

#define N 100000

int main(){
	ulong cpu_count=2;
	ThreadManager tm(cpu_count);
	ulong t0,t1;
	t0=clock();
	ulong d;
	for (ulong a=0;a<N;a++){
		tm.call(thread,&d);
		thread(&d);
	}
	tm.waitAll();
	t1=clock();
	double t2,t3;
	t2=double(t1-t0)/double(CLOCKS_PER_SEC)*1000/double(N);
	t0=clock();
	for (ulong a=0;a<N;a++)
		thread(&d);
	t1=clock();
	t3=double(t1-t0)/double(CLOCKS_PER_SEC)*1000/double(N);
	std::cout <<t2<<" ms/call"<<std::endl;
	std::cout <<t3<<" ms/call"<<std::endl;
	std::cout <<"Threading overhead: "<<t2-t3<<" ms."<<std::endl;
	return 0;
}
Code like:
    WaitForSingleObject(t->startCallEvent,INFINITE);
    ResetEvent(t->callEndedEvent);

is a race condition. It's better to use an auto-reset event (a parameter to CreateEvent).
Look at the newest version. I removed that.