I made my own string class :D

I was bored, and decided to make a string class with similar features as the strings library, with the added benefit of a few extra functions. I even went so far as adding const-correctness (if I did it right - I don't normally). Without further ado, I present my string class:

http://pastebin.com/zKsGKfHg

Any suggestions or comments appreciated ^-^
nice work man :)

Good to see people are really keen here :D

nice indeed.

in string(const char*). Any reason why 33?

in string(const string&). You don't have to memcpy the whole datasize, only datalen+1. The rest is garbage.

operator char*() and char* c_str(). Why write both?

Deleting, reallocating and then copying it terribly inefficient.
in operator = you should check if this->datasize > length of the new string. In that case you don't have to reallocate, just memcpy.
in operator += you do that, but still, you should be able to append without copying the whole thing. vector does that. I believe placement new is used for that. Though I've heard there is something wrong with using it. Don't remember. Never used it myself. There is also realloc in C library for the same purpose.

in operator +=(const char*), first part, you seem not to copy the '\0'

void clear(). Why would clear allocate something? I assume str.clear should be the same as the default constructor..

void reserve(const int). Why reallocate if newsize <= datasize ? Also, a const is quite useless here.

void upper() and void lower(). You should probably use 'A' and etc here instead of integer literals.

unsigned length() and int end(). Why have both? And why do they have different return types?

One problem with your whole code, is that you probably haven't read http://www.cplusplus.com/doc/tutorial/typecasting/ . You don't have to write every function three times (though doing so might be slightly more efficient).
Very nice. As for suggestions for improvement:

- You're mixing signed and unsigned types. Your sizes/lengths are unsigned, but you use signed indexes (making it impossible to index higher-bound things)

- Add reference counting and copy-on-write semantics so that copies are not expensive

- If you add ref counting, also add a way to make an instance unique so it can be used in multithreaded environments

- condense your code into private functions. You have a lot of duplicate/triplicate code there that really should be cleaned up. Your += operators, copy ctor, assignment operator, etc are all reallocating -- that should really be in one distinct function -- and just look at all those replace overloads! Yikes! Same problem with == and != overloads.

- where are the <, >, <=, >= overloads? Gotta have a way to sort strings.

- get rid of your magic numbers and make them static const member vars -- or maybe user modifyable static vars (you seem to like 33 for the minimum buffer size)

- ugh @ the cast operators. operator char* () is especially ugh and would have all sorts of problems if used. Consider the following:

1
2
3
4
5
6
// someone being dumb with your string class:
string foo = "test";
strcpy( foo, "a" );  // allowed because of your cast operator

if( foo == "a" ) {}  // will be true.. however...
cout << foo.length();  // prints 4!  wtf 


This is partially the fault of the char* operator, but also the fault of your == operator for using null terminators instead of the known length.


- if you want to make it more familiar, you should pick better names for some functions. In particular, "maxsize" is better known as "capacity" for people familiar with STL. Also, "end" does something entirely different in STL and your use of it is kind of confusing.

- maybe add iterators?

- no love for Unicode?
I'll see what I can do :)

As for "why 33?", it's 32 (2^5) + 1 (for the NULL character). It just seemed like a good arbitrary buffer size for the array. If I had followed the pattern I was using for strings (add 2*stringsize) the array would have to be reallocated MUCH more often. Actually, thinking back I don't think I tested to see if the string size*2 > 32, so adding a string like "to" would only increase the array size by 4... I'll fix that too.
Ok, after a few hours of const-correctness fixing, debugging, and compacting, I've come up with this new version:

http://pastebin.com/ZZDM2awS

1. I'm not really sure how to implement reference counting... suggestions?
2. I didn't bother with iterators, maybe next time.
3. Unicode? Ugh, sounds difficult... I'll consider it with my next big change.
I didn't look at it in much detail, but one thing I noticed right away was that your <, >, etc operators don't work right. The idea is they are supposed to alphabetize, but you only have them comparing string length.

I would make like a Compare function that compares two strings and returns the difference between the first mismatching character (or zero if they match) -- basically you'd make strcmp. That way you can tell which string comes first alphabetically and can just call this Compare function from your operator overloads.

As for ref counting the idea is multiple string objects can share the same buffer. You keep a count of the number of objects that have access to the buffer. You increment it when a new string gets the buffer, and decrement it when a string drops the buffer. When the count goes to zero, you release the string data.

One way to do this is to have a separate string data class, and just have your string data point to it:

1
2
3
4
5
6
7
8
9
10
11
12
class stringdata
{
  char* data;
  int refcount;

  //...
};

class string
{
  stringdata* data;
};


Another way would be to keep the ref count separate -- which avoids the double indirection:

1
2
3
4
5
6
class string
{
  char* data;
  int* refcount;  // needs to be dynamically allocated so it can be shared by
     // multiple strings
};



Assignments and copy ctors can be quick because you just use the existing string data and inc the ref count. But when any changes are made to the string, you'll need to see if it's unique, and if not, you'll have to split from the shared buffer and copy the data to a new buffer ("copy on write")
Ok, I think I understand reference counting, but I don't think I'll implement it. It seems like a lot of hassle for a small optimization :(

But, I did change the comparison operators and add iterators:

http://pastebin.com/N5EHrzms
Oh, reference counting. Do that if you want people that write multithreaded code kill you. :D
Unless you can force your users to write singlethreaded code only, reference counted strings is a very bad idea - it will be either extremeley slow and non-portable or broken. Extremely slow, because increasing and decreasing reference counts must be interlocked (and there is no way to do that in a portable way in C++). Broken if you skip the synchronisation.
Last edited on
Ok, I think I understand reference counting, but I don't think I'll implement it. It seems like a lot of hassle for a small optimization :(


It can be quite significant -- but whatever. I figured this was an academic project anyway.

Oh, reference counting. Do that if you want people that write multithreaded code kill you.


Hence my other suggestion to make it threadsafe:

1
2
3
// wrap in mutex
string foo = string_from_another_thread;
foo.makeunique();
Last edited on
And how would the programmer know it is from the same thread or some other thread?
This solution is error-prone - you can easily forget to call makeunique and the bug will be very hard to detect (this is the kind of bug that occurs randomly in production, but never when you debug). Or they will add makeunique every other line just to make sure everything is ok, killing performance.

Immutable strings are much better solution - they are faster than singlethreaded COW strings and they are safer than the standard C++ mutable strings - you are free to pass them anywhere you want, to any unit or thread, by reference or by value.

// Added later: BTW: your suggestion is *not* thread-safe. You have to wrap in a mutex every method that modifies the internal reference counts, not just in this one case when string is copied from another thread.
Last edited on
And how would the programmer know it is from the same thread or some other thread?


I would hope the programmer would be aware of the ownership structure of his program. It's not unreasonable to expect a programmer to be aware of what threads own which objects.

If the programmer is not aware of the ownership structure, he'll have bigger problems than this.

This solution is error-prone - you can easily forget to call makeunique and the bug will be very hard to detect


Then I suppose mutexes are also error prone, because you can easily forget to lock them before accessing a shared object.

Or they will add makeunique every other line just to make sure everything is ok, killing performance.


That wouldn't make performance any worse than it would be without reference counting.

Immutable strings are much better solution - they are faster than singlethreaded COW strings and they are safer than the standard C++ mutable strings - you are free to pass them anywhere you want, to any unit or thread, by reference or by value.


I agree with this statement, however the problem with immutable strings is that they're immutable.

// Added later: BTW: your suggestion is *not* thread-safe. You have to wrap in a mutex every method that modifies the internal reference counts, not just in this one case when string is copied from another thread.


Of course. Whenever any shared complex object is read/written in any way it must be in a mutex. I thought that went without saying.


Although I suppose the makeunique approach would be problematic if you're sharing classes that contain multiple strings. So maybe you're right -- maybe reference counting isn't the way to go.
Last edited on

Then I suppose mutexes are also error prone, because you can easily forget to lock them before accessing a shared object.


And they are! Therefore, a better approach is not to use mutexes directly, but provide thread-safe objects (or thread-safe wrappers for non thread-safe objects), that do locking in them (or better: don't need any locking at all, but this is often hard).


I agree with this statement, however the problem with immutable strings is that they're immutable


But this is not a real problem - because no-one prohibits you having a mutable counterpart used only for string modification, that can be initialised from any of the immutable strings. Strings are usually used in a way that they are heavily modified only when created, afterwards they usually stay unchanged.

It seems as though you are missing a certain friend istream operator << overload. I know people here hate the Command Prompt but if you're talking utility, might as well add it in there.

EDIT: I meant: friend istream operator >>. Oops, hehe.
Last edited on
Shouldn't operator char*(); cover that?
Humm, actually that should, oops. Just to check though, have you tried it?
Yeah, but I couldn't do it within the std namespace because in stringfwd.h (which is apparently included in by iostream somewhere) basic_string<char> is typedef'd as string.
So, essentially, your string class and the standard namespace are mutually exclusive. In that case, I would overload the >> operator as suggested above. However, for the record, I really do like your solution with the char* operator.
EDIT: unless you undefine it within your string class. This is not recommended as it would probably screw up a ton of stuff.
Last edited on
Topic archived. No new replies allowed.