looking for design feedback: string class "substring"

So I'm working on a series of string classes which work with UTF encoded strings.

I'm now working on the 'substring' functions and am a little torn as to the right way to do it.

My first thought was to have the following functions:

utfstr s = "01234567";

cout << s.Left(3);  // 012
cout << s.Mid(3);   // 34567
cout << s.Mid(3,3); // 345
cout << s.Right(3); // 567
cout << s - 3;      // 01234 


I was satisfied with this approach until I realized that these functions use indeces, and indeces can't really be used anywhere else with these strings (UTF encodings have variable-width codepoints, so they're not random-access friendly).
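
To make the random-access problem concrete, here's a rough sketch of what turning a codepoint index into a byte position looks like for UTF-8. It's only an illustration: `byte_offset_of` is a made-up helper, not part of utfstr, and it assumes the input is already valid UTF-8 stored in a plain `std::string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of utfstr): returns the byte offset of the
// codepoint at position `index`, assuming `utf8` holds valid UTF-8.
// If `index` is past the last codepoint, it returns utf8.size().
std::size_t byte_offset_of(const std::string& utf8, std::size_t index)
{
    std::size_t offset = 0;
    while (index > 0 && offset < utf8.size()) {
        ++offset;  // step past the lead byte of the current codepoint
        // Skip continuation bytes, which all have the bit pattern 10xxxxxx.
        while (offset < utf8.size() &&
               (static_cast<unsigned char>(utf8[offset]) & 0xC0) == 0x80)
            ++offset;
        --index;
    }
    return offset;
}
```

There's no way to jump straight to the answer: the function has to walk every codepoint before the one it wants, so the cost grows with the index.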

So pretty much, working with these strings means you'll have to use iterators. Find functions and the like will all be returning iterators instead of indeces, so it makes more sense to have these substring functions take iterators, right?

But this brought up some other issues:

utfstr s = "01234567";
utfstr::iterator i = s.begin() + 3;

cout << s.Left(i);   // 012 .. pretty straightforward
cout << s.Mid(i);    // 34567  ..  same

// this one is the problem
cout << s.Mid(i,3);  // 345  ..  could work. uses an integral count... is that okay?
cout << s.Mid(i,i+3);// 345  ..  would this make more sense?
                     //   the problem with this, though, is now instead of the 2nd 
                     //   param being the 'length' as with the previous function
                     //   it's now the 'end' which is inconsistent.

// these other two are a little perplexing as well

cout << s.Right(i);  // ??? this is nonsensical.  Should I just omit this?
cout << s - i;       // ???  same 
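
For what it's worth, the two Mid forms don't have to compete. Here's a minimal sketch where the count form is just a thin wrapper around the range form. Everything in it is an assumption for illustration: `mid` is a hypothetical free function, it takes an explicit `end` because it isn't a member, and a plain `std::string` stands in for utfstr.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free functions; utfstr's real interface isn't shown here.

// Range form: copies [first, last), the same half-open convention the
// standard library uses for iterator pairs.
template <class It>
std::string mid(It first, It last)
{
    return std::string(first, last);
}

// Count form: walks up to `count` steps from `first`, stopping early at
// `end`, then defers to the range form.
template <class It>
std::string mid(It first, It end, std::size_t count)
{
    It last = first;
    for (; count > 0 && last != end; --count)
        ++last;
    return mid(first, last);
}
```

With `std::string s = "01234567"` and `auto i = s.begin() + 3`, both `mid(i, i + 3)` and `mid(i, s.end(), 3)` give `345`. As member functions the two could be overloads told apart by the type of the second argument, though whether that reads as consistent is still a judgment call.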


Here's the other thing: I don't want to completely omit the index versions, since they still have practical uses (checking whether the first few characters of a string match something specific, for example).

What do you guys think? Which way should I go with this?


EDIT:

also, before you recommend going with substr instead of Mid for the above function -- I'll probably have substr in addition to the above functions (it'll probably just call Mid()).


EDIT2:

And apparently I've been spelling "indexes" wrong for years now. Ignore that please. ^^. I coulda swore it turned to a c when you made it plural.
If utfstr used another class -- utfchar -- instead of char, would that help with
the index problem at all?
I suppose it could if I really wanted to do that, but I'm not so sure that's a good idea. Every time you index the string (assuming it contains multi-byte codepoints) you'd have to step through the entire string up to the given index. This isn't the behavior you expect when you index something -- you expect it to be fast.


And then you have to consider something like this:

for(int i = 0; i < s.length(); ++i)
  DoSomething( s[i] );


People WOULD do that if you could directly index, but it'd be terrible performance-wise. Every s[i] would step through the string from the beginning, so the loop as a whole ends up quadratic in the length of the string.