Word Count

Can someone help me write a program that counts words, or help me understand how to create one? That would be helpful!

Especially using no vectors, only dynamically allocated arrays, please.

jonnin (11437)

You do not need either. Just count them.

The format of the input matters...
Does it double space after a period? Do you need to detect and ignore non-words, like a number or email address?

For a simple sentence, counting the spaces will work, as long as you adjust for leading and trailing spaces at the two ends.

E.g.

this has four words(endl)
----s----s----s----

3 spaces, 4 words. That pattern holds up fairly well in standard text; is it good enough for yours? Do you understand how cin works and what it does when it hits a space?

For statistics, e.g. how many times 'the' appears, you do need to store something. At that point, the problem needs a lot more words to describe exactly what you need to do.

Duthomhas (13207)

You can typically ignore ALL non-alpha characters and get a proper word count.
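
One possible reading of that suggestion, as a sketch: treat every non-alphabetic character as a separator and count the runs of letters in between. The function name count_alpha_words is hypothetical, and the choice to let digits and apostrophes split words is an assumption, not something stated above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Counts maximal runs of alphabetic characters; everything else separates.
// Assumption: digits and punctuation split words, so "don't" counts as 2.
std::size_t count_alpha_words(const std::string& text) {
    std::size_t n = 0;
    bool in_word = false;
    for (char ch : text) {
        // <cctype> functions require a value representable as unsigned char.
        const bool alpha = std::isalpha(static_cast<unsigned char>(ch)) != 0;
        if (alpha && !in_word)
            ++n;              // first letter of a new run
        in_word = alpha;
    }
    return n;
}
```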

foxmarine (4)

Well, the stipulations are that it has to read in a txt file, then count how many times each word appears; it can be case insensitive or case sensitive. It has to print the words from least frequent to most frequent, with the count next to each word.

For example:

Hello 0

Hi 1

There 2

He 3

jonnin (11437)

OK, that is a bit more involved. You need to use a to-upper or to-lower across each word to normalize the case, and store an entry for each word next to its count in some sort of data structure.

The modern C++ way to do this with minimal coder effort would be to use a map. Are you allowed to use modern tools, or is this a 'both hands tied behind my back' school problem?

It may take a bit of doing to print every word not in the file; there are online dictionaries, I guess, that you can pull from. Are you sure about that Hello count zero entry? :P

Duthomhas (13207)

That's a little more specific than "counts words".

You are asking to histogram a file.

When you say "dynamically-allocated arrays" do you mean build yourself a tree or linked list that you can modify? Or just one big array that you allocate once at the beginning?
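
To make the second option concrete, here is a rough sketch of a single array that is reallocated at double the size whenever it fills up. Entry, WordTable, push_back and clear are all hypothetical names, and the use of std::string for the words is an assumption about what the assignment permits.

```cpp
#include <cstddef>
#include <string>

struct Entry {            // hypothetical record: one word and its count
    std::string word;
    int count = 0;
};

struct WordTable {        // hypothetical growable array of Entry
    Entry* data = nullptr;
    std::size_t size = 0;
    std::size_t capacity = 0;
};

// Appends an entry, doubling the allocation when the array is full.
void push_back(WordTable& t, const std::string& word) {
    if (t.size == t.capacity) {
        const std::size_t new_cap = t.capacity ? t.capacity * 2 : 8;
        Entry* bigger = new Entry[new_cap];
        for (std::size_t i = 0; i < t.size; ++i)
            bigger[i] = t.data[i];      // copy the old entries across
        delete[] t.data;                // deleting nullptr is a no-op
        t.data = bigger;
        t.capacity = new_cap;
    }
    t.data[t.size].word = word;
    t.data[t.size].count = 1;
    ++t.size;
}

// Releases the array; the table can be reused afterwards.
void clear(WordTable& t) {
    delete[] t.data;
    t = WordTable{};
}
```

Doubling keeps the total copying proportional to the final number of entries, which is the usual reason to prefer it over growing by one slot at a time.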

What level class is this? 101?

seeplus (6592)

You need to define what is meant by a 'word', or get your professor etc. to define what is meant. At its simplest, is a word a sequence of chars delimited by white space (space, tab, newline) or by the beginning or end of the text? Is that sufficient? Do you need to remove punctuation from within the word?

PS this question has been asked previously on this forum.

seeplus (6592)

As a starter, perhaps this. It will count the number of words (delimited by white-space, converted to lowercase and ignoring non-alpha chars) in a std::string and display the count for each different word. You'll need to add reading from a file and, if required, sorting/displaying the list in count order. This uses dynamic memory allocation in the list class to store the words and counts. I'm assuming you can use std::string...

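#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

class MyList {
public:
	MyList() = default;

	// The list owns raw nodes, so forbid copying to avoid a double delete.
	MyList(const MyList&) = delete;
	MyList& operator=(const MyList&) = delete;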

	~MyList() {
		while (head) {
			const auto cur {head};

			head = head->next;
			delete cur;
		}
	}

	void add(const std::string& wd) {
		for (auto cur = head; cur; cur = cur->next)
			if (cur->wd == wd) {
				++cur->cnt;
				return;
			}

		head = new Node(wd, head);
	}

	void display() {
		for (auto cur {head}; cur; cur = cur->next)
			std::cout << cur->cnt << "  " << cur->wd << '\n';
	}

private:
	struct Node {
		Node* next {};
		std::string wd;
		size_t cnt {};

		Node() {}

		Node(const std::string& w, Node* nxt, size_t ct = 1) : next(nxt), wd(w), cnt(ct) {}
	};

	Node* head {};
};

int main() {
	const std::string text {"these are    words in    the    sentence. These are     also words    in a different sentence! "};

	MyList words;
	std::string wd;

	for (auto chp {text.c_str()}; *chp; ++chp) {
		const auto ch {static_cast<unsigned char>(*chp)};

		if (std::isspace(ch)) {
			// End of a token: store it only if it contained letters,
			// so a token such as "123" does not add an empty word.
			if (!wd.empty()) {
				words.add(wd);
				wd.clear();
			}
		} else if (std::isalpha(ch))
			wd += static_cast<char>(std::tolower(ch));
	}

	// Flush the last word if the text does not end in white-space.
	if (!wd.empty())
		words.add(wd);

	words.display();
}


1  different
1  a
1  also
2  sentence
1  the
2  in
2  words
2  are
2  these
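
The starter above leaves ordering by count to the reader. As a sketch only, assuming the words and counts have first been copied into a plain array of a hypothetical WordCount struct, an insertion sort puts them in least-to-most order without needing std::vector or std::sort:

```cpp
#include <cstddef>
#include <string>

struct WordCount {        // hypothetical pair of a word and its tally
    std::string word;
    std::size_t count;
};

// Insertion sort, ascending by count; equal counts keep their original order.
void sort_by_count(WordCount* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        WordCount key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1].count > key.count) {
            a[j] = a[j - 1];    // shift larger counts one slot right
            --j;
        }
        a[j] = key;
    }
}
```

Insertion sort is quadratic in the worst case, which is fine for a homework-sized word list but would be slow for a very large vocabulary.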

Duthomhas (13207)

@seeplus I don't think OP is at linked-list level yet. Or that this class is using high-level C++ constructs yet. Think:

  const unsigned MAX_NUM_WORDS = 5000;  // maximum number of word/int pairs
  char** words  = new char*[MAX_NUM_WORDS];  // array of (char array)
  int*   counts = new int  [MAX_NUM_WORDS];  // array of int
  unsigned num_words = 0;  // how many word/int pairs do I have?

  ...

  delete[] counts;  // delete array of int/count
  while (num_words --> 0) delete[] words[num_words];  // delete each word
  delete[] words;  // delete array of words
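
A sketch of the find-or-add step that parallel arrays like these would need. The function add_word and its parameters are hypothetical and not part of the snippet above; it assumes each word arrives as a NUL-terminated C string that has already been normalized.

```cpp
#include <cstring>

// Finds `word` in the first num_words slots or appends a copy of it, then
// bumps its count. Returns false when the arrays are already full.
bool add_word(char** words, int* counts, unsigned& num_words,
              unsigned max_words, const char* word) {
    for (unsigned i = 0; i < num_words; ++i) {
        if (std::strcmp(words[i], word) == 0) {
            ++counts[i];
            return true;
        }
    }
    if (num_words == max_words)
        return false;
    words[num_words] = new char[std::strlen(word) + 1];  // +1 for the '\0'
    std::strcpy(words[num_words], word);
    counts[num_words] = 1;
    ++num_words;
    return true;
}
```

Each stored word is its own new[] allocation, which is why the cleanup has to delete[] every words[i] before deleting the outer array.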

I am unsure how OP is expected to get text from the file. One character at a time would be easy here...

If wanted, the entire file can be loaded into memory in one go:

#include <fstream>
#include <sstream>
#include <string>

std::string read( const std::string& filename )
{
  std::ifstream fs( filename );
  std::stringstream ss;
  ss << fs.rdbuf();
  return ss.str();
}

The returned string can be treated as a 1-D array of the file's content.

  auto s = read( argv[1] );
  for (unsigned n = 0; n < s.size(); n++)
  {
    const unsigned char c = s[n];  // <cctype> functions need a non-negative value

    if (std::isprint( c )) std::cout << "'" << s[n] << "' ";
    else                   std::cout << "^" << char(c ^ 0x40) << "  ";  // caret notation: 0x01 prints as ^A

    if (std::isalnum( c )) std::cout << "is alpha-numeric.\n";
    else                   std::cout << "is neither a letter nor a digit.\n";
  }

But, you can certainly just read from file directly:

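  char c;
  while (f.get( c ))
  {
    const unsigned char uc = c;  // <cctype> functions need a non-negative value

    if (std::isprint( uc )) std::cout << "'" << c << "' ";
    else                    std::cout << "^" << char(uc ^ 0x40) << "  ";  // caret notation: 0x01 prints as ^A

    if (std::isalnum( uc )) std::cout << "is alpha-numeric.\n";
    else                    std::cout << "is neither a letter nor a digit.\n";
  }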

We will see, I guess.

jonnin (11437)

Yeah, who knows what he is allowed to do. This could be 5 lines or 500... and range from aggravating to simple depending on how a 'word' is defined.

I never liked the isdigit, isalpha, etc. tools. You invariably end up checking the same value multiple times with those things... I prefer a lookup table of 256 entries holding the desired results. That can even toupper / tolower WHILE determining what the character is, one op for all. If only Unicode would cooperate with that kind of processing :(
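
A sketch of what such a table might look like for single-byte characters. CharTable and classify are made-up names; the table is filled from the standard isalpha/tolower once, so the per-character work afterwards is a single array lookup.

```cpp
#include <cctype>

// 256-entry table: lower[c] is the lower-case letter for an alphabetic c
// and 0 for everything else, so one lookup both classifies and normalizes.
struct CharTable {
    char lower[256];
    CharTable() {
        for (int c = 0; c < 256; ++c)
            lower[c] = std::isalpha(c) ? static_cast<char>(std::tolower(c)) : 0;
    }
};

// Returns the lower-cased letter, or '\0' if ch is not a letter.
char classify(char ch) {
    static const CharTable table;   // built once, on first call
    return table.lower[static_cast<unsigned char>(ch)];
}
```

A caller can then write `if (char c = classify(ch)) word += c;` and get the test and the case conversion in one step.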

Duthomhas (13207)

Unicode is currently a nightmare in C++.
Fortunately, ICU exists on every platform that matters...

But I doubt that Unicode is anything OP needs to worry about.

I admit that just counting words looked fun when I first saw this topic, and I wrote this:

#include <cctype>
#include <cstdio>    // EOF
#include <iostream>

unsigned wc( std::istream& f, unsigned n = 0 )
{
  do if (f.peek() == EOF) break; while (!std::isalnum( f.get() ));
  do if (f.peek() == EOF) break; while ( std::isalnum( f.get() ));
  return f ? wc( f, n+1 ) : n;
}

Here's a driver if you wish:

#include <fstream>

template <typename T> T& lvalue( T&& arg ) { return arg; }

int main( int argc, char** argv )
{
  std::cin.sync_with_stdio( false );
  std::cin.tie( nullptr );

  unsigned sum = 0;
  if (argc > 1) while (argc --> 1) sum += wc( lvalue( std::ifstream( argv[argc] ) ) );
  else                             sum  = wc( std::cin );

  std::cout << sum << "\n";
}

// The file "wc.cpp" has 100 'words'.

But then OP indicated he wanted to histogram stuff, just using arrays... ugh.

Duthomhas (13207)

Tail-call recursion, baby!

ne555 (10692)

¿what's your issue?
¿are you able to extract all the words from input into an array?
¿are you able to extract a single word from input?
¿is the problem with expanding the array dynamically?
¿you do have the array but don't know how to count similar words?
¿you can count them but don't know how to sort them?
¿what's your issue?

> For example:
> Hello 0
> Hi 1
> There 2
> He 3
Suppose that your input was «He Hi He There He», I don't understand why your output has «Hello»
¿why did you count Hello?

foxmarine (4)

I need to print the word occurrences in order from least to most frequent. Right now it prints from most to least and I can't figure it out. Any help will be greatly appreciated.

foxmarine (4)

If anyone is on Discord, I'd like to join up with one of you so we can go over the issue there. Just reply with your Discord and I'll shoot you a DM.

Duthomhas (13207)

It sounds like you have things sorted already.

So when you print, instead of:

  for (int n = 0; n < N; n++)
  {
    std::cout << array[n] << ...
  }

use:

  for (int n = N; n --> 0; )
  {
    std::cout << array[n] << ...
  }

Hope this helps.

ne555 (10692)

> n --> 0
Yuck!

Duthomhas (13207)

You got a problem with the venerable "goes to" operator?
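
For anyone puzzled by the joke, a small sketch: there is no "goes to" operator. The compiler tokenizes `n --> 0` as `(n--) > 0`, comparing the old value of n with 0 and then decrementing it. The helper visit_order is a made-up name used only to record what the loop body sees.

```cpp
#include <string>

// `n --> 0` parses as `(n--) > 0`: test the old value, then decrement.
// Starting from N, the loop body therefore sees N-1, N-2, ..., 0.
std::string visit_order(int N) {
    std::string order;
    for (int n = N; n --> 0; )
        order += std::to_string(n) + ' ';
    return order;
}
```

That is why the reversed print loop above starts at N rather than N-1 and still never indexes past the end of the array.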
