So covid19 has struck, I have no excuses in the world not to code so I wrote a simple program to count the number of characters and number of words in a text file, not only that but I decided to find the most common word in that text file.
so I want some feedback, to me the code looks messy and complicated and also lacks modularity, so should I break some of the loops in the mostCommonWord function into possibly helper functions?
(also yes I know.... what happens if 2 or more words are as equally common, I have yet to implement that.)
Just at a glance, it seems like you're doing unnecessary work.
First, you're parsing the file more than once. Second, it seems like you convert things from and into stringstreams redundantly.
Also, from a readability standpoint, you have a vector<string> called "allWords" but this appears to be "lines" instead from main. But then within mostCommonWord, you have "allWords" and "words" as variable names, with "allWords" in main becoming "words" in mostCommonWord. A bit confusing.
Here's what I would do:
- use >> to get each word, delimited by whitespace
- parse out the commas or periods (your line 88)
- add to a map<string, int>, where int is the number of occurrences of that word.
- Then, after you're done iterating, I think you'll be able to leverage the standard library's max_element function to find the most common word instead.
Edit: You actually can also avoid using max_element if you keep track of the biggest {word + count} pair as you go.
so should I break some of the loops in the mostCommonWord function into possibly helper functions?
This is probably one of the biggest issues I see with the code that I'm working with. Functions get so long that you cannot figure out what the function is supposed to be doing without a lot of time poring over it.
So: If you are asking the question, then YES! break it into separate functions. With experience you will see when you are overdoing it, but for for your toy project, err on the side of breaking things into pieces that are too small.
I think this is a valid simplification of your code.
Note what Ganado said about using a map. It's a lot easier.
And of course, if you really need to sort a vector, there's std::sort.
I am with Ganado here. I think you can find a better way rather than try to break up what you have. Breaking it up may be a great exercise, though, just to say you did it and to practice the idea of small functions doing small things working together.
I would add 'upper or lower case the words' to his list.
// Find the most common word
#include <algorithm>
#include <cctype>
#include <ciso646>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
using std::string;
// All this stuff is to make our histogram case-INsensitive
string tolower( const string& s )
{
auto t{ s };
for (char& c : t) c = (char)std::tolower( c );
return t;
}
struct ci_string_equal_to
{
booloperator () ( const string& a, const string& b ) const
{
return tolower( a ) == tolower( b );
}
};
struct ci_string_hash : public std::hash <string>
{
booloperator () ( const string& s ) const
{
return std::hash <string> ::operator () ( tolower( s ) );
}
};
// The histogram
using histogram = std::unordered_map <string, std::size_t, ci_string_hash, ci_string_equal_to> ;
// Main program
int main()
{
histogram h;
string s;
while (getline( std::cin, s ))
{
// Replace ALL punctuation except ' (words like 'cause and ain't and trippin' are all valid)
for (char& c : s) if (!std::isalnum( c ) and (c != '\'')) c = ' ';
// Extract words
std::istringstream iss( s );
while (iss >> s)
{
// If word both begins and ends with ' then we can safely discount both apostrophes
if ((s.front() == '\'') and (s.back() == '\''))
{
h[ s.substr( 1, s.size() - 2 ) ] += 1;
continue;
}
// Otherwise we'll add both with and without leading and trailing ' to the histogram
if (s.front() == '\'') h[ s.substr( 1 ) ] += 1;
if (s.back() == '\'') h[ s.substr( 0, s.size() - 1 ) ] += 1;
// Count the word
h[ s ] += 1;
}
}
// Find the largest repeat count
std::size_t n = 0;
for (auto [_, count] : h)
n = std::max( n, count );
// Special case: no words
if (n == 0)
{
std::cout << "There are no words.\n";
return 0;
}
// Special case: no repeats
if (n == 1)
{
std::cout << "There are " << h.size() << " distinct words (with no repeats).\n";
return 0;
}
// Extract the word(s) with the largest repeat count
std::vector <string> most_common_words;
for (auto [word, count] : h)
if (count == n)
most_common_words.emplace_back( word );
// Display the most common word(s)
std::cout << "The most common word" << ((most_common_words.size() > 1) ? "s are:\n" : " is:\n");
for (auto word : most_common_words)
std::cout << " " << word << "\n";
}
great points, that is another flaw in my code words with the first letter capitalised are counted as separate words , this shouldn't be the case.
I've been told this on numerous occasions, to change my code to the newer versions of C++,
right now my code is very C++98 orientated, it's crazy how different C++11+ looks to C++98, it almost looks like a new language with the amount of added features, one may argue though that with all these additions the language may start to suffer from feature creep?
a little. a lot of C and C++98 and C++ pre-98 are there to allow you to compile old code without messing with it, or minimally. So the 'excess' old features are not intended to be used for new code, though people do because copy from internet which itself is over 30 now and has old info all over, + old books, + some schools are behind, etc.
If you stick to the newer stuff, you write a lot less code to do the same work. The downside is, there are a lot of tools that take a while to learn.
At the end of the day, the language is meant to make expressing your design/idea easier. The new stuff is doing a good job of that, at the cost of having to learn more things.
And it is absolutely right. Current C++ design goals (for the past N years now) have favored technical creep over end-user features. Hell, they’ve even gone out of their way to outlaw or obviate writing code that does common, useful things.