is there a built in method for detecting outliers ?

Pages: 12
@Garnado , " By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median"

I am not sure if my english helps me here. Right now I can calculate the MAD for a given N of readings ... does this sentence mean that if the point is 3 (or another scale i pick) times bigger or lower then the MAD, then it is an outlier ?

I am not sure the last part "(MAD) away from the median" means ?

@joinnin, also please feel free to share your opinion.
I read it as a distance, so pure >. that is distance from median to outlier >= 3*MAD.
I could be wrong; does the matlab docs have any math in its explain?

3* seems extreme to me, but you may as well start there. And as someone already said, you need to check your function against matlab and get the same answer over several data sets to be sure you have it right.
Last edited on
I didn't claim to know what it means either, I just quoted what MatLab said :)

I would calculate the median and then see points where | [value] - median | > 3 MAD, like jonnin said.
But I don't know what it means by "scaled" MAD. Is it referring to the "consistency constant" being 1.4826 if we assume normally distributed data? I don't know.
According to this website: https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/mad.htm
The scaled MAD is defined as

MADN = MAD/0.6745
which is the same as multiplying it by 1.48258, so this at least is consistent.

So, in total, it seems the formula is : if ( | [value] - median | > 3*1.4826*MAD) then value is outlier.

This site also talks about using MAD to find outliers: https://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

Again, to know if this is correct, the best way would be to run the outliers function in matlab and see what it does for particular test data (AKA Unit Tests), and see if you can get the C++ code to match it.
Last edited on
If you applied that formulation to a perfectly smooth sine curve then you would end up clipping the top and bottom horribly.

@HSafy, you aren't making clear if you are doing this smoothing AS you are gathering data or AFTERWARDS.

If you are doing in as you are gathering data then you won't know what the median is, and you can really only do something based on the last few points read.

If you are doing it afterwards then you could always apply a low-pass filter (FFT; remove high-frequency contribution; inverse FFT).

Did you actually have this working in MatLab (in which case, why not show your script) or was that hypothetical?
its about gathering data as he goes.
another way to handle this (sorry for the brainstorm approach) is to test the heck out of it and gather your data up front, get your statistics on your testing, and check new values against your baseline. This also has flaws, but its yet another thing you can consider.
Last edited on
@jonnin, @ganado, Thanks for the further explaination about the equation and the provided links . I will try this procedure of comparing my function to the one in Matlab. Sounds perfect !

@lastchance, I am doing it while I am gathering the data yes and I know i will be only depending on the history of the data but its better than just leaving the noise in there ?
I did the outlier detection once in Matlab after collecting data not during gathering the data. Moreover, that was done like 2 years ago in my university project and I am not even sure I still have the script :).
HSafy wrote:
I did the outlier detection once in Matlab after collecting data not during gathering the data.
I think they would be extremely different things. You can't know the median whilst your data gathering is still going on.

Well, here's a first attempt for real-time processing. Only does anything worthwhile for isolated "noisy" points, and you would have to define the tolerance for "noisy". An alternative might be to extrapolate from previous points in order to define the "target" replacement value.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#include <iostream>
#include <fstream>
#include <algorithm>
#include <cmath>
using namespace std;

const int mid = 1, bufsize = 2 * mid + 1;     // de-noise mid, in a buffer of length bufsize


void denoise( double buffer[] )               // put whatever de-noising you like here
{
   const double TOLERANCE = 0.5;                                            // decide a tolerable variation
   double target = 0.5 * ( buffer[mid-1] + buffer[mid+1] );                 // average of points either side
   if ( abs( buffer[mid] - target ) > TOLERANCE ) buffer[mid] = target;     // if "noise" then replace
}


int main()
{
   int pos = -1;
   
   double buffer[bufsize] = { 0.0 };
   
   ifstream in ( "data.in"  );
   ofstream out( "data.out" );
   
   double value;
   while ( in >> value )
   {
      pos++;
   
      // Enter value into buffer
      if ( pos >= bufsize )         
      {
         rotate( buffer, buffer + 1, buffer + bufsize );
         buffer[bufsize-1] = value; 
      }
      else
      {
         buffer[pos] = value;
      }
   
      // If buffer is full then denoise and output first point in buffer
      if ( pos >= bufsize - 1 ) 
      {
         denoise( buffer );
         out << pos - ( bufsize - 1 ) << ' ' << buffer[0] << '\n';
      }
   }

   // Dump the rest of the buffer
   for ( int i = 1; i < bufsize; i++ ) out << pos - ( bufsize - 1 ) + i << ' ' << buffer[i] << '\n';
}



Slightly noisy:
https://imgur.com/pslAXD4

De-noised
https://imgur.com/t4cbeqJ
Last edited on
@lastchance, Thank you for the code ! I will compare it with the one I am trying to develop now.

I developed a program following the Matlab formula. However, I believe it is really slow. Because I am calculating the median twice for 6 arrays. I am still working on that to be honest using std::nth_element().

Moreover, I am facing the MAD = 0 problem. For an N array having the same values the MAD will be zero.. so the formula will fail actually. So I decided to add a simple if condition that MAD should not be equal to zero

Then I have noticed that comparing a number to zero is tricky. For example if my MAD is 0.000001 ... it may be considered as zero or ?

@jonnin, @garnado, please feel free to advise.
Last edited on
may want to show any slow code? Don't calculate the same data twice, of course...
Be sure you compiled in release mode. Also don't worry about speed until you have right answer. Otherwise you have to tune it again after fixing.

zero mad is valid, if useless (makes nearly everything an outlier?). what fails *exactly*? do you divide by zero? That is an issue anyway ... a small MAD would blow up also. Compare like this you can set up a tolerance yourself... if abs value >0.001 its not zero type thing. This depends on your data where you want to set it, or generally machine tolerances if you just want to keep it under control.

Hard to say anything without any code. And I can't follow links.
@ jonnin , sure.

I am not sure if I said already but to give you the brief. I am collecting data from gyroscope and accelerometer in x,y,z so that gives me 6 streams of data to filter. I have a class for the " FilterNoise" where DataHistoryForOutliers is a structure that has 6 eigen vectors containing the last s_nReadings (10 in my case) -- maybe it is a big window i dont know . and MedianAbsoluteDeviation is a structure consists of 6 doubles

1
2
3
4
5
6
7
8
9
10

void FilterNoise::noiseFilterProcess(const int& count, DataHistoryForOutliers& dataHistory, Eigen::Vector3d& acclData, Eigen::Vector3d& gyroData)
{
    MedianAbsoluteDeviation signalMedian;
    saveDataToArrays(count, dataHistory, acclData, gyroData);
    signalMedian = calculateMedian(dataHistory);
    calculateMAD(signalMedian, signalMAD);
    checkOutliers(signalMedian, signalMAD, acclData, gyroData);
    replaceOutliers(signalOutliersFlags, dataHistory, acclData, gyroData);
}


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
void FilterNoise::saveDataToArrays(const int& count, DataHistoryForOutliers& dataHistory, Eigen::Vector3d& acclData, Eigen::Vector3d& gyroData)
{
    if (count < s_nReadings) {
        dataHistory.accX.conservativeResize(count + 1);
        dataHistory.accY.conservativeResize(count + 1);
        dataHistory.accZ.conservativeResize(count + 1);
        dataHistory.gyroX.conservativeResize(count + 1);
        dataHistory.gyroY.conservativeResize(count + 1);
        dataHistory.gyroZ.conservativeResize(count + 1);

        saveData(count, dataHistory, acclData, gyroData);
    } else {
        windowMove(dataHistory.accX);
        windowMove(dataHistory.accY);
        windowMove(dataHistory.accZ);
        windowMove(dataHistory.gyroX);
        windowMove(dataHistory.gyroY);
        windowMove(dataHistory.gyroZ);
        saveData(s_nReadings - 1, dataHistory, acclData, gyroData);
    }
}


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void FilterNoise::saveData(const int& count, DataHistoryForOutliers& dataHistory, Eigen::Vector3d& acclData, Eigen::Vector3d& gyroData)

{
    dataHistory.accX(count)  = acclData((thetaXIdx));
    dataHistory.accY(count)  = acclData((thetaYIdx));
    dataHistory.accZ(count)  = acclData((thetaZIdx));
    dataHistory.gyroX(count) = gyroData((thetaXIdx));
    dataHistory.gyroY(count) = gyroData((thetaYIdx));
    dataHistory.gyroZ(count) = gyroData((thetaZIdx));
}

void FilterNoise::windowMove(Eigen::VectorXd& dataVector)
{
    for (int i = 1; i < dataVector.size(); ++i) {
        dataVector(i - 1) = dataVector(i);
    }
}


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
MedianAbsoluteDeviation FilterNoise::calculateMedian(const DataHistoryForOutliers& signal)
{
    MedianAbsoluteDeviation medTemp;
    DataHistoryForOutliers  historyTemp = signal;

    calculateMedianSTD(historyTemp.accX, medTemp.accX);
    calculateMedianSTD(historyTemp.accY, medTemp.accY);
    calculateMedianSTD(historyTemp.accZ, medTemp.accZ);
    calculateMedianSTD(historyTemp.gyroX, medTemp.gyroX);
    calculateMedianSTD(historyTemp.gyroY, medTemp.gyroY);
    calculateMedianSTD(historyTemp.gyroZ, medTemp.gyroZ);

    return medTemp;

void FilterNoise::calculateMedianSTD(Eigen::VectorXd& dataValues, double& medianValue)
{
    std::vector<double> dataValuesSTD(dataValues.data(), dataValues.data() + dataValues.size());

    std::nth_element(dataValuesSTD.begin(), dataValuesSTD.begin() + dataValuesSTD.size() / 2, dataValuesSTD.end());
    medianValue = dataValuesSTD[dataValuesSTD.size() / 2];
}


}



1
2
3
4
5
6
7
8
9
10
11
12
13
14
void FilterNoise::calculateMAD(const MedianAbsoluteDeviation signalMedian, MedianAbsoluteDeviation& signalMAD)
{
    DataHistoryForOutliers temp = dataHistory;

    for (int i = 0; i < temp.accX.size(); ++i) {
        temp.accX[i]  = abs(temp.accX[i] - signalMedian.accX);
        temp.accY[i]  = abs(temp.accY[i] - signalMedian.accY);
        temp.accZ[i]  = abs(temp.accZ[i] - signalMedian.accZ);
        temp.gyroX[i] = abs(temp.gyroX[i] - signalMedian.gyroX);
        temp.gyroY[i] = abs(temp.gyroY[i] - signalMedian.gyroY);
        temp.gyroZ[i] = abs(temp.gyroZ[i] - signalMedian.gyroZ);
    }
    signalMAD = calculateMedian(temp);
}


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
void FilterNoise::checkOutliers(const MedianAbsoluteDeviation& signalMedian,
                                const MedianAbsoluteDeviation& signalMad,
                                Eigen::Vector3d                acclData,
                                Eigen::Vector3d                gyroData)
{
    signalOutliersFlags.accX  = checkOutlier(acclData(thetaXIdx), signalMedian.accX, signalMad.accX);
    signalOutliersFlags.accY  = checkOutlier(acclData(thetaYIdx), signalMedian.accY, signalMad.accY);
    signalOutliersFlags.accZ  = checkOutlier(acclData(thetaZIdx), signalMedian.accZ, signalMad.accZ);
    signalOutliersFlags.gyroX = checkOutlier(gyroData(thetaXIdx), signalMedian.gyroX, signalMad.gyroX);
    signalOutliersFlags.gyroY = checkOutlier(gyroData(thetaYIdx), signalMedian.gyroY, signalMad.gyroY);
    signalOutliersFlags.gyroZ = checkOutlier(gyroData(thetaZIdx), signalMedian.gyroZ, signalMad.gyroZ);
}

bool FilterNoise::checkOutlier(double dataValue, double signalMedian, double Mad)
{
    if (abs(dataValue - signalMedian) < s_madFactor * s_sigmaFactor * Mad || Mad == s_zeroMad) {
        return false;
    } else {
        return true;
    }
}

void FilterNoise::replaceOutliers(OutliersFlags          signalOutliersFlags,
                                  DataHistoryForOutliers &dataHistory,
                                  Eigen::Vector3d&       acclData,
                                  Eigen::Vector3d&       gyroData)
{
    // Replacing the Outliers with the previous non-outlier value
    if (signalOutliersFlags.accX) {
        acclData(thetaXIdx)                           = dataHistory.accX(dataHistory.accX.rows() - 2);
        dataHistory.accX(dataHistory.accX.rows() - 1) = dataHistory.accX(dataHistory.accX.rows() - 2);
    }
    if (signalOutliersFlags.accY) {
        acclData(thetaYIdx)                           = dataHistory.accY(dataHistory.accY.rows() - 2);
        dataHistory.accY(dataHistory.accY.rows() - 1) = dataHistory.accY(dataHistory.accX.rows() - 2);
    }
    if (signalOutliersFlags.accZ) {
        acclData(thetaZIdx)                           = dataHistory.accZ(dataHistory.accZ.rows() - 2);
        dataHistory.accZ(dataHistory.accZ.rows() - 1) = dataHistory.accZ(dataHistory.accX.rows() - 2);
    }
    if (signalOutliersFlags.gyroX) {
        gyroData(thetaXIdx)                             = dataHistory.gyroX(dataHistory.gyroX.rows() - 2);
        dataHistory.gyroX(dataHistory.gyroX.rows() - 1) = dataHistory.gyroX(dataHistory.accX.rows() - 2);
    }
    if (signalOutliersFlags.gyroY) {
        gyroData(thetaYIdx)                             = dataHistory.gyroY(dataHistory.gyroY.rows() - 2);
        dataHistory.gyroY(dataHistory.gyroY.rows() - 1) = dataHistory.gyroY(dataHistory.accX.rows() - 2);
    }
    if (signalOutliersFlags.gyroZ) {
        gyroData(thetaZIdx)                             = dataHistory.gyroZ(dataHistory.gyroZ.rows() - 2);
        dataHistory.gyroZ(dataHistory.gyroZ.rows() - 1) = dataHistory.gyroZ(dataHistory.accX.rows() - 2);
    }   

}
 


Last edited on
Have you considered making dataHistory a collection of circular arrays so overwriting old data can be done by incrementing a "current" index rather than shifting all of the elements in the array in moveWindow?

This would also allow you to keep sliding totals for dataHistory by subtracting the value being overwritten and adding the new value. This would be very helpful if you ever need to calculate the mean. I don't know if you will need these calculations, but if you do...

Last edited on
dataHistory.accX.conservativeResize(count + 1);
dataHistory.accY.conservativeResize(count + 1);
dataHistory.accZ.conservativeResize(count + 1);
dataHistory.gyroX.conservativeResize(count + 1);
dataHistory.gyroY.conservativeResize(count + 1);
dataHistory.gyroZ.conservativeResize(count + 1);

what does this do... if it reallocates memory and copies everything, that needs an immediate fix.
@jonnin,

I saw that too, then realized it only happens 10 times.

I agree that it should be fixed, but its impact is only at initial startup.
ah, OK. You know me, doing a drive-by instead of studying it deeply ...
I can look deeper later after work. But I don't see anything else 'high risk' of slowness.
little things... reverse condition on outlier test so if its zero it does not do all that junk up front.
abs is slow on some compilers when compared to a raw logic operation. you do a LOT of abs, and that may be worth getting ugly if you find you need the speed. Its not portable nor a good idea unless you can show you spend a % of time in abs. Profile it first for that idea.

circ buffer seems like a good idea.
the real depth stuff... what computations can you save from 'last time' ...
eg if your median is 1+2+3+4+5 (where 1-5 is really data point #s) and then you call it again with 1+2+3+4+5+6... use a static variable and add the new term to the old sum... things 'like' that where you can keep some of the work from previous iterations can give gigantic lift.

whats windowmove do?
Last edited on
whats windowmove do?


It's in the 3rd code snippet. It shifts each element in the array down 1 spot. That's why I recommended a circular array.
ok, I stopped and really read the code all the way this time.

3 changes I would consider
1) change winmove etc to use circular buffer. Doug4 has nailed that, its a big deal here.
2) keep a running total of the circ buffer:
nextloc = (currentloc +1)%arraysize
totalsum -= buffer[nextloc];
buffer[nextloc] = newestvalue;
totalsum += newestvalue;
if that is useful for anything (I don't see you using an average anywhere, I just assumed you would need it at some point? if not its an example of the idea).

is there a way to keep the median current with little to no work on insert here, similar to how I am keeping the total up to date? So you don't have to iterate all the values, just update what you had vs the new one?!

3) stop fooling with the code at all until the buffer is populated. Consider the first N sensor reads as simply calibration and startup time cost.
Last edited on
@jonnin,doug4

Thank you for your comments. I was busy with other uni projects. I will apply what you said and come back to you in case of questions. Thanks again.

@jonnin, No I did not develop a way to include the current median in calculating the one after to be honest. I calculate the the median independently each step. But for sure your idea is great ! Moreover, I am not sure If I follow you in "stop fooling with the code" .. you mean I should not do any operations at all till I get my first N readings or ?

and yes for now and I am not using the mean average as a replacement, I am using the previous non-outlier value,
Last edited on
Right, don't do anything until you fill the buffer is one approach. It does not work for all systems, but yours, it seems reasonable, its a fraction of a second delay and non-critical (no one will die if you don't react to input) system.
Topic archived. No new replies allowed.
Pages: 12