Finding duplicates in a single TEXT file

Hi,

I'm stuck and hope you guys can help me out.

I wrote a program to analyse a file. Basically, it opens FileA, finds the wanted data with a while loop until EOF, and writes it to FileB. Now I have a problem finding duplicates within that single output file, because as far as I know a file can only have one fread pointer. Are there other ways to find the duplicates? If anyone is free to help me solve this, I can send the whole output file.

Thanks

This is part of the output file:

01 19 2D
01 1D 2C
01 1C 2D
01 1E 2C
01 1D 2D
01 1F 2C
01 1E 2D
01 1B 2D
01 1A 2E
01 1A 2D
01 19 2E
01 18 2D
01 17 2E
01 17 2D
01 16 2E
01 15 2D
01 14 2E
01 13 2D
01 12 2E
01 11 2D
01 10 2E
01 0F 2D
01 0E 2E
01 0D 2D
01 0C 2E
01 0B 2D
01 0A 2E
01 09 2D
01 08 2E
01 08 2D
01 07 2E
01 09 2E
01 08 2F
01 0B 2E
01 0A 2F
01 0D 2E
01 0C 2F
01 0F 2E
01 0E 2F
01 11 2E
01 10 2F
01 13 2E
01 12 2F
01 15 2E
01 14 2F
01 18 2E
01 17 2F
01 1B 2E
01 1A 2F
01 1C 2E
01 1B 2F
01 1D 2E
01 1C 2F
01 1E 2E
02 1D 2F
01 19 2F
01 18 30
49 18 2F
04 17 30
01 16 2F
02 15 30
01 15 2F
01 14 30
01 13 2F
01 12 30
01 11 2F
01 10 30
01 0F 2F
01 0E 30
01 0D 2F
36 0C 30
01 0B 2F
01 0A 30
01 09 2F
01 08 30
01 07 2F
01 09 30
01 0B 30
01 0A 31
01 0D 30
01 0C 31
01 0F 30
01 0E 31
01 11 30
01 10 31
01 13 30
01 12 31
01 16 30
01 15 31
01 19 30
01 18 31
01 1A 30
01 19 31
01 1B 30
01 1A 31
01 1C 30
01 1B 31
01 17 31
01 16 32
01 16 31
01 15 32
01 14 31
01 13 32
01 13 31
01 12 32
01 11 31
01 10 32
01 0F 31
01 0E 32
01 0D 31
01 0C 32
01 0B 31
01 0A 32
01 09 31
01 0B 32
01 0D 32
01 0C 33
01 0F 32
01 0E 33
01 11 32
01 10 33
01 14 32
01 13 33
01 17 32
01 16 33
03 18 32
49 17 33
01 19 32
01 18 33
02 1A 32
01 19 33
01 15 33
01 14 34
81 14 33
01 13 34
01 12 33
01 11 34
01 11 33
01 10 34
01 0F 33
01 0E 34
01 0D 33
01 0B 33
01 0D 34
01 0F 34
01 12 34
01 11 35
01 15 34
01 14 35
03 16 34
01 17 34
01 13 35
01 12 35
01 10 35
01 10 00
01 14 00
1C 13 02
8C 0D 02
01 0D 01
01 18 02
01 19 02
49 1A 03
01 19 03
01 0A 03
19 0B 05
01 10 04
1C 1C 05
1C 1B 06
01 13 05
01 11 05
01 07 06
8C 08 0B
01 0A 0A
01 0B 0B
8C 04 0C
1A 1F 0D
01 20 0D
01 20 0F
01 21 0F
01 1D 10
01 1C 0F
01 09 10
1C 15 11
01 19 10
01 21 10
01 04 11
01 0C 12
88 02 14
01 01 14
8C 02 17
01 23 16
01 22 18
01 02 18
8C 03 19
8C 01 1B
01 04 1A
01 1B 1B
01 12 1B
49 0F 1B
65 0E 1C
8A 03 1C
01 07 1C
1C 14 1D
01 17 1C
01 1E 1E
8C 17 1E
01 15 1D
01 15 1F
01 19 1E
01 1B 1E
01 1B 1F
01 20 22
01 15 21
01 1D 22
01 21 23
01 1D 23
88 04 24
8A 04 23
01 10 25
11 05 26
01 03 28
01 13 28
81 09 29
19 07 29
01 04 2A
8C 1C 2B
01 0C 2C
19 05 2B
02 1D 2F
01 18 2F
01 15 30
01 17 33
02 1A 32
01 14 33

cjmalloy (15)

I don't get what you mean by one fread pointer.
I would use fgetc for this.

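#include <cstring>
#include <cstdio>

#define LINE_LEN 8   // each record looks like "01 19 2D": 8 characters

// Reads the next LINE_LEN characters from f into c, skipping line breaks.
// Returns false if the end of the file is reached first.
bool read8(FILE* f, char * c)
{
  int i;
  for (i=0;i<LINE_LEN;i++)
  {
    int ch;  // int, not char, so that EOF can be told apart from real data
    do
    {
      ch = fgetc(f);
      if (ch==EOF)
        return false;
    } while (ch=='\n'||ch=='\r');
    c[i] = (char)ch;
  }
  c[LINE_LEN] = '\0';
  return true;
}

// Returns true if the record c is already present in f.
// Always scans from the start of f and leaves f rewound.
bool duplicate(FILE* f, char * c)
{
  char dup[LINE_LEN+1];
  bool found = false;

  rewind(f);
  while (!found && read8(f, dup))
  {
    if (strcmp(c, dup)==0)
      found = true;
  }

  rewind(f);  // also clears the EOF flag
  return found;
}

int main()
{
  FILE * pFileIn, *pFileOut;
  char buff[LINE_LEN+1];

  pFileIn  = fopen("in.txt", "r");
  if (pFileIn==NULL)
  {
    perror ("Error opening input file");
    return 1;
  }

  pFileOut = fopen("out.txt", "w+");  // w+ so the output can be read back
  if (pFileOut==NULL)
  {
    perror ("Error opening output file");
    fclose(pFileIn);
    return 1;
  }

  while (read8(pFileIn, buff))
  {
    if (! duplicate(pFileOut, buff))
    {
      // a seek is required when switching between reading and writing
      fseek(pFileOut, 0, SEEK_END);
      fprintf(pFileOut, "%s\n", buff);
    }
  }
  fclose(pFileIn);
  fclose(pFileOut);
  return 0;
}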

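I think this way is the most memory-efficient, since it never holds more than one record in memory. It does rescan the output file for every input record, though, so you should be able to find a faster way.
I would suggest doing this in C++; your code would probably be way shorter.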
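
For comparison, here is a rough sketch of what a C++ version might look like. It assumes all distinct lines fit in memory, so it trades the memory efficiency of the code above for speed. The function name copy_unique_lines is made up for illustration, and skipping blank lines is a choice made here, not something the original program is known to do.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Copies each distinct line from `in` to `out`, keeping the first occurrence
// and skipping later repeats. Returns the number of repeated lines skipped.
// Assumption: all distinct lines fit in memory.
std::size_t copy_unique_lines(std::istream& in, std::ostream& out)
{
    std::set<std::string> seen;
    std::size_t skipped = 0;
    std::string line;
    while (std::getline(in, line))
    {
        // Tolerate Windows line endings that were not translated on input.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        if (line.empty())
            continue;                      // ignore blank lines
        if (seen.insert(line).second)      // true only the first time a line is seen
            out << line << '\n';
        else
            ++skipped;
    }
    return skipped;
}
```

To use it on real files, pass a std::ifstream opened on in.txt and a std::ofstream opened on out.txt (both from <fstream>).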

enreval (3)

Hi,

What I meant was that I can't read the start and the end of a text file at the same time, right?

Anyway, do you know how to read the last line of a text file? I used SEEK_END, but I just couldn't read the last line.

Zaita (2770)

Your best bet is to read the file into memory, then work on it in memory.

Create a vector<string> vFileList; and load each line into it with vFileList.push_back(sCurrentLine);. Once you've loaded the whole file, you can work on it from the vector with no problems.
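
A minimal sketch of that idea, assuming the file is small enough to hold in memory. The helper names load_lines and duplicate_indices are invented for this example, and the duplicate search is a plain quadratic scan chosen for clarity, not speed.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads every line of `in` into a vector, one element per line.
std::vector<std::string> load_lines(std::istream& in)
{
    std::vector<std::string> vFileList;
    std::string sCurrentLine;
    while (std::getline(in, sCurrentLine))
        vFileList.push_back(sCurrentLine);
    return vFileList;
}

// Returns the indices of lines that repeat an earlier line.
// Simple O(n^2) comparison of each line against all lines before it.
std::vector<std::size_t> duplicate_indices(const std::vector<std::string>& lines)
{
    std::vector<std::size_t> dups;
    for (std::size_t i = 1; i < lines.size(); ++i)
    {
        for (std::size_t j = 0; j < i; ++j)
        {
            if (lines[i] == lines[j])
            {
                dups.push_back(i);
                break;                 // one match is enough
            }
        }
    }
    return dups;
}
```

Because the whole file sits in the vector, any two lines can be compared directly, which sidesteps the question of having only one read position in the file.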

Edit: If you just want to find duplicate lines the fastest way possible, use a map. As you load each line, increment its value in the map by 1. Any line whose count ends up greater than 1 is a duplicate.
map<string, int> mLinesCounter; :)
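
The counting idea could be sketched roughly like this. The function names count_lines and print_duplicates and the " x3" output format are assumptions made for the example.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Counts how many times each line occurs in `in`.
std::map<std::string, int> count_lines(std::istream& in)
{
    std::map<std::string, int> mLinesCounter;
    std::string sCurrentLine;
    while (std::getline(in, sCurrentLine))
        ++mLinesCounter[sCurrentLine];   // operator[] starts a new key at 0
    return mLinesCounter;
}

// Writes every line that occurred more than once, followed by its count.
void print_duplicates(const std::map<std::string, int>& counts, std::ostream& out)
{
    for (const auto& entry : counts)
        if (entry.second > 1)
            out << entry.first << " x" << entry.second << '\n';
}
```

In the sample output above, for instance, the line 02 1D 2F appears twice, so it would be reported with a count of 2. Each lookup in a std::map is logarithmic in the number of distinct lines, so the whole pass is far cheaper than rescanning the file for every line.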

cjmalloy (15)

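I think SEEK_END with an offset of 0 puts you at the very end of the file, past the last line, so there is nothing left to read.
If your file is not Unicode, I'm guessing it would be fseek(file, -8, SEEK_END) for an 8-character line, or a byte or two further back if the file ends with a line break.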
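
A fixed offset only works when the exact length of the last line and its line ending are known. A more forgiving sketch, assuming a seekable stream opened in binary mode so that offsets are plain byte counts, walks backwards from the end until it finds the previous line break. The function name last_line is invented for this example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Returns the last non-empty line of a seekable stream, or "" if there is none.
// Assumption: the stream is in binary mode, so offsets are byte counts.
std::string last_line(std::istream& in)
{
    in.clear();                          // drop any earlier EOF state
    in.seekg(0, std::ios::end);
    std::streamoff pos = in.tellg();     // size in bytes, or -1 on failure
    std::string line;
    while (pos > 0)
    {
        --pos;
        in.seekg(pos, std::ios::beg);
        char ch = static_cast<char>(in.get());
        if (ch == '\n' || ch == '\r')
        {
            if (!line.empty())
                break;                   // reached the end of the previous line
            continue;                    // still skipping trailing line breaks
        }
        line.insert(line.begin(), ch);   // characters arrive in reverse order
    }
    return line;
}
```

With a real file, this would be called on a std::ifstream opened with std::ios::binary. Reading one byte per seek is slow for long lines, but for 8-character records like the ones in this thread that hardly matters.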

firedraco (6249)

+1 to Zaita's solution, nice way of doing it.
