While I was traveling I used my GoPro to take pictures and videos. After importing, the files were synced automatically to my Google Drive. Once synced, I removed them from my notebook and from the camera. To delete the files from the camera I used the GoPro Quik software. Its "Delete All" function didn't work properly, so I reimported the same files over and over - and Google's Backup and Sync tool didn't care, creating copies of the same files each time.

The only difference between the copies was the creation date attribute. In the end I had many duplicates which were using up valuable storage. I didn't find a free tool to clean up the mess in a timely manner, so I decided to do it programmatically. Be aware that this is a straightforward solution to my particular use case and is not optimized for speed. You can find the repo on GitHub.

Access Google Drive API

There's a pretty good Quickstart guide which explains how to access your files using C#/.NET. If you want to delete files as well, you'll need to request the appropriate OAuth 2 scopes when initializing the connection. The Drive scope grants the permission to delete files:

// Summary:
//     See, edit, create, and delete all of your Google Drive files
public static string Drive;

static readonly string[] Scopes = { DriveService.Scope.Drive, DriveService.Scope.DriveMetadata };

Querying & Grouping

I'm not aware of any API method which would remove duplicates directly. So the basic idea is to obtain a list of all files and then group the duplicates together. The maximum page size is 1000, and I'm only interested in the specified MIME types. Execute the query until you have a complete list of your files:

  var files = new List<Google.Apis.Drive.v3.Data.File>();
  Google.Apis.Drive.v3.Data.FileList fileList;

  FilesResource.ListRequest listRequest = service.Files.List();
  listRequest.PageSize = 1000;
  listRequest.Fields = "*";
  listRequest.Q = "mimeType = 'image/png' or mimeType = 'video/mp4'";
  do {
    fileList = listRequest.Execute();
    files = files.Concat(fileList.Files).ToList();
    listRequest.PageToken = fileList.NextPageToken; // advance, or the loop never ends
  } while (fileList.NextPageToken != null);
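The do/while pattern above - keep requesting pages until no NextPageToken comes back - can be exercised without credentials against an in-memory page source. The FakePage and FakeClient types here are hypothetical stand-ins for the Drive client, not part of the API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Google.Apis.Drive.v3.Data.FileList
class FakePage {
    public List<string> Files = new List<string>();
    public string NextPageToken;
}

// Hypothetical stand-in for the Drive client: returns one prepared page per call
class FakeClient {
    private readonly List<FakePage> pages;
    public FakeClient(List<FakePage> pages) { this.pages = pages; }
    public FakePage Execute(string pageToken) {
        int index = pageToken == null ? 0 : int.Parse(pageToken);
        return pages[index];
    }
}

class Program {
    static void Main() {
        var client = new FakeClient(new List<FakePage> {
            new FakePage { Files = new List<string> { "a.png", "b.mp4" }, NextPageToken = "1" },
            new FakePage { Files = new List<string> { "c.png" }, NextPageToken = "2" },
            new FakePage { Files = new List<string> { "d.mp4" }, NextPageToken = null },
        });

        var files = new List<string>();
        string pageToken = null;
        FakePage page;
        do {
            page = client.Execute(pageToken);
            files = files.Concat(page.Files).ToList();
            pageToken = page.NextPageToken; // advance, or the loop never ends
        } while (page.NextPageToken != null);

        Console.WriteLine(files.Count); // 4 files collected across three pages
    }
}
```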

Then group the files by their name and size:

var fileGroups = files
    .GroupBy(g => new { g.Name, g.Size })
    .Where(c => c.Count() > 1);
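As a quick sanity check, the same GroupBy runs on an in-memory sample - the names and sizes below are made up. Anonymous-type keys compare by value, which is what makes grouping on name plus size work:

```csharp
using System;
using System.Linq;

class Program {
    static void Main() {
        // made-up sample mimicking Drive file metadata (Name, Size)
        var files = new[] {
            new { Name = "GOPR0001.mp4", Size = (long?)1048576 },
            new { Name = "GOPR0001.mp4", Size = (long?)1048576 }, // true duplicate
            new { Name = "GOPR0001.mp4", Size = (long?)2048 },    // same name, different size
            new { Name = "GOPR0002.png", Size = (long?)512 },
        };

        var fileGroups = files
            .GroupBy(g => new { g.Name, g.Size })
            .Where(c => c.Count() > 1)
            .ToList();

        Console.WriteLine(fileGroups.Count);      // 1 group of true duplicates
        Console.WriteLine(fileGroups[0].Count()); // containing 2 files
    }
}
```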

Identify all duplicates

With all groups available, it's straightforward to identify the duplicates. Go through each group, identify the original file, and collect all others for deletion:

var duplicates = new List<Google.Apis.Drive.v3.Data.File>();

foreach (var group in fileGroups) {
    // the oldest file in each group is treated as the original
    var origin = group.OrderBy(o => o.CreatedTimeRaw).FirstOrDefault();

    foreach (var file in group) {
        string indicatorSign = "+";
        if (file.Id != origin.Id) {
            indicatorSign = "-";
            duplicates.Add(file);
        }
        Console.WriteLine($"{indicatorSign} {file.Name} ({file.CreatedTimeRaw})");
    }
}

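One caveat: CreatedTimeRaw is an RFC 3339 string, so OrderBy sorts it lexicographically. That happens to work when all timestamps share the same UTC format, but parsing to DateTime is safer. A small sketch with made-up timestamps:

```csharp
using System;
using System.Globalization;
using System.Linq;

class Program {
    static void Main() {
        // made-up RFC 3339 timestamps, in the shape the Drive API returns them
        var createdTimes = new[] {
            "2019-08-03T10:15:30.000Z",
            "2019-08-01T09:00:00.000Z", // the original upload
            "2019-08-02T18:45:12.000Z",
        };

        // parse before ordering instead of relying on string comparison
        var origin = createdTimes
            .OrderBy(t => DateTime.Parse(t, CultureInfo.InvariantCulture,
                                         DateTimeStyles.AdjustToUniversal))
            .First();

        Console.WriteLine(origin); // 2019-08-01T09:00:00.000Z
    }
}
```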

All duplicates are identified and stored in a list - let's delete them:

long? total = 0;
foreach (var duplicate in duplicates) {
    total += duplicate.Size;
    service.Files.Delete(duplicate.Id).Execute();
}


When you first execute the program, you'll need to grant permissions according to the submitted OAuth scopes.

While executing you'll get some feedback on the console, which is also written to a simple log file. At the end, the number of deleted bytes is displayed as well - in my case nearly 60 gigabytes.
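The raw byte count can be turned into something readable before printing. This small helper is my own addition, not part of the original tool; it converts to gibibytes (1 GiB = 1024^3 bytes):

```csharp
using System;
using System.Globalization;

class Program {
    // Convert a raw byte count (Drive reports Size as long?) into gibibytes
    static double ToGigabytes(long? bytes) =>
        (bytes ?? 0) / (1024.0 * 1024.0 * 1024.0);

    static void Main() {
        long? total = 64_424_509_440; // made-up total: exactly 60 GiB of duplicates
        string report = string.Format(CultureInfo.InvariantCulture,
            "Deleted {0:F1} GB", ToGigabytes(total));
        Console.WriteLine(report); // prints "Deleted 60.0 GB"
    }
}
```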