Wednesday, June 24, 2009

VB Macro for deleting or detecting duplicate Outlook Mails

There are so many times that one cant help but wonder at how in spite of all the advances in comfort etc etc, life seems a lot harder than ever... sighhh.

For example, all the haphazard shifting around at office over the last 3 years has resulted in me accumulating a mountain of disorganised backups. Most notable are the outlook backups in PST formats (what are .pst files? They are Outlooks default backup file format extension). I was getting fed-up of helplessly grinding my teeth because of all the duplicate emails in different folders and different PSTs...
..especially when the solution was right in front of my eyes - A simple VB macro!

Finally I took time out to get it done and with a very polished form too as can be seen below screnshots.
To make it easy to carry around, I've stored it as an Excel file.

Download my file or scroll down further to see code to customize.

Excel file containing the macro.

The launched macro!

App in action!
Functionality
The Macro identifies any duplicate items in 2 outlook folders (even those in different PST files) and optionally marks those items.

It does this by simply comparing all the mails (or any Outlook item) in the 2 locations specified and marks the duplicate mails (in both the folders) with a tag of your choice on the Categories attribute of that email. So it only marks duplicate mails and does not do the actual deleting (those who want to automate this step can easily add an extra line in the code to delete instead of mark, but I found it too risky). An option is available to disable even the marking so that you only get the duplicate count in the summary.

Inputs
The actual program inputs are the names of the Outlook PST file and the mail folder of the 2 locations. Of course, if you want to compare 2 folders in the same PST, just provide the same PST name in both sets and choose the different folders.
The other input is the name of the label you want to apply against the Categories attribute for duplicate e-mails.
Finally, there is an option to enable marking of duplicate items.

Steps
Ensure Outlook is running before launching the macro or it will not be able to get the PSTs files loaded in Outlook.
1) Start the Macro.
2) The macro fetches the PST files loaded in your Outlook application and displays the form.
3) Select the PST file in set1. The set1 list gets updated with the folders in this PST file. Select the first folder you need to compare.
4) Similarly, select the PST file in set2. The set2 folder list gets updated with the selected PST file contents. Select the second folder you need to compare here.
5) Next choose whether to Mark the Duplicates or simply show a summary of duplicate sets.
6) Ok Button gets enabled when all the right options are selected.
7) Click on the OK button to start the process! This indexing step of the process may take time for large number of items... a progress bar displays the progress as well as a status bar with basic updates. When the process is done, the status shows the Completed message (and other statistics if you scroll up).
8) Manual step of actual deleting duplicate mails: After running the program, you have to open the Outlook folder that you want to clean up and sort by the Categories column to see those marked in Duplicate like below (the Categories column is hidden by default - this has to be added from the Field Chooser dialog box):
Results showing marked mails under attribute, "Categories"

You can give a quick look to make sure they are really duplicates. Then you simply shift-select those mails and delete! :)

Performance
Efficiency was increased by indexing the 2 folders before identifying duplicates- for 3000 + 1000 emails in the 2 folders, without indexing it took a few hours and still didnt finish (because there would be 3000 * 1000 reading of outlook items)! Now it runs in 5 mins.

Customisation
The current macro only compares the Subject and Creation Date of 2 emails to identify if they are the same (which is actually fairly safe). If you want, you can have more precision by editing the code to check more mail attributes.
Also as mentioned before it is simple to change the code to delete duplicate mails instead of just marking them.

I made the macro portable by embedding it in an Excel document with a launcher button (what you see in the background of the first 2 screens above). The Excel document containing all the code is available here. Of course, you can also create the macro right into your Outlook client itself (you'll have to copy the code manually from the Excel-macro code page to your Outlook-macro code page).

Below is the core code which is called from the form:


Const PST1_NAME = "NewBackup"
Const PST2_NAME = "NewBackup"
Const FOLDER1_NAME = "Inbox"
Const FOLDER2_NAME = "OldInbox"
Const CATEGORY_SEPERATOR = ","
Const FINAL_PROGRESS_ALLOCATED = 20 ' between 1 and 100

Public progressValue 'this holds the percentage completed.
Public progressStatus 'this holds the current status.

' sample with hardcoded psts & folders
Private Sub markDuplicateEmails()
markDuplicates PST1_NAME & SEPERATOR & FOLDER1_NAME, PST2_NAME & SEPERATOR & FOLDER2_NAME, DEFAULT_CATEGORY
End Sub

' actual method which takes dynamic pst\folder source and destination
Public Sub markDuplicates(source, destination, category)
Dim myOlApp, myNameSpace
Set myOlApp = CreateObject("Outlook.Application")
Set myNameSpace = myOlApp.GetNamespace("MAPI")

Dim tmpArray, pst1Name, pst2Name, folder1Name, folder2Name
tmpArray = Split(source, SEPERATOR)
pst1Name = tmpArray(0)
folder1Name = tmpArray(1)
tmpArray = Split(destination, SEPERATOR)
pst2Name = tmpArray(0)
folder2Name = tmpArray(1)

Dim folder1Size, folder2Size
folder1Size = myNameSpace.Folders(pst1Name).Folders(folder1Name).Items.Count
folder2Size = myNameSpace.Folders(pst2Name).Folders(folder2Name).Items.Count

Dim array1() As cstData, array2() As cstData
ReDim array1(folder1Size)
ReDim array2(folder2Size)

Dim outlookItem1, outlookItem2, i, j
Dim theCstmData As Module1.cstData
Dim startTime, endTime

'populate array1
i = -1
startTime = Now
progressStatus = "Indexing set1..."
For Each outlookItem1 In myNameSpace.Folders(pst1Name).Folders(folder1Name).Items
i = i + 1
Set theCstmData.item = outlookItem1
theCstmData.subject = outlookItem1.subject
theCstmData.creationTime = outlookItem1.creationTime
array1(i) = theCstmData
progressValue = 100 * (i / (folder1Size + folder2Size + (folder1Size + folder2Size) * (FINAL_PROGRESS_ALLOCATED / 100)))
DoEvents
Next outlookItem1
progressStatus = "Indexing set1 Complete."
' populate array2
i = -1
progressStatus = "Indexing set2..."
For Each outlookItem2 In myNameSpace.Folders(pst2Name).Folders(folder2Name).Items
i = i + 1
Set theCstmData.item = outlookItem2
theCstmData.subject = outlookItem2.subject
theCstmData.creationTime = outlookItem2.creationTime
array2(i) = theCstmData
progressValue = 100 * ((folder1Size + i) / (folder1Size + folder2Size + (folder1Size + folder2Size) * (FINAL_PROGRESS_ALLOCATED / 100)))
DoEvents
Next outlookItem2
progressStatus = "Indexing set2 Complete."
progressStatus = "Indexing time: " & (Now - startTime) * 60 * 60 * 24

'loop through each item in array1
progressStatus = "Applying Category labels on duplicates..."
For i = 0 To folder1Size - 1
'loop through each item in array 2 comparing each array2Item with current array1item
For j = 0 To folder2Size - 1
' if it is a match mark the item in array2 as duplicate
If array1(i).subject = array2(j).subject And _
array1(i).creationTime = array2(j).creationTime Then
If array1(i).item.Categories = "" Then
array1(i).item.Categories = category
Else
array1(i).item.Categories = array2(j).item.Categories & CATEGORY_SEPERATOR & category
End If
array1(i).item.Save

If array2(j).item.Categories = "" Then
array2(j).item.Categories = category
Else
array2(j).item.Categories = array2(j).item.Categories & CATEGORY_SEPERATOR & category
End If
array2(j).item.Categories = category
array2(j).item.Save
End If
DoEvents
Next j
progressValue = (100 - FINAL_PROGRESS_ALLOCATED) + (FINAL_PROGRESS_ALLOCATED * (i / folder1Size))
Next i
progressStatus = "Total Time: " & (Now - startTime) * 60 * 60 * 24
progressStatus = "All done."
End Sub


Update [25-Jun-09]:
New features
- Option to only count duplicates instead of Marking.
- Now displays even sub-folders recursively
- 2 new status views available to display number of items in selected folder.
- Other performance enhancements
See screenshot below:


Update [03-Jul-09]:
New features
-support for choosing same folder in set1 and set2. This will mark/count duplicate sets of mails in the same folder.

Update [15-Feb-12]:
New features
-Support for identifying Missing mails - choose source and destination folders, and all mails missing in destination will be identified.
-Support for marking labels as well as just counting Duplicates or Missing mails


Note: This was tested on Outlook 2003 but should work on any outlook...

Thursday, June 18, 2009

The Little Guide of IT Jargons and Acronyms

Here's an age-old problem perhaps much before the time Pythogaras stated to a non-mathematical audence: "The square of the hypotinuse..."

See most of us techies are really proud of the way we can rattle of so many tech terms and especially acronyms. Unfortunately, a sentence like that would just go right over a non-techie client or a new team member. What makes things worse is that requirement and document acronyms often vary from company to company.

Of course there are people who use them a lot to show off... and its the general opinion of the other crows that lots of jargon/acronyms implies you are loud-mouthed and/or not really knowledgeable. I absolutely agree when I hear Sales people at Mutual Funds talking about CAMS (of course I do catch them red handed if its a computer shop salesperson!).

Unfortunately a lot of the times, we techies are simply unaware when we're doing it! Perhaps what we could do is distribute a common acronym list before a client meeting? Something like below... of course the list would grow (or shrink?) based on client feedback after every meeting.

I would keep updating as the list would be obviously too large to put up at one go!
The Little Guide of IT Acronyms:
Tech Acronyms
BPM - Business Process Mangement
EDI - Electronic Data Interchange
ETL - Extract, Transform, Load
JSeam - JBoss Seam
RIA - Rich Internet Application
SaaS - Software as a Service
SOA - Service Oriented Archiecture
SOBA - Service Oriented Business Application
XSD - Xml Schema Definition
WSDL - Web Service Definition Language

Doc acronyms
CDS/FDS/TDS - Component/Functional/Technical Design Specification
Specs - Requirement Specifications
SSRS - Sub-System Requirements Specification
UTP - Unit Test Plan
UTR - Unit Test Results

Other
UIT/UAT - User Integration/Acceptance Testing
 
Superblog Directory