Remove Duplicates in a Data Set

Have you ever needed to remove duplicate rows in a data set? Well me neither, but an unnamed PM asked me to do it one night long ago and I am finally getting to publishing how.

The Setup

I have some survey data (totally fake) and in the data there are duplicate rows with one column having different data. Notice Viv has 3 of the same rows listed with various favorite foods. Not good when we what to know for sure what the people’s favorite food is!

How can she like candy, soda, and sugar??

Method 1 – Project Calculation

In this method I will add calculations right in my project. First calculation I add is:

 RCOUNT(PERSON_NUMBER by PERSON_NUMBER)

This will do a running count of the rows based on the column PERSON_NUMBER. I want one row per person so this is why I selected that column.

Counting the row for Viv.

Now that I am counting the rows I want to know the total count and I want that available at the row level.

COUNT(PERSON_NUMBER by PERSON_NUMBER)

This will yield a total count for that person and display it in every row.

Viv has 3 rows.

Now I have to pick what row to keep in this case I have 2 examples first or last and yes is it arbitrary. For last row you can use this function:

CASE WHEN RCOUNT(PERSON_NUMBER by PERSON_NUMBER) - COUNT(PERSON_NUMBER by PERSON_NUMBER)=0 THEN 'Keep' ELSE 'Hide' END

For first you simply do the following:

CASE WHEN RCOUNT(PERSON_NUMBER by PERSON_NUMBER) - COUNT(PERSON_NUMBER by PERSON_NUMBER)=1 THEN 'Keep' ELSE 'Hide' END

From here you can leverage visualization or canvas filters to only show the rows you want.

I am an 80s kid.

Method 2 – Data Flows

Using data flows has the advantage of not having to apply filters or have extra columns for users to wonder about as the resulting data set will only have the rows and columns needed for analysis.

Over all the calculations are just about the same and I have set this up to produce two data sets; one for first row and the other for last row.

Prior to my split I add columns for Total Count and Running Count using the same exact functions as above.

These columns are used in both my first and last functions so I want them in place prior to the split. Also I plan to exclude those columns in my output so the data set used in my projects are clean.

This time for picking the first row all I needed to do was to filter on my Running Count column when it is = 1. Same concept as the calculation above.

For keeping the last row a calculation is needed to determine the this as there can be a different number of duplicate rows for a given person. This function just subtracts Total Count from Running Count and when 0 we have the last row.

Then for this branch I use the same filer concept and above and filter to show only those where the above calculation is = 0.

The rest of the Data Flow just cleans things up like hiding my calculation column along with any renames needed.

Summary

Overall this is not as straightforward as some single click option, but it gets the job done and the concepts here can be customized to meet a bunch of unique needs. I am sure this can also be done in a better way but I was up too late to try when I created this project. Hope it helps!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.