Have you ever needed to remove duplicate rows in a data set? Well me neither, but an unnamed PM asked me to do it one night long ago and I am finally getting to publishing how.
I have some survey data (totally fake) and in the data there are duplicate rows with one column having different data. Notice Viv has 3 of the same rows listed with various favorite foods. Not good when we what to know for sure what the people’s favorite food is!
Method 1 – Project Calculation
In this method I will add calculations right in my project. First calculation I add is:
RCOUNT(PERSON_NUMBER by PERSON_NUMBER)
This will do a running count of the rows based on the column PERSON_NUMBER. I want one row per person so this is why I selected that column.
Now that I am counting the rows I want to know the total count and I want that available at the row level.
COUNT(PERSON_NUMBER by PERSON_NUMBER)
This will yield a total count for that person and display it in every row.
Now I have to pick what row to keep in this case I have 2 examples first or last and yes is it arbitrary. For last row you can use this function:
CASE WHEN RCOUNT(PERSON_NUMBER by PERSON_NUMBER) - COUNT(PERSON_NUMBER by PERSON_NUMBER)=0 THEN 'Keep' ELSE 'Hide' END
For first you simply do the following:
CASE WHEN RCOUNT(PERSON_NUMBER by PERSON_NUMBER) - COUNT(PERSON_NUMBER by PERSON_NUMBER)=1 THEN 'Keep' ELSE 'Hide' END
From here you can leverage visualization or canvas filters to only show the rows you want.
Method 2 – Data Flows
Using data flows has the advantage of not having to apply filters or have extra columns for users to wonder about as the resulting data set will only have the rows and columns needed for analysis.
Over all the calculations are just about the same and I have set this up to produce two data sets; one for first row and the other for last row.
Prior to my split I add columns for Total Count and Running Count using the same exact functions as above.
These columns are used in both my first and last functions so I want them in place prior to the split. Also I plan to exclude those columns in my output so the data set used in my projects are clean.
This time for picking the first row all I needed to do was to filter on my Running Count column when it is = 1. Same concept as the calculation above.
For keeping the last row a calculation is needed to determine the this as there can be a different number of duplicate rows for a given person. This function just subtracts Total Count from Running Count and when 0 we have the last row.
Then for this branch I use the same filer concept and above and filter to show only those where the above calculation is = 0.
The rest of the Data Flow just cleans things up like hiding my calculation column along with any renames needed.
Overall this is not as straightforward as some single click option, but it gets the job done and the concepts here can be customized to meet a bunch of unique needs. I am sure this can also be done in a better way but I was up too late to try when I created this project. Hope it helps!