To group by URL, you need a database or something similar, which puts us back in the same place.
I already have the logic for generic tab dedupe in place, so I just ran the data through it. Deduping by URL would be custom work for these particular dumps. Ideally they would fix it upstream.
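For what it's worth, here is a rough sketch of what the "database or the like" route could look like, using SQLite from the Python standard library as the lookup store. Everything specific here is an assumption for illustration: the function name `dedupe_by_url`, the `seen` table, the idea that rows are tab-separated, and that the URL sits in column `url_col`. It is not the dedupe logic I actually run.

```python
import sqlite3
from typing import Iterable, Iterator


def dedupe_by_url(
    lines: Iterable[str],
    url_col: int = 0,          # assumed position of the URL field (hypothetical)
    db_path: str = ":memory:",  # point at a file if the URL set won't fit in RAM
) -> Iterator[str]:
    """Yield only the first line seen for each URL, in input order."""
    conn = sqlite3.connect(db_path)
    try:
        # The primary key gives us an indexed "have I seen this URL?" lookup.
        conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if url_col >= len(fields):
                # Row has no URL column: pass it through untouched.
                yield line
                continue
            cur = conn.execute(
                "INSERT OR IGNORE INTO seen (url) VALUES (?)",
                (fields[url_col],),
            )
            # rowcount is 1 when the URL was new, 0 when the insert was ignored.
            if cur.rowcount == 1:
                yield line
        conn.commit()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = ["http://a\tfirst\n", "http://b\tsecond\n", "http://a\tthird\n"]
    for row in dedupe_by_url(sample):
        print(row, end="")
```

The on-disk option is the whole point: an in-memory set does the same job for small inputs, but once the URL set outgrows RAM you need something indexed on disk, which is the custom work I'd rather not do per dump.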
Data cleansing can be very time-consuming and tedious.