Data management scenario; non-unique patient IDs

Imagine a healthcare setting where a new patient ID is generated for every new case that comes in, but you’re interested in retrospectively considering how often the same patients returned to that clinic over time – there is no clear way of distinguishing them from the IDs. How do you go about identifying the same patients? Which commonly recorded variables would you use and how reliable are they over time? How many variables would be sufficient to be sure you had the same person?


  • jnmumma Jane Mumma 2 Feb 2013

    Does anyone know of an upcoming GCP training in Kisumu, Kenya?

  • Hi Clarissa. The 30% were done manually, mainly because of missing data, or often the same name but a different date of birth. However, there was one case where Jan van der Merwe (a very common Afrikaner name) with a particular date of birth was actually two different people ;-)
    In your case, where the data is anonymised, I don't see how you can match records at the individual level. I think you'd need some model that matches patient demographic and clinical data. For example, if you suspect a particular study has been loaded twice, you should be able to compare on date of birth, gender, weight, height, diagnosis, and then first-visit data such as blood pressure, temperature, etc. Date of birth and gender are (to my mind) not sufficient for a large study.
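As a rough illustration of the comparison suggested in the post above, the sketch below builds a "fingerprint" from several demographic and first-visit fields and reports what fraction of one study's records also appear in another. The field names, the helper names `fingerprint` and `overlap_fraction`, and the exact-equality comparison are all assumptions made for this example; real data would probably need tolerance for rounding and unit differences.

```python
# Hypothetical field names; substitute whatever the repository actually records.
MATCH_FIELDS = (
    "date_of_birth", "gender", "weight_kg", "height_cm",
    "diagnosis", "first_visit_systolic_bp", "first_visit_temp_c",
)

def fingerprint(record, fields=MATCH_FIELDS):
    """Return a tuple of the chosen fields, or None if any field is missing."""
    values = tuple(record.get(f) for f in fields)
    return None if any(v is None for v in values) else values

def overlap_fraction(study_a, study_b, fields=MATCH_FIELDS):
    """Fraction of usable records in study_a whose fingerprint also occurs in study_b."""
    seen_b = {fp for fp in (fingerprint(r, fields) for r in study_b) if fp is not None}
    usable_a = [fp for fp in (fingerprint(r, fields) for r in study_a) if fp is not None]
    if not usable_a:
        return 0.0  # nothing to compare
    return sum(1 for fp in usable_a if fp in seen_b) / len(usable_a)
```

An overlap close to 1.0 between two submissions would suggest the same study was loaded twice. A low but non-zero overlap is harder to interpret, since a few coincidental matches are plausible in a large repository.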

  • clarissam Clarissa Moreira 23 Jan 2013

    Mike, how did you go about matching the remaining 30% of cases?

    We're thinking about this situation here at WWARN, where we don't have any names or initials attached to the data. We want to ensure our repository of data from many different malaria clinical trials doesn't have patient duplicates (sometimes people accidentally send us the same data twice).

    We're going to try matching on age or date of birth, gender and maybe country or continent where the trial was conducted. I'll keep you all updated on how we go with this.


  • Sorry for the late reply, but this may give some insight.
    Working for an insurance company that acquired two other insurers, we had a similar problem of merging client records. Using surname, first initial and date of birth, over 70% were successfully matched automatically.
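The automatic step described above could look something like the following sketch. It is not the insurer's actual procedure: the record layout (`surname`, `first_name`, `date_of_birth` as strings) and the function names `match_key` and `auto_match` are assumed for illustration. Records with a missing key part, or with zero or several candidates on the other side, are set aside for manual review.

```python
def match_key(record):
    """Build a (surname, first initial, date of birth) key; None if a part is missing."""
    surname = (record.get("surname") or "").strip().lower()
    first_name = (record.get("first_name") or "").strip().lower()
    dob = (record.get("date_of_birth") or "").strip()
    if not (surname and first_name and dob):
        return None
    return (surname, first_name[0], dob)

def auto_match(source_a, source_b):
    """Pair each record in source_a with its single key match in source_b.

    Returns (matched_pairs, needs_manual_review).
    """
    by_key_b = {}
    for rec in source_b:
        key = match_key(rec)
        if key is not None:
            by_key_b.setdefault(key, []).append(rec)

    matched, manual = [], []
    for rec in source_a:
        candidates = by_key_b.get(match_key(rec), [])
        if len(candidates) == 1:
            matched.append((rec, candidates[0]))
        else:
            manual.append(rec)  # missing data, no match, or ambiguous match
    return matched, manual
```

Dividing `len(matched)` by `len(source_a)` gives the automatic match rate. Even an exact key match can pair two different people who share a common name and date of birth, so a sample of the automatic matches is worth checking by hand.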

  • Naomi Naomi Waithira 30 Nov 2012

    In an ideal situation, you would have two identifiers. One would be a person identifier, that uniquely identifies a person and is constant on every visit and another is the case identifier that uniquely identifies a visit.

    In the case that you describe, the type and number of variables you select would largely depend on the data collected. A combination of commonly recorded variables such as date of birth + gender + place of birth would be useful. However, you would still find some duplicates (especially if your population is large) and you may need to look up more information to identify unique patients.
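A minimal sketch of the two-identifier idea, assuming each case is a dictionary with a `case_id` plus the three fields named in the post: it derives a provisional person identifier from date of birth + gender + place of birth and keeps the existing case identifier for each visit. The field names and the functions `assign_person_ids` and `visits_per_person` are hypothetical. Two different people who share all three values would be merged under one provisional ID, which is exactly the residual duplication the post warns about.

```python
from collections import defaultdict

def assign_person_ids(cases, key_fields=("date_of_birth", "gender", "place_of_birth")):
    """Map each case_id to a provisional person_id derived from key_fields."""
    key_to_person = {}
    case_to_person = {}
    for case in cases:
        # Normalise case and whitespace so trivial differences do not split a person.
        key = tuple(str(case[f]).strip().lower() for f in key_fields)
        # A new key gets the next integer; a key seen before keeps its existing ID.
        person_id = key_to_person.setdefault(key, len(key_to_person) + 1)
        case_to_person[case["case_id"]] = person_id
    return case_to_person

def visits_per_person(case_to_person):
    """Count how many cases (visits) fall under each provisional person_id."""
    counts = defaultdict(int)
    for person_id in case_to_person.values():
        counts[person_id] += 1
    return dict(counts)
```

Any provisional ID with more than one visit is a candidate returning patient. Those groups are the ones worth checking against additional information before treating them as the same person.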

  • JimTodd JimTodd 30 Nov 2012

    Interesting. Do you have the names in the patient file? Or are these confidential? Also, if the clinic is for something socially embarrassing (e.g. STI) then people may not use their own name.

    This is an exercise in matching of records and there are no definitive methods for this. But some good algorithms do exist (though I have not got them to hand). I would certainly be interested in any experience with this, even if the initial matching was not that successful.
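The post above does not name a particular algorithm. Purely as one illustration of approximate record matching, the sketch below combines a fuzzy name comparison (using `difflib.SequenceMatcher` from the Python standard library) with exact comparisons on date of birth and gender into a single weighted score. The weights, field names and the functions `name_similarity` and `match_score` are assumptions for this example, not values taken from the thread.

```python
from difflib import SequenceMatcher

# Illustrative weights only; they would need tuning against manually checked pairs.
WEIGHTS = {"name": 0.5, "date_of_birth": 0.3, "gender": 0.2}

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted agreement score in [0, 1] between two records."""
    score = WEIGHTS["name"] * name_similarity(rec_a["name"], rec_b["name"])
    if rec_a["date_of_birth"] == rec_b["date_of_birth"]:
        score += WEIGHTS["date_of_birth"]
    if rec_a["gender"] == rec_b["gender"]:
        score += WEIGHTS["gender"]
    return score
```

Pairs scoring above some chosen threshold could be accepted automatically, with a middle band sent for manual review. Where those cut-offs should sit depends on the data and on how costly a false match is.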
