OpenDeID Dataset
The OpenDeID corpus is the first Australian gold-standard corpus designed specifically for patient de-identification. It supports the development and evaluation of automated de-identification systems, whether rule-based or machine-learning based. The corpus comprises 2,100 pathology reports from 1,833 cancer patients, with each report averaging approximately 717 tokens. Annotators identified 38,414 Protected Health Information (PHI) entities. The inter-annotator agreement and deviation scores across all three de-identification settings are high, at 0.9464 and 0.9503, respectively. The corpus has been manually annotated with surrogate information, so it contains no identifiable patient data. It can therefore be used to develop and evaluate de-identification technologies and practices while upholding stringent privacy standards.
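The headline figures above imply a few derived quantities that help when sizing an experiment. The following is a minimal sketch that only does arithmetic on the numbers quoted in the description. The constant and function names are hypothetical, and the token total is an approximation because the per-report average is itself approximate.

```python
# Figures quoted in the corpus description (not read from the dataset itself).
NUM_REPORTS = 2_100
AVG_TOKENS_PER_REPORT = 717  # approximate average stated in the description
NUM_PATIENTS = 1_833
NUM_PHI_ENTITIES = 38_414


def corpus_summary() -> dict:
    """Derive rough corpus-level statistics from the published figures."""
    return {
        # Approximate total, since the per-report average is rounded.
        "approx_total_tokens": NUM_REPORTS * AVG_TOKENS_PER_REPORT,
        # Mean number of annotated PHI entities per pathology report.
        "phi_per_report": NUM_PHI_ENTITIES / NUM_REPORTS,
        # Some patients contribute more than one report.
        "reports_per_patient": NUM_REPORTS / NUM_PATIENTS,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value:,.2f}")
```

On these figures the corpus holds roughly 1.5 million tokens, with about 18 PHI entities per report on average.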
About HSA Biobank
The HSA Biobank is a collaborative initiative based at the Lowy Cancer Research Centre at the University of New South Wales, Sydney, Australia. It supports researchers and clinicians in advancing translational cancer research across Australia and internationally. The HSA Biobank houses all types of tumor tissue obtained from patients who have undergone surgery at one of the HSA hospitals and have given their consent to the HSA Biobank. For more information, see:
http://www.tcrn.unsw.edu.au/hsa
Ethics approval
This project is approved by the UNSW ethics panel. The approval details are listed below.
Principal Investigator: Jitendra Jonnagaddala
Project Title: Automatic de-identification of unstructured pathology reports using deep neural networks
Approval Number: UNSW HC17749
Dataset access fees
Data access fee: A$2,800 (excluding taxes)
Additional A$950 if the ethics approval is not in English
Payment via bank transfer to UNSW Sydney
Dataset Access Instructions
Fill out this form to obtain the corpus
Once the request is approved, sign and return the SREDH Consortium membership, data usage, and project description forms, which are sent after you submit the data request form above.
Pay data access and associated fees, if applicable
Download the dataset from the SREDH secure server.
Submit a progress report every 6 months until the completion of the project
Access Criteria
Available to researchers (academic and non-academic) for non-commercial purposes.
Researchers need to have experience in handling sensitive patient data and training in ethics.
Researchers are required to report bi-annually to the SREDH Consortium on any research outputs that arise.
Any output that arises from this dataset needs to be reviewed by the data custodian (SREDH Consortium) before submission.
Selected publications
Jonnagaddala, J., Chen, A., Batongbacal, S., & Nekkantti, C. (2021). The OpenDeID corpus for patient de-identification. Scientific Reports, 11(1), 19973. https://doi.org/10.1038/s41598-021-99554-9
Chen, A., Jonnagaddala, J., Nekkantti, C., & Liaw, S. T. (2019). Generation of Surrogates for De-Identification of Electronic Health Records. Studies in Health Technology and Informatics, 264, 70–73. https://doi.org/10.3233/SHTI190185
Alla, N. L. V., Chen, A., Batongbacal, S., Nekkantti, C., Dai, H., & Jonnagaddala, J. (2021). Cohort selection for construction of a clinical natural language processing corpus. Computer Methods and Programs in Biomedicine Update, 1, 100024. https://doi.org/10.1016/j.cmpbup.2021.100024
Liu, J., Gupta, S., Chen, A., Wang, C. K., Mishra, P., Dai, H. J., Wong, Z. S., & Jonnagaddala, J. (2023). OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study. Journal of Medical Internet Research, 25, e48145. https://doi.org/10.2196/48145