PHDD Corpus: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic

Corpus Description

The PHDD corpus consists of a set of English-language tweets that were manually tagged for privacy purposes. Each tweet is tagged along three dimensions, described below:

-Health Sensitivity Status: Specifies whether the tweet contains any physical health information of an identifiable individual (Health Sensitive/Not Health Sensitive).

-Health Information Category: For Health Sensitive tweets, determines the type(s) of physical health information provided (Test Result, Symptom, Disease, Other information). All Health Sensitive tweets are also tagged with a "Physical Health" label.

-Health History Subject: For Health Sensitive tweets, specifies the relationship of the tweet author to the disclosed information: whether it is a self-disclosure of the author's own health history (Individual Health History), a disclosure about the health history of the author's family (Family Health History), or a disclosure about other identified or identifiable people (Others Health History).

To publish this corpus in a machine-readable format, we used a lightweight ontology called "Privacy Tags for Health Information (PTHI)", created specifically for this goal. Supplementary information about PTHI can be found here.

A typical record in the corpus contains an identifier and a set of associated tags. Below you can see the diagram of a sample record, as well as its associated RDF code:

@prefix pthi: <https://protect.oeg.fi.upm.es/def/pthi#/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dpv: <http://www.w3.org/ns/dpv#> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# The record: one tweet with its identifier, metadata, and the three tag dimensions
pthi:1298037841700638726 a sioc:Post ;
    sioc:id "1298037841700638726" ;
    sioc:created "Mon Aug 24 23:21:52 +0000 2020" ;
    sioc:content "@RexChapman My mom has dementia and shes in quarantine because of COVID. Im afraid shes not going to remember who I am."@en ;
    pthi:hasHealthSensitiveStatus pthi:healthSensitive ;
    pthi:hasHealthInformationCategory pthi:PhysicalHealth, pthi:Disease ;
    pthi:hasHealthHistorySubjectTag pthi:FamilyHealthHistory .

# Alignment of PTHI tags with the corresponding DPV concepts
pthi:PhysicalHealth owl:sameAs dpv:PhysicalHealth .
pthi:FamilyHealthHistory owl:sameAs dpv:FamilyHealthHistory .
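
As a rough illustration of how the tags of such a record could be read programmatically, the following Python sketch pulls the identifier and the three tag dimensions out of a single record using only the standard library. The function name extract_tags and the SAMPLE string are hypothetical and introduced here for illustration only, and the tweet text is left out of the sample. Regular expressions are a shortcut that assumes the layout shown above; for the full corpus, a proper RDF parser would likely be more robust.

```python
import re

# Sample record mirroring the one above (tweet text and date omitted).
SAMPLE = '''
pthi:1298037841700638726 a sioc:Post ;
    sioc:id "1298037841700638726" ;
    pthi:hasHealthSensitiveStatus pthi:healthSensitive ;
    pthi:hasHealthInformationCategory pthi:PhysicalHealth, pthi:Disease ;
    pthi:hasHealthHistorySubjectTag pthi:FamilyHealthHistory .
'''

# The three PTHI properties that carry the annotation dimensions.
TAG_PROPERTIES = (
    "hasHealthSensitiveStatus",
    "hasHealthInformationCategory",
    "hasHealthHistorySubjectTag",
)


def extract_tags(turtle_text):
    """Return a dict with the record id and a list of tags per dimension.

    Hypothetical helper: assumes one record per string and that no
    literal in the text contains ';' or '.'.
    """
    record = {}
    id_match = re.search(r'sioc:id\s*"(\d+)"', turtle_text)
    # Keep the identifier as a string: tweet ids exceed 2**53 and can
    # lose precision if converted to floating point.
    record["id"] = id_match.group(1) if id_match else None
    for prop in TAG_PROPERTIES:
        # Capture the object list up to the next ';' or '.'.
        match = re.search(r"pthi:" + prop + r"\s+([^;.]*)", turtle_text)
        objects = match.group(1) if match else ""
        # Tolerate optional whitespace after the 'pthi:' prefix.
        record[prop] = re.findall(r"pthi:\s*(\w+)", objects)
    return record


if __name__ == "__main__":
    print(extract_tags(SAMPLE))
```

Each dimension is returned as a list because a tweet can carry several Health Information Category tags at once, as in the sample record.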

Twitter's privacy policy restricts re-publishing the content of tweets. Because of this restriction, the text of the tweets is not included in the main corpus, but you can retrieve it using the tweet identifiers.
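
One simple way to look up a tweet from its identifier is to build a link to it, as in the Python sketch below. The function name tweet_url is hypothetical, and the URL pattern is an assumption about Twitter's web interface rather than something defined by the corpus; it may change, and deleted or protected tweets will not resolve. Bulk retrieval would normally go through the official Twitter API, which requires credentials and is not shown here.

```python
# Assumed URL pattern of Twitter's web interface (not part of the corpus).
STATUS_URL_PREFIX = "https://twitter.com/i/web/status/"


def tweet_url(tweet_id):
    """Build a web link for a tweet identifier taken from the corpus.

    Identifiers are handled as strings because they are larger than
    2**53 and may be rounded when stored as floating-point numbers,
    for example by spreadsheet software.
    """
    tweet_id = str(tweet_id).strip()
    if not (tweet_id.isascii() and tweet_id.isdigit()):
        raise ValueError("tweet identifier must consist of digits only")
    return STATUS_URL_PREFIX + tweet_id


if __name__ == "__main__":
    print(tweet_url("1298037841700638726"))
```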

Corpus Download

The corpus contains tags in the aforementioned dimensions, assigned using these criteria. You can download the corpus in RDF or XLSX format.

Authorship

For copyright reasons, the text of the tweets is not available for download (but requests sent to rsaniei.AT.delicias.dia.fi.upm.es will be considered). The annotations, however, are the work of Rana Saniei, Delaram Golpaygani, Beatriz Gonçalves Crisóstomo Esteves, and Karen Vázquez Flores, and are freely downloadable under the CC-BY 4.0 license.