The BioTHAD® technology uses natural language processing techniques to provide automated monitoring and analysis of internet-based communications and enable the timely detection of disease or bio-terrorism outbreak events.
Most emerging infectious diseases and agents of bio-terrorism present as nonspecific “flu-like illnesses.” As a consequence, early detection of disease and biological agents benefits from timely capture and analysis of alternative data sources rather than waiting for laboratory-confirmed diagnoses. Because syndromic surveillance systems access and analyze data streams not typically available to departments of health, they have the potential to identify unusual illness patterns before definitive diagnoses are made, allowing health care workers to get a jump on outbreaks. The ability to detect aberrant health-related activity using statistical tools that identify unusual spatiotemporal distributions of symptoms can significantly augment traditional surveillance.
A rich source of data for bio-surveillance is the wealth of news articles posted to or available through the web. While widespread availability is a major advantage, the sheer volume of available information and the nuances of natural language used in news articles make these articles challenging to exploit. The Biovigilance & Threat Assessment Dashboard (BioTHAD®) technology automates the processing of news reports and flags reports of potential interest. This process includes highlighting or tagging relevant elements within reports and extracting data for subsequent analysis.
The BioTHAD® technology has software agents that can be configured to monitor information sources such as news feeds and medical publications on a daily basis. The agents download HTML files from designated sites (for example, ProMed Mail, WHO, BBC Health News, CDC Morbidity and Mortality Weekly Reports, The Lancet Infectious Diseases, and BMC Infectious Diseases). Text is extracted from these HTML files, converted to an XML format, and then processed by a natural language processing (NLP) pipeline developed by KBSI to extract the relevant information.
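The HTML-to-XML step described above can be illustrated with a minimal sketch. It uses only the Python standard library and an inline HTML string in place of a downloaded page. The class `TextExtractor`, the function `html_to_xml`, and the `<report>`/`<p>` element names are assumptions made for illustration; the actual XML format used by the BioTHAD® technology is not described here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore empty text nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_xml(html, source):
    """Wrap each extracted text chunk in a <p> element under <report>.

    The element and attribute names are hypothetical.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("report", {"source": source})
    for chunk in parser.chunks:
        ET.SubElement(root, "p").text = chunk
    return ET.tostring(root, encoding="unicode")


sample_html = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Cholera outbreak</h1><p>Ten cases were   reported.</p>"
    "</body></html>"
)
xml_doc = html_to_xml(sample_html, "example-feed")
print(xml_doc)
```

A production agent would also need to fetch pages on a schedule and strip navigation and advertising text, which this sketch leaves out.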
The main objective of the initial text processing stages is to identify and classify the phrases that may be constituents of event patterns. The text extraction process includes sentence boundary detection, tokenization, part-of-speech tagging, phrase chunking, Subject-Verb-Object (SVO) assignment, clause segmentation, and named entity recognition. The BioTHAD® technology’s named entity recognition module recognizes generic named entities such as Persons, Locations, and Organizations.
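The first two stages listed above, sentence boundary detection and tokenization, can be sketched with simple regular expressions. This is a naive rule-based stand-in, not the KBSI pipeline: the function names `split_sentences` and `tokenize` are hypothetical, and the sentence rule would fail on abbreviations such as "Dr. Smith". The later stages (part-of-speech tagging, chunking, SVO assignment, named entity recognition) normally require trained models and are not shown.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Return words (allowing internal hyphens/apostrophes), numbers,
    and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


text = "Officials confirmed 12 cases of cholera in Dhaka. Two patients died."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
for s, t in zip(sentences, tokens):
    print(s, "->", t)
```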
In the domain event extraction step, the domain concept tagger tags concepts that are domain-specific. The BioTHAD® technology tags key concepts associated with disease and biological events: Diseases, Symptoms, Pathogens, Antibiotics, Location, Date, Outbreak terms, Resistance terms, Victim Type, and Severity. Once it has identified domain-specific concepts, the BioTHAD® tool can more easily extract domain-specific event frames and perform event pattern matching. The event pattern matching component reads all the predefined event patterns and compares them with the processed text to find matches. If a pattern is matched, an ‘event’ is generated. The BioTHAD® tool then extracts relevant textual features from the NLP output. These features are generally very domain-specific (noun phrases, disease, date, location, verb phrases, root forms of verbs, and so on) and together produce a simplified representation of the processed text: a feature vector.
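The concept tagging and pattern matching steps can be sketched as a dictionary lookup followed by a set comparison. Everything named here is an assumption for illustration: the `LEXICON` entries, the single `disease_outbreak` pattern, and the rule that a pattern matches when all of its required concepts occur in one sentence. The predefined patterns used by the BioTHAD® tool are not described in this text and are likely richer.

```python
import re

# Hypothetical lexicons covering three of the concept types named above.
LEXICON = {
    "Disease": ["cholera", "anthrax", "avian influenza"],
    "Outbreak": ["outbreak", "epidemic", "cluster"],
    "Location": ["dhaka", "nairobi"],
}

# Hypothetical event pattern: the concepts that must all be present.
EVENT_PATTERNS = {"disease_outbreak": {"Disease", "Outbreak", "Location"}}


def tag_concepts(sentence):
    """Return (concept, matched text, offset) tuples in order of appearance."""
    lowered = sentence.lower()
    tags = []
    for concept, terms in LEXICON.items():
        for term in terms:
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
                tags.append((concept, sentence[m.start():m.end()], m.start()))
    return sorted(tags, key=lambda t: t[2])


def match_events(sentence):
    """Generate one event per pattern whose required concepts all occur."""
    tags = tag_concepts(sentence)
    present = {concept for concept, _, _ in tags}
    events = []
    for name, required in EVENT_PATTERNS.items():
        if required <= present:
            # If a concept occurs more than once, the last mention wins.
            slots = {c: text for c, text, _ in tags if c in required}
            events.append({"event": name, "slots": slots})
    return events


events = match_events("A cholera outbreak was reported in Dhaka.")
no_events = match_events("Cholera is a bacterial disease.")
print(events)
```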
The feature vector is then fed to the frame matcher, which matches incoming feature vectors against a set of predefined event frames, or templates, stored in the BioTHAD® tool database. This step relies on the database repository of event frames that captures disease incidents in news reports.
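As a rough illustration of frame matching, the sketch below represents a feature vector as a dictionary and each event frame as a template with trigger verbs and required slots. The `EventFrame` structure, the two example frames, the feature names, and the matching rule are all hypothetical, and an in-memory list stands in for the database repository of event frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventFrame:
    """Hypothetical template: root verbs that trigger it and slots it needs."""
    name: str
    trigger_verbs: frozenset
    required_slots: frozenset


# Stand-in for frames that would be loaded from the database.
FRAMES = [
    EventFrame("outbreak_report",
               frozenset({"report", "confirm", "detect"}),
               frozenset({"disease", "location"})),
    EventFrame("fatality_report",
               frozenset({"die", "kill"}),
               frozenset({"disease", "victim_type"})),
]


def match_frames(features, frames=FRAMES):
    """Return names of frames whose trigger verb and required slots
    are all present in the feature vector."""
    matches = []
    for frame in frames:
        verb_ok = features.get("verb_root") in frame.trigger_verbs
        slots_ok = all(features.get(slot) for slot in frame.required_slots)
        if verb_ok and slots_ok:
            matches.append(frame.name)
    return matches


features = {"verb_root": "confirm", "disease": "cholera",
            "location": "Dhaka", "date": "2010-03-01"}
matched = match_frames(features)
unmatched = match_frames({"verb_root": "die", "disease": "cholera"})
print(matched)
```

An exact all-slots rule keeps the sketch simple; a scoring scheme that tolerates missing slots would be a natural alternative.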