Automated Data Labeling & Curation
Customer Challenge
The Army needed automated labeling of web data so that the large volumes of available data could be used to train explosive ordnance disposal (EOD) classification models.
Innovative Solution
Illumination Works prototyped its Theia™ automated data labeling solution. Theia’s automated pipelines apply natural language processing and unsupervised computer vision to curate datasets and to extract and label key information from text and image components. Additional processing cleans and deconflicts data points and stores metadata in a graph database to build and maintain an authoritative ontology for downstream analytics. An interactive user interface supports human-machine teaming, letting users search by text or image to focus on the data most informative for decision making.
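The deconfliction and graph storage steps described above can be illustrated with a minimal sketch. This is not Theia's implementation: the function names, the normalization rule, the sample triples, and the in-memory dictionary standing in for a graph database are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so near-duplicate labels merge."""
    return " ".join(label.lower().split())


def deconflict(triples):
    """Drop duplicate (subject, relation, object) triples after normalization.

    The order of first appearance is preserved.
    """
    seen = set()
    unique = []
    for subj, rel, obj in triples:
        key = (normalize(subj), normalize(rel), normalize(obj))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def build_graph(triples):
    """Build an adjacency map: subject -> set of (relation, object) edges.

    A plain dict stands in for a graph database here (assumption).
    """
    graph = defaultdict(set)
    for subj, rel, obj in triples:
        graph[subj].add((rel, obj))
    return dict(graph)


# Hypothetical triples, as might come from text and image labeling.
raw = [
    ("Technician", "inspects", "device"),
    ("technician", "inspects", "Device"),  # same fact, different casing
    ("Image 12", "depicts", "vehicle"),
]

clean = deconflict(raw)
graph = build_graph(clean)
```

In this sketch the two "technician" records collapse into one edge, so the graph holds a single authoritative entry per fact. A production pipeline would likely need fuzzier matching than case and whitespace normalization.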
Benefits/Outcomes
- Customized web scraper automatically mines the Internet for large volumes of data, speeding data collection and enhancing contextual awareness
- Techniques to extract entities and identify people and military equipment from images
- Self-learning and semi-supervised approach that can expand to other focus areas
Toolbox
- Open-source Python solution
- Application visualization: React library
- Data science: machine learning, computer vision, NLP, named entity recognition, knowledge graph, subject-verb-object extraction, self-learning network
Business Value
- Save significant time over manual labeling
- Inform analysts of dataset contents without requiring manual review
- Facilitate interconnected insights between textual and visual data sources
- Enable extension to other domains such as medicine, manufacturing, repair, and agriculture via self-learning approaches
Domain Expertise
- DoD
- Army
- Explosive ordnance disposal