Automated Data Labeling & Curation
Customer Challenge
The Army Intelligence Community (IC) seeks an automated, AI-based system for reliable data labeling and curation that lets users quickly search, filter, and select datasets for downstream analysis.
Innovative Solution
For this effort, Illumination Works prototyped our Theia automated data labeling solution. Theia’s automated pipelines apply natural language processing (NLP) and computer vision (CV) algorithms to curate datasets and to extract and label key information of interest. The solution processes each dataset automatically and identifies important labels and topics in its text and image components. Labels are stored in a graph database for downstream analytics, processing, and filtering. They are served to end users through an intuitive, interactive interface that gives Army IC analysts quick insight into the content and context of datasets without manual review.
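As a rough illustration of how labels stored as graph relationships can support search and filtering, the sketch below uses a small in-memory structure in place of a graph database. It is not Theia's implementation: the `LabelGraph` class, its method names, and the sample dataset IDs and labels are all hypothetical.

```python
from collections import defaultdict


class LabelGraph:
    """Toy in-memory stand-in for a graph database of dataset-to-label edges.

    Hypothetical illustration only; a production system would use a real
    graph database and its query language.
    """

    def __init__(self):
        self._labels_by_dataset = defaultdict(set)
        self._datasets_by_label = defaultdict(set)

    def add_label(self, dataset_id, label, source):
        """Record that `label` was found in `dataset_id` by a text or image pipeline."""
        normalized = label.lower()
        self._labels_by_dataset[dataset_id].add((normalized, source))
        self._datasets_by_label[normalized].add(dataset_id)

    def datasets_with(self, *labels):
        """Return the sorted dataset IDs that carry every requested label."""
        if not labels:
            return []
        matches = [self._datasets_by_label.get(l.lower(), set()) for l in labels]
        return sorted(set.intersection(*matches))

    def labels_for(self, dataset_id):
        """Return the sorted (label, source) pairs attached to one dataset."""
        return sorted(self._labels_by_dataset.get(dataset_id, set()))


# Hypothetical sample data: labels produced by text and image pipelines.
graph = LabelGraph()
graph.add_label("report_001", "tank", "text")
graph.add_label("report_001", "tank", "image")
graph.add_label("report_002", "tank", "image")
graph.add_label("report_002", "convoy", "text")

tank_datasets = graph.datasets_with("tank")
tank_and_convoy = graph.datasets_with("tank", "convoy")
```

Keeping the source of each label (text or image) on the edge is one way to let an analyst see when the same label is supported by both kinds of evidence in a dataset.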
Benefits/Outcomes
- Identified techniques to extract entities from textual data, including people, places, dates, and business-specific entities such as military equipment
- Applied methods to identify people and military equipment in images, yielding a self-learning, semi-supervised approach that can expand to other focus areas
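To make the idea of text entity extraction concrete, here is a deliberately simple sketch that tags dates and equipment terms with a regular expression and a keyword list. It is a stand-in, not the named entity recognition technique used in the prototype, and `EQUIPMENT_TERMS`, `DATE_PATTERN`, and `extract_entities` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical keyword list; a real pipeline would rely on a trained NER model.
EQUIPMENT_TERMS = {"tank", "helicopter", "radar"}

# Matches ISO-style dates such as 2021-03-14 (an assumed input format).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text):
    """Return sorted (label, value) pairs for dates and equipment terms in text."""
    entities = set()
    for match in DATE_PATTERN.finditer(text):
        entities.add(("DATE", match.group()))
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token in EQUIPMENT_TERMS:
            entities.add(("EQUIPMENT", token))
    return sorted(entities)


sample = "A tank and a radar were observed on 2021-03-14."
found = extract_entities(sample)
```

A keyword list like this only matches exact singular terms, which is one reason a learned model is usually preferred for entities that appear in many surface forms.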
Toolbox
- Open-source Python solution
- Application visualization: Adobe XD wireframes
- Data science: machine learning, CV, NLP, named entity recognition, knowledge graph, subject-verb-object extraction, self-learning network
Business Value
- Saves significant time over manual labeling
- Informs analysts of dataset content without manual review
- Connects insights across textual and visual data sources
- Enables extension to other domains such as medicine, manufacturing, repair, and agriculture via self-learning approaches
Domain Expertise
- DoD
- Intelligence Community
- Healthcare
- Market Intelligence