Automated Data Cleansing with Machine Learning
Customer Challenge
Poor data quality hinders the Department of the Navy’s (DON) ability to draw accurate, valuable insight from its data. Given the volume of errors, manual correction is neither effective nor efficient.
Innovative Solution
ILW data scientists implemented Phase I of our Automated Data Cleansing and Analysis Tool (ADCAT), which applies machine learning (ML) and probabilistic graphical modeling (PGM) to automatically cleanse DON data of errors. In Phase II, ILW enhanced and optimized the algorithms, added model quality monitoring, and built a user interface to improve healing functionality across domains, then deployed ADCAT to a DON production environment.
Benefits/Outcomes
- Robust natural language processing (NLP) and ML classifier models achieve 96–99.8% accuracy
- ADCAT’s PGMs provide end users with the five most probable corrections for a given error; the correct value appeared among those five 98% of the time
- Opens the black box of ML error-correction logic with transparent, human-understandable explanations
- Scalable processes and automatic discovery methods enable new error correction models to be built quickly
- Human-in-the-loop option enables review and validation of ML-driven error corrections
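To make the "five most probable corrections" idea concrete, the following is a minimal sketch, not ADCAT's actual implementation. It assumes a naive Bayes model (the simplest Bayesian network) built from counts over clean records, ranks candidate values for an erroneous field by posterior probability, and measures how often the true value lands in the top k. The field names (`part`, `symptom`, `code`), the sample records, and the helper functions are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_counts(records, target, evidence):
    """Count target values and their co-occurrence with each evidence field."""
    prior = Counter()
    cond = {field: defaultdict(Counter) for field in evidence}
    for rec in records:
        value = rec[target]
        prior[value] += 1
        for field in evidence:
            cond[field][value][rec[field]] += 1
    return prior, cond


def top_k_corrections(record, prior, cond, k=5, alpha=1.0):
    """Rank candidate target values by naive Bayes posterior (Laplace-smoothed)."""
    total = sum(prior.values())
    scores = {}
    for value, n_value in prior.items():
        p = n_value / total
        for field, table in cond.items():
            # Distinct values seen for this evidence field, used for smoothing.
            vocab = {v for counts in table.values() for v in counts}
            seen = table[value][record[field]]
            p *= (seen + alpha) / (n_value + alpha * len(vocab))
        scores[value] = p
    norm = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(value, score / norm) for value, score in ranked[:k]]


def top_k_hit_rate(cases, target, prior, cond, k=5):
    """Fraction of cases whose true target value appears in the top-k list."""
    hits = 0
    for rec in cases:
        candidates = [v for v, _ in top_k_corrections(rec, prior, cond, k=k)]
        hits += rec[target] in candidates
    return hits / len(cases)


# Hypothetical clean records; "code" is the field to be corrected.
records = [
    {"part": "pump", "symptom": "leak", "code": "A1"},
    {"part": "pump", "symptom": "leak", "code": "A1"},
    {"part": "pump", "symptom": "noise", "code": "B2"},
    {"part": "valve", "symptom": "leak", "code": "C3"},
    {"part": "valve", "symptom": "stuck", "code": "C3"},
]
prior, cond = fit_counts(records, target="code", evidence=["part", "symptom"])

# A record whose "code" is missing or flagged as erroneous.
ranked = top_k_corrections({"part": "pump", "symptom": "leak"}, prior, cond)
for value, prob in ranked:
    print(f"{value}: {prob:.2f}")
```

Returning a ranked list with probabilities, rather than a single replacement, is what lets a reviewer see the alternatives and accept or override a suggestion. The per-field counts behind each score can also be surfaced as a simple explanation of why a value was proposed.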
Business Value
- Improved analyst productivity: less time correcting data, increased focus on core mission tasks
- Higher-quality data: higher-confidence, data-informed decisions and cost savings
Toolbox
- Supervised/unsupervised ML
- Probabilistic graphical models (Bayesian Networks)
- Natural language processing
- Open-source Python solution using DoD-compatible libraries
Domain Expertise
- NAVAIR maintenance data
- NAVSEA labor data