Data Quality Framework for Large-Scale Enterprise Data and ML Systems

© 2024 by IJCTT Journal
Volume-72 Issue-2
Year of Publication: 2024
Authors: Mitesh Mangaonkar
DOI: 10.14445/22312803/IJCTT-V72I2P116

How to Cite?

Mitesh Mangaonkar, "Data Quality Framework for Large-Scale Enterprise Data and ML Systems," International Journal of Computer Trends and Technology, vol. 72, no. 2, pp. 92-98, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I2P116

Abstract
In today's rapidly changing data landscape, enterprises depend on high-quality data to maintain a competitive advantage and make informed decisions. This research presents a robust Data Quality Framework tailored to machine learning (ML) systems and large volumes of enterprise data. The framework combines rigorous data governance standards, comprehensive quality metrics, and supporting technology, enabling it to accommodate diverse data types and align with enterprise-level analytics. Real-world examples from several industries illustrate the framework in operation and show how applying AI and ML techniques to conventional data management processes marks a significant shift in practice. The paper closes with an outlook on emerging trends in data quality management and strategic recommendations for organizations seeking to improve data fidelity.
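
The article itself does not include code. As a rough, hedged illustration of the kind of rule-based quality measurements such a framework might compute over enterprise records, the sketch below uses the pandas library with hypothetical column names, rules, and thresholds; it is not the paper's actual framework.

import pandas as pd

# Illustrative sample only: hypothetical enterprise records with typical defects
# (missing values, duplicate keys, malformed fields, out-of-range amounts).
records = pd.DataFrame({
    "customer_id": [101, 102, 102, 104, None],
    "email": ["a@example.com", "b@example.com", None, "not-an-email", "e@example.com"],
    "order_total": [250.0, -10.0, 99.9, 1200.5, 35.0],
})

def completeness(series: pd.Series) -> float:
    """Share of values in a column that are not null."""
    return series.notna().mean()

def uniqueness(series: pd.Series) -> float:
    """Share of non-null values that are unique (e.g., for key columns)."""
    non_null = series.dropna()
    return non_null.nunique() / len(non_null) if len(non_null) else 0.0

def validity(series: pd.Series, rule) -> float:
    """Share of non-null values that satisfy a business rule."""
    non_null = series.dropna()
    return rule(non_null).mean() if len(non_null) else 0.0

# Assemble a simple quality report; metric names and the 0.9 threshold are assumptions.
report = {
    "customer_id_completeness": completeness(records["customer_id"]),
    "customer_id_uniqueness": uniqueness(records["customer_id"]),
    "email_validity": validity(records["email"], lambda s: s.str.contains("@", regex=False)),
    "order_total_validity": validity(records["order_total"], lambda s: s >= 0),
}

for metric, score in report.items():
    status = "OK" if score >= 0.9 else "REVIEW"
    print(f"{metric}: {score:.2f} [{status}]")

In practice, checks like these would run continuously in data pipelines, and the resulting scores could feed governance dashboards or ML-based anomaly detection of the kind the framework describes.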

Keywords
Data quality framework, Machine Learning, Data governance, Data management.
