An Analysis of Data Quality Requirements for Machine Learning Development Pipelines Frameworks

© 2023 by IJCTT Journal
Volume-71 Issue-8
Year of Publication : 2023
Authors : Sandeep Rangineni
DOI :  10.14445/22312803/IJCTT-V71I8P103

How to Cite?

Sandeep Rangineni, "An Analysis of Data Quality Requirements for Machine Learning Development Pipelines Frameworks," International Journal of Computer Trends and Technology, vol. 71, no. 8, pp. 16-27, 2023. Crossref, https://doi.org/10.14445/22312803/IJCTT-V71I8P103

Abstract
This study explores the importance of meeting data quality standards in Machine Learning (ML) development pipelines and examines why good data is crucial to deploying ML models with confidence. The primary goal of this research is to isolate and examine the most important aspects of data quality inside ML pipelines and how they affect model performance and generalizability. Through an in-depth analysis of multiple phases of the ML pipeline, encompassing data collection, preprocessing, model training, and validation, the study traces the complex connection between data quality and ML model performance. It highlights the importance of data quality in reducing bias, improving predictive accuracy, and making ML models more robust to outside influences. The study also elaborates on the possible consequences of ignoring data quality issues by highlighting the difficulties posed by data noise, incompleteness, and bias. The data quality criteria it sets out cover accuracy, consistency, completeness, relevance, and ethical considerations. The study's relevance rests on providing a holistic perspective on the crucial importance of data quality within the landscape of ML development. The survey results provide ML professionals and businesses with a better appreciation for the importance of high-quality data in building trustworthy ML models. Recognizing and meeting these data quality needs facilitates trust in ML model outputs, the adoption of ethical data practices, and the effective dissemination of ML tools.

Keywords
Data innovation, Data ecosystems, Machine learning, Data quality, Data management.
