Please use this identifier to cite or link to this item: http://elartu.tntu.edu.ua/handle/lib/39655
Title: Управління якістю даних в ETL процесі в умовах обмежених ресурсів
Other titles: Data Quality Management in ETL Process under Resource Constraints
Authors: Kashosi, Aser
Affiliation: Ternopil Ivan Puluj National Technical University (TNTU), Faculty of Computer Information Systems and Software Engineering, Department of Computer Science, Ternopil, Ukraine
Bibliographic description: Kashosi A. Data Quality Management in ETL Process under Resource Constraints: master's qualification thesis in specialty "124 - Systems Analysis" / A. Kashosi. Ternopil: TNTU, 2022. 60 p.
Date of publication: 22-Dec-2022
Date of submission: 7-Dec-2022
Date of entry: 27-Dec-2022
Country (code): UA
Place of publication or event: Ternopil Ivan Puluj National Technical University (TNTU), FIS, Ternopil, Ukraine
Supervisor: Zahorodna, Nataliia Volodymyrivna
Committee members: Holotenko, Oleksandr Serhiiovych
UDC: 004.04
Subjects: Data Management Platform
Data Quality
ETL
Big Data
Stratified Sampling
Abstract: Many companies, particularly those engaged in marketing, need access to data to make decisions that improve the quality of their services and businesses. The information they need is frequently spread across several sources in a variety of formats. To uphold the quality of the information offered to data consumers, a system is implemented that consolidates these data sources for analysis and decision-making. This study addresses the evaluation of data quality (DQ) in an ETL process developed to support a marketing data management platform. More specifically, it addresses the problem of evaluating the quality of high-volume data. DQ assessment at high ingestion rates is beyond the scope of this study, which focuses on data quality assessment when vertical or horizontal scaling of the ingestion system is limited. We also analyze the use of the developed model on real data to gauge the improvement in data quality in the ETL. The methodology consisted of studying each feature that characterizes high-quality data and analyzing its impact in an ETL process that handles voluminous data. We propose algorithms for a more generalizable integration DQ assessment. We conducted a practical implementation study of the proposed criteria and characteristics to evaluate the impact of the data collected throughout the process of data Extraction, Transformation, and Loading. We highlight a quality assessment framework that models the necessary parts of the process, including data sources, metrics characterizing data quality, the data destination, and the analysis and performance of the algorithms used in the assessment process.
The practical ETL implementation in this research is based on a Directed Acyclic Graph (DAG) model, whose main purpose is to extract, transform, and transmit data from this first service to the rest of the Marketing Data Management Platform infrastructure, which is treated as the end user. Quality evaluation is based on algorithms that take source data as input, together with predefined properties describing the expected result of the ETL transformation, and produce the evaluation result. The evaluation findings may be used to support or contradict the quality standards. When a DQ check fails, decisions are made to improve and enhance the data. We suggest including data checks at the very end of the ETL data manipulation process, as well as a model for data volume reduction using algorithms intended to make the procedure more generic and to enable quick review. The data evaluated during the test is a statistical representation of the ingested dataset, which provides an accurate profile that enables user applications to retrieve high-quality data without delay. The main contributions of this thesis are: i) the development of an ETL service in a Marketing Data Management Platform and ii) an examination of data reduction models with a view to assessing data quality. Chapter 1 presents a literature review and describes the basic concepts and their definitions in other research, including sampling, ETL, Data Quality, and Big Data. Chapter 2 presents how the ETL system fits into the framework of the data management platform and how the entire architecture is modelled. Chapter 3 presents the outcomes of the experiment, which were obtained using various types of actual data. The performance over time and the effect of the stratified sample are depicted in graphs.
The closing part presents the conclusions of this thesis and discusses prospective research.
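The idea of assessing data quality on a stratified sample rather than on the full ingested dataset can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `stratified_sample` and `completeness`, the `source` and `email` fields, the 10% sampling fraction, and the choice of completeness as the DQ metric are all assumptions made for illustration. The sketch uses proportional allocation, so the metric computed on the sample serves as an estimate of the metric on the whole dataset.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of rows from every stratum (proportional allocation).

    `key` is the (hypothetical) field used to form strata, e.g. the data source.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for name in sorted(strata):
        rows = strata[name]
        # Keep at least one row per stratum so small strata are still represented.
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample


def completeness(records, field):
    """Share of records whose `field` is present and non-empty (a simple DQ metric)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


# Illustrative data: two sources with different rates of missing e-mail values.
records = (
    [{"source": "crm", "email": "a@example.com" if i % 10 else None} for i in range(1000)]
    + [{"source": "web", "email": "b@example.com" if i % 2 else ""} for i in range(200)]
)

sample = stratified_sample(records, key="source", fraction=0.1)
print("rows checked:", len(sample), "of", len(records))
print("estimated completeness:", completeness(sample, "email"))
print("full-scan completeness:", completeness(records, "email"))
```

Because every stratum contributes in proportion to its size, a source with poor quality but few rows is neither ignored nor over-weighted, while the check touches only a fraction of the rows.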
Contents:
Data Quality Management in ETL Process under Resource Constraints
Notes
INTRODUCTION
1.1 ETL
1.1.1 ETL definitions
1.1.2 ETL tools review
1.2 Data quality
1.2.1 Data Quality Dimensions
1.2.2 Data Quality Objectives in the Context of ETL
1.2.3 ISO Data Quality Standards
1.3 TPC-DI benchmark
1.4 Big data
1.4.1 Vs and Big Data
1.4.2 Batch processing and Big Data
1.5 Sampling for big data
1.5.1 A taxonomy for Big Data sampling techniques
CHAPTER 2. SYSTEM MODELING
2.1 ETL model
2.2 System architecture overview
2.2.1 Metadata store
2.2.2 Horizontal autoscaling environment
2.2.3 Workflow runner
CHAPTER 3. MEASUREMENT RESULTS
3.1 Estimation of the Population Mean
3.2 Performance evaluation of stratified random sampling for DQ assessment
CHAPTER 4. LABOUR PROTECTION AND SAFETY IN EMERGENCY
4.1 Introduction
4.2 Need for guidelines
4.2.1 Software quality
4.2.2 Static analysis
4.2.3 Automated static analysis tools
4.3 Universal standards
4.4 Challenges in safety critical systems
4.5 Similarities between Different Standards
4.6 Conclusion to safety
CONCLUSIONS
BIBLIOGRAPHY
URI (Uniform Resource Identifier): http://elartu.tntu.edu.ua/handle/lib/39655
Copyright owner: © Aser Kashosi, 2022
Content type: Master Thesis
Appears in collections: 124 - Systems Analysis

Files in this item:
File: А. Kashosi.pdf
Size: 817,93 kB
Format: Adobe PDF


All items in the electronic resources archive are protected by copyright, with all rights reserved.
