Data-Driven Innovation: The Potential of Synthetic Data through Generative AI
Chan, Chung Yin (2024)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202405079846
Abstract
Data scarcity is a persistent challenge for AI decision-making across data-driven industries. Research shows that demand for larger training datasets is growing, while current data collection methods are increasingly constrained by scarcity and privacy concerns. This thesis aims to address data scarcity through synthetic data generation by exploring the capabilities of Generative AI.
Based on a literature review of Generative AI and synthetic data, the key insight is that synthetic data can substitute for insufficient real-world data in limitless quantities. While preserving the statistical representation of the original data (i.e. existing real-world data), synthetic data enhances the AI model development process, eliminates the risk of sensitive data exposure, and mitigates biases. In addition, a demonstration of synthetic data generation using a Variational Autoencoder (VAE) model and a customized open-source library (Syngen) is presented to reinforce these findings. Analysis of the output shows that the generated synthetic data varies by less than 2% from the original dataset when used for the same prediction task.
The results suggest that synthetic data can be a practical solution for overcoming limitations in data acquisition. On this basis, it is recommended that data-driven industries use synthetic data to unlock new opportunities in the current data-driven era. Further research is required to identify factors that could diminish data cardinality when handling complex categorical data, with the aim of improving the reliability and effectiveness of synthetic data.
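The comparison described in the abstract (using original and synthetic data for the same prediction task) can be illustrated with a minimal sketch. It is not the thesis's VAE or Syngen pipeline: as a stand-in generator it fits a per-class Gaussian to a toy dataset, and it uses a nearest-centroid classifier as the prediction task. All function names, the toy data, and the sample sizes are hypothetical choices made for illustration only.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy 'real' dataset (hypothetical): two well-separated 2-D classes."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 5.0
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


def fit_generator(rows):
    """Stand-in for a generative model: per-class, per-feature mean and std."""
    params = {}
    for label in {y for _, y in rows}:
        feats = [x for x, y in rows if y == label]
        params[label] = [
            (statistics.mean(col), statistics.stdev(col)) for col in zip(*feats)
        ]
    return params


def sample_synthetic(params, n, rng):
    """Draw synthetic rows from the fitted per-class Gaussians.

    Labels are drawn uniformly, which ignores the class balance of the
    original data; acceptable for this balanced toy example.
    """
    labels = sorted(params)
    rows = []
    for _ in range(n):
        label = rng.choice(labels)
        rows.append((tuple(rng.gauss(m, s) for m, s in params[label]), label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in {y for _, y in rows}:
        feats = [x for x, y in rows if y == label]
        centroids[label] = tuple(statistics.mean(col) for col in zip(*feats))
    return centroids


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    def predict(x):
        return min(
            centroids,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
        )
    return sum(predict(x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
real_train = make_real_data(400, rng)
real_test = make_real_data(200, rng)
synthetic_train = sample_synthetic(fit_generator(real_train), 400, rng)

# Train once on real data and once on synthetic data; test both on real data.
acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic_train), real_test)
gap = abs(acc_real - acc_synth)
print(f"real: {acc_real:.3f}  synthetic: {acc_synth:.3f}  gap: {gap:.3f}")
```

Both models are scored on the same held-out real data, so the gap measures how much predictive signal the synthetic set retains. A real evaluation would swap in the actual generator and prediction model.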
Similar items
Showing items with similar titles, authors, or subject keywords.
- Data Strategy Handbook as Guide Towards Data-Driven Organization
  Piippola, Timo-Joel (2024) The need for an organizational data culture is evident in the digital era. More organizations are making data-driven decisions, viewing data as a crucial business asset. This thesis aimed to help a case company enhance its ...
- Big datan käyttö liiketoiminnan ennustamiseen: tieliikenneonnettomuudet Suomessa [Using big data for business forecasting: road traffic accidents in Finland]
  Alto, Olga (2019) The purpose of this thesis is to determine what information can be predicted from large volumes of data. The material consists of open sources on traffic accidents in Finland from 2015 to 2017. The thesis predicts ...
- Recognizing the value of data in business operations : Data analytics for business operation
  Duma, Don (2022) The aim of this study was to demonstrate the hidden value of data that can be extracted with few commercial and open-source software tools. Any given business can collect, organize, and extract data for analysis that can ...