Before discussing data mining, we need to understand the fundamentals of data. Data is a set of discrete facts about an event, a process, a measurement, or an analysis. We collect and create data every day: a few examples are the data from our IoT (Internet of Things) devices, text documents, spatial data, multimedia channels, and hypertext documents. However, data is of little use by itself unless it is converted into information. That information is easy to extract when the amount of data is relatively small, but once the volume grows beyond a certain point, deriving the right information becomes difficult or takes a long time. This is where data mining comes in.
According to Wikipedia, data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. According to William J. Frawley, however, “Data mining or KDD (Knowledge Discovery in Databases) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.” I prefer the latter definition because it pinpoints what we want from data, which is to retrieve “useful information”. We could even say that data mining is the set of techniques applied to mine, or extract, meaning from a huge amount of data.
Data mining is interdisciplinary and uses knowledge from the areas of computer science, mathematics, and statistics for the computer-aided analysis of data sets. Among other things, artificial intelligence methods are used to examine large data sets for new cross-connections, trends, or patterns. Data mining extracts these connections automatically and makes them available in support of higher-level goals. The patterns identified can help facilitate decision-making on specific problems.
Companies use data mining software to learn more about their customers. It can help them to develop more effective marketing strategies, increase sales, and decrease costs.
Now that we know what data mining is, let's dive into how it is done. Data mining is typically done by data scientists and other skilled BI and analytics professionals. The steps required to turn raw, meaningless chunks of data into meaningful information can be organized into data gathering, data preparation, and data analysis. Let's see what the activities at each step are.
The first step is all about how to get the “relevant data”. Relevant data is the data that contains business information related to the goal we want to achieve. One first needs to reflect on what the company is trying to achieve by mining data. What is its current business situation? Does the company have a set of data that aligns with the goal? If not, how can that data be obtained?
After the relevant data is identified, we can start the gathering process. The data may be located in different source systems, in a data warehouse, or in a data lake, an increasingly common repository in big data environments that may contain a mix of structured and unstructured data. External data sources may also be used. Wherever the data comes from, the data scientist moves it, if necessary, to a data lake for the remaining steps.
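As a rough illustration of gathering, the sketch below combines two sources into one list of records. Everything in it is an assumption made for the example: the two sources (`CRM_CSV`, `SHOP_JSON`), their field names, and the `gather` helper are hypothetical, and a real project would read from actual source systems rather than inline strings.

```python
import csv
import io
import json

# Hypothetical extract from a CRM system, as CSV text (assumption).
CRM_CSV = """customer_id,name,city
1,Alice,Berlin
2,Bob,Paris
"""

# Hypothetical extract from a web shop, as JSON text (assumption).
SHOP_JSON = (
    '[{"customer_id": 1, "total": 120.5},'
    ' {"customer_id": 2, "total": 80.0},'
    ' {"customer_id": 1, "total": 30.0}]'
)


def gather(crm_csv, shop_json):
    """Combine both sources into one list of records, one per customer."""
    customers = {
        int(row["customer_id"]): {"name": row["name"], "city": row["city"], "orders": []}
        for row in csv.DictReader(io.StringIO(crm_csv))
    }
    for order in json.loads(shop_json):
        customer = customers.get(order["customer_id"])
        if customer is not None:  # skip orders with no matching customer
            customer["orders"].append(order["total"])
    return [{"customer_id": cid, **data} for cid, data in sorted(customers.items())]


records = gather(CRM_CSV, SHOP_JSON)
for record in records:
    print(record)
```

Keeping one combined record per customer is only one possible design; the point is that data from separate systems ends up in a single place before preparation starts.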
Now that we have a goal and a data set that matches it, we need to understand the data. This stage includes a set of steps to get the data ready to be mined. It starts with data exploration, profiling, and pre-processing, followed by data cleansing work to fix errors and other data quality issues. Data transformation is also done to make data sets consistent.
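The preparation steps described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the sample rows in `RAW`, the column names, the cleaning rules (drop exact duplicates and rows with an unusable age), and the helper names `profile`, `clean`, and `scale_age` are all hypothetical.

```python
# Hypothetical raw rows with typical quality problems (assumption).
RAW = [
    {"customer_id": "1", "city": " berlin ", "age": "34"},
    {"customer_id": "2", "city": "Paris", "age": ""},       # missing age
    {"customer_id": "2", "city": "Paris", "age": ""},       # exact duplicate
    {"customer_id": "3", "city": "PARIS", "age": "forty"},  # invalid age
    {"customer_id": "4", "city": "Rome", "age": "51"},
]


def profile(rows):
    """Profiling: count empty values per column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (1 if value.strip() == "" else 0)
    return counts


def clean(rows):
    """Cleansing: drop exact duplicates and rows without a numeric age."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        age = row["age"].strip()
        if not age.isdecimal():
            continue
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "city": row["city"].strip().title(),  # consistent spelling
            "age": int(age),
        })
    return cleaned


def scale_age(rows):
    """Transformation: min-max scale age into the range [0, 1]."""
    ages = [r["age"] for r in rows]
    lo, hi = min(ages), max(ages)
    span = (hi - lo) or 1  # avoid division by zero when all ages are equal
    return [{**r, "age_scaled": (r["age"] - lo) / span} for r in rows]


print(profile(RAW))
prepared = scale_age(clean(RAW))
for row in prepared:
    print(row)
```

Dropping incomplete rows is the simplest possible policy; depending on the goal, filling in missing values may be a better choice.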
With our clean data set in hand, it's time to do the magic. Data scientists can use various techniques, such as association rules, classification, clustering, decision trees, K-nearest neighbor (KNN), neural networks, and predictive analysis, to search for relationships, trends, associations, or sequential patterns.
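To make one of these techniques concrete, here is a minimal sketch of K-nearest neighbor classification in plain Python. The training data, the two features (age and yearly spend), the segment labels, and the `knn_predict` helper are assumptions made for illustration; in practice a library implementation would normally be used, and features would usually be scaled first so that no single feature dominates the distance.

```python
import math
from collections import Counter

# Hypothetical training data: (age, yearly_spend) -> customer segment (assumption).
TRAINING = [
    ((25, 200.0), "occasional"),
    ((30, 250.0), "occasional"),
    ((35, 300.0), "occasional"),
    ((40, 900.0), "loyal"),
    ((45, 1000.0), "loyal"),
    ((50, 1100.0), "loyal"),
]


def knn_predict(training, point, k=3):
    """Return the majority label among the k training points closest to `point`."""
    # Sort by Euclidean distance to the query point and keep the k nearest.
    neighbours = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict(TRAINING, (28, 220.0)))  # occasional
print(knn_predict(TRAINING, (48, 950.0)))  # loyal
```

The idea is simply that a new customer is assigned the segment that is most common among the most similar known customers.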
The data-centered aspect of data mining concludes by assessing the findings of the data model or models. The outcomes from the analysis may be aggregated, interpreted, and presented to decision-makers who have largely been excluded from the data mining process up to this point.
One of the best ways to present the results to decision-makers, business executives, and users is through data visualization and the use of data storytelling techniques. Organizations can then choose to make decisions based on the findings.
In today's age of information, almost any department, industry, sector, or company can make use of data mining. There are various use cases in sales, marketing, manufacturing, and fraud detection. But even though data mining can drive profitability and efficiency, it is quite complex, and its results and benefits are not guaranteed.