Data Mining Quick Guide

There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information. It is necessary to analyze this huge amount of data and extract useful information from it.

Extraction of information is not the only process we need to perform; data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern Evaluation, and Data Presentation. Once all these processes are over, we are able to use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.

What is Data Mining?

Data mining is defined as extracting information from large sets of data. In other words, data mining is the process of mining knowledge from data. The information or knowledge extracted in this way can be used in the applications described below.

Data Mining Applications

Data mining is very useful in the following domains:

Market Analysis and Management

Corporate Analysis & Risk Management

Apart from these, data mining can also be used in the areas of production control, customer retention, science exploration, sports, astronomy, and Internet Web Surf-Aid.

Market Analysis and Management

Listed below are the various fields of the market where data mining is used:

Customer Profiling: Data mining helps determine what kind of people buy what kind of products.

Identifying Customer Requirements: Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.

Cross Market Analysis: Data mining performs association and correlation analysis between product sales.

Target Marketing: Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.

Determining Customer Purchasing Pattern: Data mining helps in determining customer purchasing patterns.

Providing Summary Information: Data mining provides us with various multidimensional summary reports.

Corporate Analysis and Risk Management

Data mining is used in the following fields of the corporate sector:

Finance Planning and Asset Evaluation: It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.

Resource Planning: It involves summarizing and comparing the resources and spending.

Competition: It involves monitoring competitors and market directions.

Fraud Detection

Data mining is also used in the fields of credit card services and telecommunication to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of day or week, etc. It also analyzes the patterns that deviate from expected norms.

Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining: descriptive functions, and classification and prediction functions.

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the list of descriptive functions:

  • Class/Concept Description
  • Mining of Frequent Patterns
  • Mining of Associations
  • Mining of Correlations
  • Mining of Clusters

Class/Concept Description

Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include laptops and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:

Data Characterization: This refers to summarizing the data of the class under study. This class under study is called the Target Class.

Data Discrimination: It refers to the mapping or classification of a class with some predefined group or class.

Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns:

Frequent Item Set: It refers to a set of items that frequently appear together, for example, milk and bread.

Frequent Subsequence: A sequence of patterns that occurs frequently, such as the purchase of a camera being followed by the purchase of a memory card.

Frequent Substructure: Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.
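
As a rough illustration of frequent item sets, the sketch below counts how often pairs of items appear together in a handful of made-up transactions. The transactions, the function name `frequent_pairs`, and the `min_count` threshold are all assumptions introduced for this example; real systems use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set holds the items bought in one purchase.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"camera", "memory card"},
]

def frequent_pairs(transactions, min_count):
    """Return item pairs that appear together in at least min_count transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
print(pairs)  # {('bread', 'milk'): 3}
```

Only the milk and bread pair reaches the threshold here, which matches the example given in the text.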

Mining of Associations

Associations are used in retail sales to identify items that are frequently purchased together. Association mining refers to the process of uncovering the relationships among data and determining association rules.

For example, a retailer generates an association rule that shows that milk is sold with bread 70% of the time, while biscuits are sold with bread only 30% of the time.
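
The percentages in such a rule are usually called the confidence of the rule. The sketch below shows one way to compute support and confidence; the ten baskets are invented so that the numbers match the 70% and 30% figures above, and the function names are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical baskets: all ten contain bread, seven also contain milk,
# and three also contain biscuits.
baskets = [{"bread", "milk"}] * 7 + [{"bread", "biscuits"}] * 3

milk_rule = confidence(baskets, {"bread"}, {"milk"})
biscuit_rule = confidence(baskets, {"bread"}, {"biscuits"})
print(round(milk_rule, 2), round(biscuit_rule, 2))  # 0.7 0.3
```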

Mining of Correlations

It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to determine whether they have a positive, negative, or no effect on each other.
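
The text does not name a particular correlation measure. One commonly used choice is lift, sketched below as an assumption: a lift above 1 suggests a positive correlation between two items, a lift below 1 a negative one, and a lift of exactly 1 suggests independence. The baskets are made up for the example.

```python
def lift(transactions, a, b):
    """Lift of items a and b: P(a and b) / (P(a) * P(b))."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both items must occur at least once")
    return p_ab / (p_a * p_b)

# Hypothetical baskets for illustration only.
baskets = [{"milk", "bread"}, {"milk", "bread"}, {"tea"}, {"coffee"}]

print(lift(baskets, "milk", "bread"))  # 2.0, bought together more often than chance
print(lift(baskets, "tea", "coffee"))  # 0.0, never bought together
```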

Mining of Clusters

A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms:

  • Classification (IF-THEN) Rules
  • Decision Trees
  • Mathematical Formulae
  • Neural Networks

The functions involved in these processes are as follows:

Classification: It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.

Prediction: It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.

Outlier Analysis: Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.

Evolution Analysis: Evolution analysis refers to describing and modeling regularities or trends for objects whose behavior changes over time.
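
To make the idea of an outlier concrete, the sketch below flags values that lie far from the mean, measured in standard deviations (a z-score test). This is only one simple technique among many, chosen here as an assumption; the call durations and the threshold of 2.0 are invented for the example.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]

print(z_score_outliers(call_minutes))  # [60]
```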

Data Mining Task Primitives

  • We can specify a data mining task in the form of a data mining query.
  • This query is input to the system.
  • A data mining query is defined in terms of data mining task primitives.

Note: These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of data mining task primitives:

  • Set of task-relevant data to be mined.
  • Kind of knowledge to be mined.
  • Background knowledge to be used in the discovery process.
  • Interestingness measures and thresholds for pattern evaluation.
  • Representation for visualizing the discovered patterns.

Set of task-relevant data to be mined

This is the portion of the database in which the user is interested. This portion includes the following:

  • Database attributes
  • Data warehouse dimensions of interest

Kind of knowledge to be mined

It refers to the kind of functions to be performed. These functions are:

  • Characterization
  • Discrimination
  • Association and Correlation Analysis
  • Classification
  • Prediction
  • Clustering
  • Outlier Analysis
  • Evolution Analysis

Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns that are discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.

Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These representations may include the following:

Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. In this tutorial, we will discuss the major issues regarding:

  • Mining Methodology and User Interaction
  • Performance Issues
  • Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:

Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge: Background knowledge can be used to guide the discovery process and to express the discovered patterns. It allows the discovered patterns to be expressed not only in concise terms but also at multiple levels of abstraction.

Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results: Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.

Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation: The patterns discovered may be uninteresting because they either represent common knowledge or lack novelty, so they must be evaluated for interestingness.

Performance Issues

There can be performance-related issues such as the following:

Efficiency and scalability of data mining algorithms: In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel. The results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
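
The partition, merge, and incremental-update ideas can be sketched with simple item counting. In the hedged example below, each partition is counted independently (here one after another, although each call could run on a separate worker), the partial counts are merged, and a later batch is folded in without recounting the old data. The partitions and function names are assumptions made for illustration.

```python
from collections import Counter

def count_items(partition):
    """Count in how many transactions of one partition each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def merge_counts(partial_counts):
    """Merge the per-partition counts into one global result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical data split into two partitions.
partitions = [
    [["milk", "bread"], ["bread"]],
    [["milk"], ["milk", "bread"]],
]

# Each partition is independent, so these calls could run in parallel.
partials = [count_items(p) for p in partitions]
merged = merge_counts(partials)

# Incremental step: fold in a new batch without rescanning old partitions.
new_batch = [["bread", "biscuits"]]
updated = merged + count_items(new_batch)

print(merged["milk"], merged["bread"])        # 3 3
print(updated["bread"], updated["biscuits"])  # 4 1
```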

Diverse Data Types Issues

Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems: The data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Data Warehouse

A data warehouse exhibits the following characteristics to support the management’s decision-making process:

Subject Oriented: A data warehouse is subject oriented because it provides us with information around a subject rather than the organization’s ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on the ongoing operations; rather, it focuses on the modelling and analysis of data for decision-making.

Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

Time Variant: The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non-volatile: Non-volatile means that the previous data is not removed when new data is added. The data warehouse is kept separate from the operational database; therefore, frequent changes in the operational database are not reflected in the data warehouse.

Data Warehousing

Data warehousing is the process of constructing and using the data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making.

Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches:

  • Query-Driven Approach
  • Update-Driven Approach

Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of the Query-Driven Approach

When a query is issued on the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.

These queries are then mapped and sent to the local query processors.

The results from the heterogeneous sites are integrated into a global answer set.


This approach has the following disadvantages:

The query-driven approach needs complex integration and filtering processes.

It is very inefficient and very expensive for frequent queries.

This approach is expensive for queries that require aggregations.

Update-Driven Approach

Today’s data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

This approach has the following advantages:

This approach provides high performance.

The data can be copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.

Query processing does not require an interface with the processing at local sources.

From Data Warehousing (OLAP) to Data Mining (OLAM)

Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining, and mines knowledge in multidimensional databases. Here is the diagram that shows the integration of both OLAP and OLAM:

Importance of OLAM

OLAM is important for the following reasons:

High quality of data in data warehouses: Data mining tools are required to work on integrated, consistent, and cleaned data. These steps are very costly in the preprocessing of data. The data warehouses constructed by such preprocessing are valuable sources of high-quality data for OLAP as well as for data mining.

Available information processing infrastructure surrounding data warehouses: Information processing infrastructure refers to the accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, and reporting and OLAP analysis tools.

OLAP-based exploratory data analysis: Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.

Online selection of data mining functions: Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

Data Mining

Data mining is defined as extracting information from a huge set of data. In other words, data mining is mining knowledge from data. This information can be used for any of the following applications:

  • Market Analysis
  • Fraud Detection
  • Customer Retention
  • Production Control
  • Science Exploration

Data Mining Engine

The data mining engine is essential to the data mining system. It consists of a set of functional modules that perform the following functions:

  • Characterization
  • Association and Correlation Analysis
  • Classification
  • Prediction
  • Cluster analysis
  • Outlier analysis
  • Evolution analysis

Knowledge Base

This is the domain knowledge. This knowledge is used to guide the search or evaluate the interestingness of the resulting patterns.

Knowledge Discovery

Some people treat data mining the same as knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:

  • Data Cleaning
  • Data Integration
  • Data Selection
  • Data Transformation
  • Data Mining
  • Pattern Evaluation
  • Knowledge Presentation

User interface

The user interface is the module of the data mining system that enables communication between users and the data mining system. The user interface allows the following functionalities:

  • Interact with the system by specifying a data mining query task.
  • Provide information to help focus the search.
  • Mine based on the intermediate data mining results.
  • Browse database and data warehouse schemas or data structures.
  • Evaluate mined patterns.
  • Visualize the patterns in different forms.

Data Integration

Data integration is a data preprocessing technique that merges the data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.

Data Cleaning

Data cleaning is a technique that is applied to remove noisy data and correct the inconsistencies in data. Data cleaning involves transformations to correct the wrong data. Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse.

Data Selection

Data selection is the process where data relevant to the analysis task is retrieved from the database. Sometimes data transformation and consolidation are performed before the data selection process.


Clusters

A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.

Data Transformation

In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

What is Knowledge Discovery?

Some people don’t differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:

Data Cleaning: In this step, noise and inconsistent data are removed.

Data Integration: In this step, multiple data sources are combined.

Data Selection: In this step, data relevant to the analysis task is retrieved from the database.

Data Transformation: In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data Mining: In this step, intelligent methods are applied in order to extract data patterns.

Pattern Evaluation: In this step, data patterns are evaluated.

Knowledge Presentation: In this step, knowledge is represented.
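
The seven steps above can be strung together as a small pipeline. The sketch below is a toy illustration only: the two sources, the field names, the spending threshold of 100, and every function name are assumptions, and each step is far simpler than what a real system would do.

```python
# Hypothetical raw records from two sources; None marks a missing amount.
store_a = [
    {"customer": "ann", "amount": "120", "store": "A"},
    {"customer": "bob", "amount": None, "store": "A"},
]
store_b = [
    {"customer": "ann", "amount": "80", "store": "B"},
    {"customer": "cy", "amount": "40", "store": "B"},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate the total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + float(r["amount"])
    return totals

def mine(totals, threshold):
    """Data mining: label each customer by total spending."""
    return {c: ("big spender" if t >= threshold else "budget spender")
            for c, t in totals.items()}

def evaluate(patterns, label_of_interest):
    """Pattern evaluation: keep only the patterns considered interesting."""
    return {c: label for c, label in patterns.items() if label == label_of_interest}

def present(patterns):
    """Knowledge presentation: format the patterns as readable lines."""
    return [f"{c}: {label}" for c, label in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
records = select(records, ["customer", "amount"])
totals = transform(records)
report = present(evaluate(mine(totals, threshold=100.0), "big spender"))
print(report)  # ['ann: big spender']
```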

The following diagram shows the process of knowledge discovery:

There is a large variety of data mining systems available. Data mining systems may integrate techniques from the following:

  • Spatial Data Analysis
  • Information Retrieval
  • Pattern Recognition
  • Image Analysis
  • Signal Processing
  • Computer Graphics
  • Web Technology
  • Business
  • Bioinformatics

Data Mining System Classification

A data mining system can be classified according to the following criteria:

  • Database Technology
  • Statistics
  • Machine Learning
  • Information Science
  • Visualization
  • Other Disciplines

Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.

Classification Based on the Databases Mined

We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly.

For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.

Classification Based on the Kind of Knowledge Mined

We can classify a data mining system according to the kind of knowledge mined. This means the data mining system is classified on the basis of functionalities such as:

  • Characterization
  • Discrimination
  • Association and Correlation Analysis
  • Classification
  • Prediction
  • Outlier Analysis
  • Evolution Analysis

Classification Based on the Techniques Utilized

We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.

Classification Based on the Applications Adapted

We can classify a data mining system according to the applications adapted. These applications are as follows:

Integrating a Data Mining System with a DB/DW System

If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.

The list of integration schemes is as follows:

No Coupling: In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.

Loose Coupling: In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or in a data warehouse.

Semi-tight Coupling: In this scheme, the data mining system is linked with a database or a data warehouse system and, in addition to that, efficient implementations of a few data mining primitives can be provided in the database.

Tight Coupling: In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.

The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. DMQL is based on the Structured Query Language (SQL). Data mining query languages can be designed to support ad hoc and interactive data mining. DMQL provides commands for specifying primitives. It can work with databases as well as data warehouses, and it can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.

Syntax for Task-Relevant Data Specification

Here is the syntax of DMQL for specifying task-relevant data:

Syntax for Specifying the Kind of Knowledge

Here we will discuss the syntax for Characterization, Discrimination, Association, Classification, and Prediction.

The syntax for characterization is:

The analyze clause specifies aggregate measures, such as count, sum, or count%.

The syntax for discrimination is:

For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items at less than $100 on average. The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as:

The syntax for association is:

where X is the key of the customer relation, P and Q are predicate variables, and W, Y, and Z are object variables.

The syntax for classification is:

For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the mined classification is named classifyCustomerCreditRating.

The syntax for prediction is:

Syntax for Concept Hierarchy Specification

To specify concept hierarchies, use the following syntax:

We use different syntaxes to define different types of hierarchies, such as:

Syntax for Interestingness Measures Specification

Interestingness measures and thresholds can be specified by the user with the statement:

Syntax for Pattern Presentation and Visualization Specification

We have a syntax that allows users to specify the display of discovered patterns in one or more forms.

Full Specification of DMQL

As a market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer’s age, the type of item purchased, and the place where the item was purchased. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid with an American Express credit card. You would like to view the resulting descriptions in the form of a table.

Data Mining Languages Standardization

Standardizing the data mining languages will serve the following purposes:

  • Helps the systematic development of data mining solutions.
  • Improves interoperability among multiple data mining systems and functions.
  • Promotes education and rapid learning.
  • Promotes the use of data mining systems in industry and society.

There are two forms of data analysis that can be used for extracting models describing important classes or for predicting future data trends. These two forms are classification and prediction.

Classification models predict categorical class labels, and prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

What is classification?

The following are examples of cases where the data analysis task is classification:

A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.

A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.

What is prediction?

The following is an example of a case where the data analysis task is prediction:

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example, we are asked to predict a numeric value. Therefore, the data analysis task is an example of numeric prediction. In this case, a model or predictor will be constructed that predicts a continuous-valued function or ordered value.

Note: Regression analysis is a statistical methodology that is most often used for numeric prediction.
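
As a minimal sketch of numeric prediction by regression, the code below fits a straight line with ordinary least squares and uses it to predict spending from income. The income and spending figures, their units, and the function names are made up for this example; real data would not fall exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted line to a new input value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: income and spending during a sale,
# both in thousands of dollars.
incomes = [20, 40, 60, 80]
spending = [1.0, 2.0, 3.0, 4.0]

model = fit_line(incomes, spending)
print(round(predict(model, 100), 2))  # 5.0
```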

How Does Classification Work?

With the help of the bank loan application that we have discussed above, let us understand the working of classification. The data classification process includes two steps:

  • Building the Classifier or Model
  • Using the Classifier for Classification

Building the Classifier or Model

This step is the learning step or the learning phase.

In this step, the classification algorithms build the classifier.

The classifier is built from the training set made up of database tuples and their associated class labels.

Each tuple in the training set is assumed to belong to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.

Using the Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
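
The two steps can be sketched with a deliberately simple classifier. The example below is an assumption made for illustration, not the method of any particular system: it learns the average income of each class from labelled training tuples (a nearest-centroid rule), estimates accuracy on separate test tuples, and only then classifies a new applicant. All incomes, labels, and the 0.7 acceptance level are invented.

```python
def build_classifier(training_set):
    """Learning step: compute the mean income for each class label."""
    sums, counts = {}, {}
    for income, label in training_set:
        sums[label] = sums.get(label, 0.0) + income
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(model, income):
    """Classification step: pick the class whose mean income is closest."""
    return min(model, key=lambda label: abs(income - model[label]))

def accuracy(model, test_set):
    """Fraction of test tuples whose predicted label matches the true label."""
    correct = sum(classify(model, income) == label for income, label in test_set)
    return correct / len(test_set)

# Hypothetical loan applicants: (income in thousands, class label).
training_set = [(20, "risky"), (25, "risky"), (30, "risky"),
                (70, "safe"), (80, "safe"), (90, "safe")]
test_set = [(22, "risky"), (85, "safe"), (60, "safe"), (50, "safe")]

model = build_classifier(training_set)
score = accuracy(model, test_set)
print(score)  # 0.75

# Apply the classifier to new tuples only if the accuracy is acceptable.
if score >= 0.7:
    print(classify(model, 75))  # safe
```

The last test tuple is misclassified on purpose, to show that the accuracy estimate comes from data the classifier did not see during learning.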

Classification and Prediction Issues

The major kwestie is preparing the gegevens for Classification and Prediction. Preparing the gegevens involves the following activities &minus,

Gegevens Cleaning &minus, Gegevens cleaning involves removing the noise and treatment of missing values. The noise is liquidated by applying smoothing mechanisms and the problem of missing values is solved by substituting a missing value with most commonly occurring value for that attribute.

Relevance Analysis: A database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.

Data Transformation and Reduction: The data can be transformed by any of the following methods.

Normalization: The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step.

Generalization: The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.

Note: Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.
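
One common way to carry out the normalization step is min-max scaling. The sketch below is an illustration under that assumption; the text does not prescribe a particular normalization formula, and the function name and income values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical income attribute
incomes = [12000, 54000, 98000]
normalized = min_max_normalize(incomes)
print(normalized)
```

After scaling, an attribute with a large numeric range such as income no longer dominates attributes with small ranges when distances are computed.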

Comparison of Classification and Prediction Methods

Here are the criteria for comparing methods of classification and prediction:

Accuracy: The accuracy of a classifier refers to its ability to predict the class label correctly. The accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.

Speed: This refers to the computational cost of generating and using the classifier or predictor.

Robustness: This refers to the ability of the classifier or predictor to make correct predictions from noisy data.

Scalability: This refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.

Interpretability: This refers to the extent to which the classifier or predictor can be understood.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows:

  • It does not require any domain knowledge.
  • It is easy to comprehend.
  • The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm

A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he introduced C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
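
The greedy step of such an algorithm is the choice of the attribute to test at each node. ID3 is generally described as picking the attribute with the highest information gain; the text above does not spell out the selection measure, so the sketch below should be read as an illustration under that assumption. The training tuples and attribute names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in sorted({r[attribute] for r in rows}):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training tuples
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]

# Greedy choice: the attribute with the highest gain becomes the test at this node
best = max(["student", "credit"], key=lambda a: information_gain(rows, a, "buys"))
print(best)
```

A full tree builder would apply the same choice recursively to each subset produced by the split, never revisiting an earlier decision.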

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches

There are two approaches to prune a tree:

Pre-pruning: The tree is pruned by halting its construction early.

Post-pruning: This approach removes a sub-tree from a fully grown tree.

Cost Complexity

The cost complexity is measured by the following two parameters:

  • Number of leaves in the tree, and
  • Error rate of the tree.

Bayesian classification is based on Bayes’ Theorem. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayes’ Theorem

Bayes’ Theorem is named after Thomas Bayes. There are two types of probabilities:

  • Posterior probability, P(H|X)
  • Prior probability, P(H)

where X is a data tuple and H is some hypothesis.

According to Bayes’ Theorem,

P(H|X) = P(X|H) P(H) / P(X)
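
A small worked example may help make the theorem concrete. The hypothesis, the evidence, and all three probabilities below are hypothetical values chosen for illustration.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' Theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical setting: H = "customer buys a computer", X = "customer is a student"
p_h = 0.4          # prior P(H)
p_x_given_h = 0.5  # likelihood P(X|H)
p_x = 0.25         # evidence P(X)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(p_h_given_x)  # posterior P(H|X)
```

Observing X raises the probability of H from the prior 0.4 to a posterior of about 0.8 in this made-up setting.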

Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.

A Belief Network allows class conditional independencies to be defined between subsets of variables.

It provides a graphical model of causal relationships, on which learning can be performed.

We can use a trained Bayesian Network for classification.

There are two components that define a Bayesian Belief Network:

  • Directed acyclic graph
  • A set of conditional probability tables

Directed Acyclic Graph

  • Each node in a directed acyclic graph represents a random variable.
  • These variables may be discrete or continuous valued.
  • These variables may correspond to actual attributes given in the data.

Directed Acyclic Graph Representation

The following diagram shows a directed acyclic graph for six Boolean variables.

The arcs in the diagram allow the representation of causal knowledge. For example, lung cancer is influenced by a person’s family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.

Conditional Probability Table

The conditional probability table for the values of the variable LungCancer (LC), showing each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S), is as follows:


A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form:

IF condition THEN conclusion

Let us consider a rule R1,

The IF part of the rule is called the rule antecedent or precondition.

The THEN part of the rule is called the rule consequent.

The antecedent (the condition) consists of one or more attribute tests, and these tests are logically ANDed.

The consequent consists of a class prediction.

Note: We can also write rule R1 as follows:

If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.

To extract rules from a decision tree:

One rule is created for each path from the root to a leaf node.

To form a rule antecedent, each splitting criterion along the path is logically ANDed.

The leaf node holds the class prediction, forming the rule consequent.
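
The three points above can be sketched as a short recursive walk over the tree. The nested-dictionary tree layout, the function name `extract_rules`, and the example tree are assumptions made for this illustration, not a representation taken from the text.

```python
def extract_rules(tree, conditions=()):
    """Return one (antecedent, class) rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf node: it holds the class prediction (the rule consequent)
        return [(list(conditions), tree)]
    rules = []
    attribute = tree["attribute"]
    for value, subtree in tree["branches"].items():
        # Each splitting criterion on the path is ANDed into the antecedent
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

# Hypothetical decision tree
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "senior": "buys",
    },
}

rules = extract_rules(tree)
for antecedent, label in rules:
    tests = " AND ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"IF {tests} THEN {label}")
```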

Rule Induction Using Sequential Covering Algorithm

The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues on the remaining tuples.

Note: Decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.

In the sequential learning algorithm, rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci only and no tuples from any other class.
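
The learn-one-rule, remove-covered-tuples loop can be illustrated with a deliberately simplified sketch. It is not AQ, CN2, or RIPPER: each rule here is a single attribute test chosen greedily by accuracy, and the function names and training tuples are hypothetical. The sketch assumes every tuple has at least one attribute besides the class.

```python
def learn_one_rule(rows, target, positive):
    """Pick the single attribute test with the best accuracy for the
    positive class (ties broken by the number of positives covered)."""
    best, best_score = None, (-1.0, 0)
    for attribute in rows[0]:
        if attribute == target:
            continue
        for value in sorted({r[attribute] for r in rows}):
            covered = [r for r in rows if r[attribute] == value]
            pos = sum(r[target] == positive for r in covered)
            score = (pos / len(covered), pos)
            if score > best_score:
                best, best_score = (attribute, value), score
    return best

def sequential_covering(rows, target, positive):
    """Learn rules one at a time, removing covered tuples after each rule."""
    rules, remaining = [], list(rows)
    while any(r[target] == positive for r in remaining):
        attribute, value = learn_one_rule(remaining, target, positive)
        rules.append((attribute, value))
        remaining = [r for r in remaining if r[attribute] != value]
    return rules

# Hypothetical training tuples
rows = [
    {"age": "youth", "student": "yes", "buys": "yes"},
    {"age": "youth", "student": "no", "buys": "no"},
    {"age": "senior", "student": "no", "buys": "yes"},
    {"age": "senior", "student": "yes", "buys": "yes"},
]

rules = sequential_covering(rows, target="buys", positive="yes")
print(rules)
```

Real sequential covering algorithms grow each rule by adding conjuncts until it covers few or no tuples of other classes; the single-test rule here only shows the outer loop.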

Rule Pruning

A rule is pruned for the following reason:

The assessment of quality is made on the original set of training data. The rule may perform well on the training data but less well on subsequent data. That is why rule pruning is required.

A rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.

Note: This value increases with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
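
The pruning decision can be written out directly. The tuple counts below are hypothetical values for a rule and its pruned version, both measured on the same pruning set.

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg)."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts on an independent pruning set
full_rule = foil_prune(pos=40, neg=20)    # rule R as learned
pruned_rule = foil_prune(pos=45, neg=15)  # R with one conjunct removed

# Prune R only if the pruned version scores higher
should_prune = pruned_rule > full_rule
print(full_rule, pruned_rule, should_prune)
```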

Here we will discuss other classification methods such as Genetic Algorithms, the Rough Set Approach, and the Fuzzy Set Approach.

Genetic Algorithms

The idea of the genetic algorithm is derived from natural evolution. In a genetic algorithm, first of all, the initial population is created. This initial population consists of randomly generated rules. We can represent each rule by a string of bits.

For example, suppose that in a given training set the samples are described by two Boolean attributes, A1 and A2, and that the training set contains two classes, C1 and C2.

We can encode the rule IF A1 AND NOT A2 THEN C2 as the bit string 100. In this bit representation, the two leftmost bits represent the attributes A1 and A2, respectively, and the rightmost bit represents the class.

Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.

Note: If an attribute has K values where K > 2, then we can use K bits to encode the attribute values. The classes are also encoded in the same manner.

Points to remember:

Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population, as well as offspring of these rules.

The fitness of a rule is assessed by its classification accuracy on a set of training samples.

Genetic operators such as crossover and mutation are applied to create offspring.

In crossover, substrings from a pair of rules are swapped to form a new pair of rules.

In mutation, randomly selected bits in a rule’s string are inverted.
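
The two genetic operators can be sketched on the bit-string encoding used above. To keep the example reproducible, the crossover point and the mutated position are passed in explicitly; in a real genetic algorithm they would be chosen at random. The function names are hypothetical.

```python
def crossover(rule_a, rule_b, point):
    """Swap the substrings after `point` between two bit-string rules."""
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule, position):
    """Invert the bit at `position` in a bit-string rule."""
    flipped = "1" if rule[position] == "0" else "0"
    return rule[:position] + flipped + rule[position + 1:]

# The two encoded rules from the text: 100 and 001
child_a, child_b = crossover("100", "001", point=1)
mutant = mutate("100", position=1)
print(child_a, child_b, mutant)
```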

Rough Set Approach

We can use the rough set approach to discover structural relationships within imprecise and noisy data.

Note: This approach can only be applied to discrete-valued attributes. Therefore, continuous-valued attributes must be discretized before use.

Rough Set Theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible, which means the samples are identical with respect to the attributes describing the data.

There are some classes in real-world data that cannot be distinguished in terms of the available attributes. We can use rough sets to roughly define such classes.

For a given class C, the rough set definition is approximated by two sets, as follows:

Lower Approximation of C: The lower approximation of C consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to class C.

Upper Approximation of C: The upper approximation of C consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to C.
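
Both approximations follow from grouping tuples into equivalence classes. The sketch below is a minimal illustration; the function name, attribute names, and sample tuples are hypothetical.

```python
from collections import defaultdict

def approximations(rows, attributes, target, cls):
    """Return the lower and upper approximations of class `cls`."""
    # Tuples with identical attribute values are indiscernible
    classes = defaultdict(list)
    for r in rows:
        classes[tuple(r[a] for a in attributes)].append(r)
    lower, upper = [], []
    for members in classes.values():
        labels = {r[target] for r in members}
        if labels == {cls}:
            lower.extend(members)  # certain to belong to cls
        if cls in labels:
            upper.extend(members)  # cannot be ruled out of cls
    return lower, upper

# Hypothetical tuples: 3 and 4 are indiscernible but carry different classes
rows = [
    {"id": 1, "income": "high", "owns_home": "yes", "risk": "low"},
    {"id": 2, "income": "high", "owns_home": "yes", "risk": "low"},
    {"id": 3, "income": "low", "owns_home": "no", "risk": "low"},
    {"id": 4, "income": "low", "owns_home": "no", "risk": "high"},
]

lower, upper = approximations(rows, ["income", "owns_home"], "risk", "low")
lower_ids = [r["id"] for r in lower]
upper_ids = [r["id"] for r in upper]
print(lower_ids, upper_ids)
```

Tuples 3 and 4 fall into the upper approximation only, because the available attributes cannot tell them apart.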

The following diagram shows the upper and lower approximations of class C:

Fuzzy Set Approaches

Fuzzy Set Theory is also called Possibility Theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to two-value logic and probability theory. This theory allows us to work at a high level of abstraction. It also provides the means for dealing with imprecise measurement of data.

Fuzzy set theory also allows us to deal with vague or inexact facts. For example, being a member of the set of high incomes is inexact (e.g., if $50,000 is high, then what about $49,000 and $48,000?). In a traditional crisp set, an element belongs either to S or to its complement, but in fuzzy set theory an element can belong to more than one fuzzy set.
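
Degrees of membership can be illustrated with simple piecewise-linear membership functions. The shapes and breakpoints below are assumptions made for this sketch; they are not the membership functions or values of the original example.

```python
def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def triangular(x, left, peak, right):
    """Membership rising from `left` to `peak`, then falling to `right`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy sets over annual income
def m_medium_income(income):
    return triangular(income, 20_000, 40_000, 60_000)

def m_high_income(income):
    return ramp_up(income, 30_000, 50_000)

# One income value belongs to both sets, to different degrees
print(m_medium_income(49_000), m_high_income(49_000))
```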

For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing degrees. Fuzzy set notation for this income value is as follows:

where ‘m’ is the membership function that operates on the fuzzy sets of medium_income and high_income, respectively. This notation can be shown diagrammatically as follows:

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of grouping a set of abstract objects into classes of similar objects.

Points to Remember

A cluster of data objects can be treated as one group.

While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

Clustering can also help marketers discover distinct groups in their customer base. They can then characterize their customer groups based on purchasing patterns.

In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent to populations.

Clustering also helps in the identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.

Clustering also helps in classifying documents on the web for information discovery.

Clustering is also used in outlier detection applications such as detection of credit card fraud.

As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Requirements of Clustering in Data Mining

The following are the typical requirements of clustering in data mining:

Scalability: We need highly scalable clustering algorithms to deal with large databases.

Ability to deal with different kinds of attributes: Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.

Discovery of clusters with arbitrary shape: The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find spherical clusters of small size.

High dimensionality: The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.

Ability to deal with noisy data: Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.

Interpretability: The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories:

  • Partitioning Method
  • Hierarchical Method
  • Density-based Method
  • Grid-Based Method
  • Model-Based Method
  • Constraint-based Method

Partitioning Method

Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition represents a cluster, and k ≤ n. This means that the method classifies the data into k groups, which satisfy the following requirements:

Each group contains at least one object.

Each object must belong to exactly one group.

For a given number of partitions (say k), the partitioning method creates an initial partitioning.

Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
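
One well-known instance of iterative relocation is k-means, which the text does not name; the sketch below uses it purely as an illustration, on one-dimensional data. The function name, the data points, and the initial centers are hypothetical.

```python
def k_means_1d(points, centers, iterations=10):
    """Repeatedly assign points to the nearest center and recompute centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old center (a simplification; strictly,
        # every group should contain at least one object)
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # no object changed group: the partitioning is stable
        centers = new_centers
    return centers, clusters

# Hypothetical data and a deliberately poor initial partitioning
centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[1.0, 2.0])
print(centers, clusters)
```

Each pass moves objects between groups until the assignment stops changing, which is the relocation idea described above.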

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here:

Agglomerative Approach

This approach is also known as the bottom-up approach. We start with each object forming a separate group. The method keeps merging the objects or groups that are close to one another. It continues until all of the groups are merged into one or until the termination condition holds.
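
The bottom-up merging can be sketched on one-dimensional data. The sketch assumes single linkage (the distance between two groups is the distance between their closest members) and uses a target number of clusters as the termination condition; both choices, the function name, and the data are assumptions for illustration.

```python
def agglomerate(points, target_clusters):
    """Merge the two closest groups until `target_clusters` groups remain.
    Assumes target_clusters >= 1."""
    clusters = [[p] for p in points]  # start: every object is its own group
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge group j into group i
    return clusters

result = agglomerate([1, 2, 9, 10, 25], target_clusters=2)
print(result)
```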

Divisive Approach

This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical clustering:

Perform careful analysis of object linkages at each hierarchical partitioning.

Integrate hierarchical agglomeration with other techniques by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
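
The density test at the heart of this method can be sketched as follows. The function names, the radius, the minimum point count, and the sample points are hypothetical; a full density-based algorithm would use this test to decide whether to keep expanding a cluster.

```python
from math import dist

def neighbors(points, center, radius):
    """All points lying within `radius` of `center` (including itself)."""
    return [p for p in points if dist(p, center) <= radius]

def is_dense(points, center, radius, min_points):
    """True if the neighborhood of `center` holds at least `min_points`."""
    return len(neighbors(points, center, radius)) >= min_points

# Hypothetical 2-D data: three nearby points and one isolated point
points = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(is_dense(points, (0, 0), radius=1.5, min_points=3))
print(is_dense(points, (5, 5), radius=1.5, min_points=3))
```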

Grid-based Method

In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.

The major advantage of this method is fast processing time.

The processing time depends only on the number of cells in each dimension of the quantized space.

Model-based methods

In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. This method locates the clusters by clustering the density function, which reflects the spatial distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.

Constraint-based Method

In this method, the clustering is performed by incorporating user or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process. Constraints can be specified by the user or by the application requirement.

Text databases consist of large collections of documents. They collect this information from several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the amount of information, text databases are growing rapidly. In many text databases, the data is semi-structured.

For example, a document may contain a few structured fields, such as title, author, publishing_date, etc. But along with the structured data, the document also contains unstructured text components, such as the abstract and contents. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare documents and rank their importance and relevance. Therefore, text mining has become a popular and essential theme in data mining.

Information Retrieval

Information retrieval deals with the retrieval of information from a large number of text-based documents. Some typical database system concerns are usually not present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include:

  • Online Library catalogue system
  • Online Document Management Systems
  • Web Search Systems etc.

Note: The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user’s query. Such a query consists of some keywords describing an information need.

In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad-hoc information need, i.e., a short-term need. But if the user has a long-term information need, then the retrieval system can also take the initiative to push any newly arrived information item to the user.

This kind of access to information is called Information Filtering, and the corresponding systems are known as Filtering Systems or Recommender Systems.

Basic Measures for Text Retrieval

We need to check the accuracy of a system when it retrieves a number of documents on the basis of a user’s input. Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram as follows:

There are three fundamental measures for assessing the quality of text retrieval:

  • Precision
  • Recall
  • F-score

Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as:

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as:

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-score is the commonly used trade-off. An information retrieval system often needs to trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision, as follows:

F-score = (2 × precision × recall) / (precision + recall)
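
The three measures can be computed directly from sets of document identifiers. In the sketch below, the function name and the document identifiers are hypothetical, and empty sets are mapped to a score of 0.0 as a simplifying convention.

```python
def retrieval_scores(relevant, retrieved):
    """Return (precision, recall, f_score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # documents both relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical query result
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
precision, recall, f_score = retrieval_scores(relevant, retrieved)
print(precision, recall, f_score)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved, so precision is higher than recall.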

The World Wide Web contains huge amounts of information that provide a rich source for data mining.

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery, based on the following observations:

The web is too huge: The size of the web is very large and rapidly increasing. It seems that the web is too huge for data warehousing and data mining.

Complexity of web pages: Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and these libraries are not arranged in any particular sorted order.

The web is a dynamic information source: The information on the web is rapidly updated. Data such as news, stock markets, weather, sports, shopping, etc., is regularly updated.

Diversity of user communities: The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.

Relevancy of information: A particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp the desired results.

Mining Web Page Layout Structure

The basic structure of a web page is based on the Document Object Model (DOM). The DOM structure is a tree-like structure in which each HTML tag in the page corresponds to a node in the DOM tree. We can segment a web page by using predefined tags in HTML. The HTML syntax is flexible; therefore, many web pages do not follow the W3C specifications. Not following the W3C specifications may cause errors in the DOM tree structure.

The DOM structure was initially introduced for presentation in the browser, not for describing the semantic structure of a web page. The DOM structure cannot correctly identify the semantic relationships between the different parts of a web page.

Vision-based page segmentation (VIPS)

The purpose of VIPS is to extract the semantic structure of a web page based on its visual presentation.

Such a semantic structure corresponds to a tree structure. In this tree, each node corresponds to a block.

A value is assigned to each node. This value is called the Degree of Coherence. It indicates how coherent the content in the block is, based on visual perception.

The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that, it finds the separators between these blocks.

The separators are the horizontal or vertical lines in a web page that visually cross no blocks.

The semantics of the web page is constructed on the basis of these blocks.

The following figure shows the procedure of the VIPS algorithm:

Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, yet there are many challenges in this field. In this tutorial, we will discuss the applications and trends of data mining.

Data Mining Applications

Here is the list of areas where data mining is widely used:

  • Financial Data Analysis
  • Retail Industry
  • Telecommunication Industry
  • Biological Data Analysis
  • Other Scientific Applications
  • Intrusion Detection

Financial Data Analysis

The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical cases are as follows:

Design and construction of data warehouses for multidimensional data analysis and data mining.

Loan payment prediction and customer credit policy analysis.

Classification and clustering of customers for targeted marketing.

Detection of money laundering and other financial crimes.

Retail Industry

Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.

Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is the list of examples of data mining in the retail industry:

Design and construction of data warehouses based on the benefits of data mining.

Multidimensional analysis of sales, customers, products, time and region.

Analysis of effectiveness of sales campaigns.

Product recommendation and cross-referencing of items.

Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business.

Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is the list of examples for which data mining improves telecommunication services:

Multidimensional analysis of telecommunication data.

Fraudulent pattern analysis.

Identification of unusual patterns.

Multidimensional association and sequential patterns analysis.

Mobile Telecommunication services.

Use of visualization tools in telecommunication data analysis.

Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis:

Semantic integration of heterogeneous, distributed genomic and proteomic databases.

Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.

Discovery of structural patterns and analysis of genetic networks and protein pathways.

Association and path analysis.

Visualization tools in genetic data analysis.

Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data sets, for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. A large number of data sets is also being generated by fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining in the field of scientific applications:

  • Data warehouses and data preprocessing.
  • Graph-based mining.
  • Visualization and domain-specific knowledge.

Intrusion Detection

Intrusion refers to any kind of activity that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have made intrusion detection a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection:

Development of data mining algorithms for intrusion detection.

Association and correlation analysis, and aggregation to help select and build discriminating attributes.

Analysis of stream data.

Distributed data mining.

Visualization and query tools.

Gegevens Mining System Products

There are many gegevens mining system products and domain specific gegevens mining applications. The fresh gegevens mining systems and applications are being added to the previous systems. Also, efforts are being made to standardize gegevens mining languages.

Choosing a Gegevens Mining System

The selection of a gegevens mining system depends on the following features &minus,

Gegevens Types &minus, The gegevens mining system may treat formatted text, record-based gegevens, and relational gegevens. The gegevens could also be ter ASCII text, relational database gegevens or gegevens warehouse gegevens. Therefore, wij should check what precies format the gegevens mining system can treat.

System Issues &minus, Wij voorwaarde consider the compatibility of a gegevens mining system with different operating systems. One gegevens mining system may run on only one operating system or on several. There are also gegevens mining systems that provide web-based user interfaces and permit XML gegevens spil input.

Gegevens Sources &minus, Gegevens sources refer to the gegevens formats ter which gegevens mining system will operate. Some gegevens mining system may work only on ASCII text files while others on numerous relational sources. Gegevens mining system should also support ODBC connections or OLE DB for ODBC connections.

Gegevens Mining functions and methodologies &minus, There are some gegevens mining systems that provide only one gegevens mining function such spil classification while some provides numerous gegevens mining functions such spil concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.

Coupling Data Mining with Databases or Data Warehouse Systems − Data mining systems need to be coupled with a database or a data warehouse system. The coupled components are integrated into a uniform information processing environment. The types of coupling are listed below −

  • No Coupling
  • Loose Coupling
  • Semi-tight Coupling
  • Tight Coupling
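
As a rough illustration of the loosely coupled arrangement in the list above, the sketch below fetches rows from a database with a plain SQL query and then does the mining step in application code. It assumes that "loose coupling" means the mining program uses the database only as a data store. The `sales` table, its contents, and the helper names `fetch_baskets` and `product_support` are hypothetical, and an in-memory SQLite database stands in for a real database or data warehouse system.

```python
import sqlite3
from collections import Counter

# Hypothetical in-memory database standing in for the coupled database system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("ann", "bread"), ("ann", "milk"),
        ("bob", "bread"),
        ("cy", "bread"), ("cy", "milk"),
    ],
)


def fetch_baskets(connection):
    """Fetch raw rows through SQL and group products by customer."""
    rows = connection.execute("SELECT customer, product FROM sales").fetchall()
    baskets = {}
    for customer, product in rows:
        baskets.setdefault(customer, set()).add(product)
    return baskets


def product_support(baskets):
    """Mining step done outside the database: share of baskets containing each product."""
    counts = Counter(product for items in baskets.values() for product in items)
    return {product: count / len(baskets) for product, count in counts.items()}


support = product_support(fetch_baskets(conn))
print(support)
```

In this sketch the database does no mining work of its own. A more tightly coupled design would push part of the computation, for example the grouping and counting, into the database system itself.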

Scalability − There are two scalability issues in data mining −

Row (Database Size) Scalability − A data mining system is considered row scalable if, when the number of rows is increased 10 times, it takes no more than 10 times as long to execute a query.

Column (Dimension) Scalability − A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
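
The two definitions above can be turned into simple checks on measured query times. The following is a minimal sketch, not a standard benchmark: the function names, the timing figures, and the 20% tolerance used to decide whether growth is "linear" are all assumptions made for illustration.

```python
def is_row_scalable(base_time, scaled_time, factor=10.0):
    """Row scalable: `factor` times more rows costs at most `factor` times the time."""
    return scaled_time <= factor * base_time


def is_column_scalable(column_counts, times, tolerance=0.2):
    """Column scalable: execution time grows roughly linearly with the column count.

    Fits a least-squares line through the measurements and accepts them if every
    measured time lies within `tolerance` (relative error) of that line.
    The tolerance value is an assumption, not part of the definition.
    """
    n = len(column_counts)
    mean_x = sum(column_counts) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in column_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(column_counts, times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return all(
        abs(y - (slope * x + intercept)) <= tolerance * y
        for x, y in zip(column_counts, times)
    )


# Hypothetical timings in seconds.
print(is_row_scalable(base_time=2.0, scaled_time=18.0))             # 10x rows, 9x time
print(is_column_scalable([10, 20, 40, 80], [1.0, 2.1, 3.9, 8.0]))   # near-linear growth
print(is_column_scalable([10, 20, 40, 80], [1.0, 4.0, 16.0, 64.0])) # quadratic growth
```

With these made-up numbers, the first two checks pass and the third fails, because the quadratic timings fall far from any straight line.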

Visualization Tools − Visualization in data mining can be categorized as follows −

  • Data Visualization
  • Mining Results Visualization
  • Mining Process Visualization
  • Visual Data Mining

Data Mining Query Language and Graphical User Interface − An easy-to-use graphical user interface is important to promote user-guided, interactive data mining. Unlike relational database systems, data mining systems do not share an underlying data mining query language.

Trends in Data Mining

Data mining concepts are still evolving, and here are the latest trends in this field −

  • Application exploration.
  • Scalable and interactive data mining methods.
  • Integration of data mining with database systems, data warehouse systems, and web database systems.
  • Standardization of data mining query language.
  • Visual data mining.
  • New methods for mining complex types of data.
  • Biological data mining.
  • Data mining and software engineering.
  • Web mining.
  • Distributed data mining.
  • Real-time data mining.
  • Multi-database data mining.
  • Privacy protection and information security in data mining.

Theoretical Foundations of Data Mining

The theoretical foundations of data mining include the following concepts −

Data Reduction − The basic idea of this theory is to reduce the data representation, trading accuracy for speed in response to the need to obtain quick approximate answers to queries on very large databases. Some of the data reduction techniques are as follows −
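
As a hedged illustration of the accuracy-for-speed trade-off described above, the sketch below uses random sampling, one common data reduction technique, to answer an average-value query approximately. The synthetic data, the sample size, and the function name `approximate_mean` are assumptions chosen only for this example.

```python
import random


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample instead of scanning every value."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size


# Synthetic stand-in for a very large table column.
values = [float(i % 100) for i in range(100_000)]

exact = sum(values) / len(values)         # full scan: exact answer
approx = approximate_mean(values, 1_000)  # reads only 1% of the values

print(f"exact={exact:.2f} approx={approx:.2f}")
```

The sampled answer touches one hundredth of the data and is only approximately equal to the exact mean. A larger sample narrows the error at the cost of more work.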
