
Understanding data analysis


The 21st century is the century of information. Almost every aspect of daily life now generates data, and business operations, government operations, and social media posts add to it continuously. This data accumulates day by day, generated by business, government, scientific, engineering, health, social, climate, and environmental activities. To make decisions in all of these domains, we need a systematic, generalized, effective, and flexible analytical process that lets us gain insights from the data being generated.

In today's smart world, data analysis supports effective decision-making in business and government operations. Data analysis is the activity of inspecting, preprocessing, exploring, describing, and visualizing a given dataset. Its main objective is to discover the information needed for decision-making. Data analysis offers multiple approaches, tools, and techniques that can be applied to diverse domains such as business, social science, and fundamental science.

Let's look at some of the core data analysis libraries of the Python ecosystem:

- NumPy: Short for Numerical Python, this is the core scientific library in Python for handling multidimensional arrays and matrices, with methods for efficient mathematical computation.

- SciPy: A powerful scientific computing library for performing scientific, mathematical, and engineering operations.

- Pandas: A data exploration and manipulation library that offers tabular data structures such as the DataFrame, along with a wide range of methods for data analysis and manipulation.

- Scikit-learn: A machine learning library (its name derives from "SciPy toolkit") that offers a variety of supervised and unsupervised algorithms for regression, classification, dimensionality reduction, cluster analysis, and anomaly detection.

- Matplotlib: The core data visualization library, on which many other Python visualization libraries are built. It offers 2D and 3D plots, graphs, charts, and figures for data exploration, and it runs on top of NumPy.

- Seaborn: Built on Matplotlib, it offers a high-level interface for drawing attractive, well-organized statistical plots with little code.

- Plotly: A data visualization library that offers high-quality, interactive graphs such as scatter charts, line charts, bar charts, histograms, box plots, heat maps, and subplots.
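To see a few of these libraries side by side, here is a minimal sketch using NumPy, pandas, and Matplotlib (assuming all three are installed); the sample values are invented for illustration:

```python
# A quick tour of three core libraries; the sample values are made up.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# NumPy: multidimensional arrays with fast vectorized math
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
print(arr.mean())  # 2.5

# pandas: tabular data in a labeled DataFrame
df = pd.DataFrame({"city": ["Oslo", "Pune"], "temp_c": [4.0, 29.0]})
print(df["temp_c"].max())  # 29.0

# Matplotlib: a basic line plot written to a PNG file
plt.plot([1, 2, 3], [2, 4, 6])
plt.savefig("line.png")
```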





The standard process of data analysis

Data analysis refers to investigating data, finding meaningful insights in it, and drawing conclusions. The main goal of this process is to collect, filter, clean, transform, explore, describe, visualize, and communicate insights from data in order to discover decision-making information. Generally, the data analysis process comprises the following phases:

1. Collecting Data: Collect and gather data from several sources.

2. Preprocessing Data: Filter, clean, and transform the data into the required format.

3. Analyzing and Finding Insights: Explore, describe, and visualize the data to find insights and draw conclusions.

4. Interpreting Insights: Understand the insights and determine the impact each variable has on the system.

5. Storytelling: Communicate your results in the form of a story so that a layperson can understand them.
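The five phases can be sketched on a tiny in-memory dataset with pandas; the column names and figures below are hypothetical:

```python
# A minimal sketch of the five phases; "region" and "sales" are hypothetical.
import pandas as pd

# 1. Collecting data: here, from an in-memory record set instead of real sources
raw = pd.DataFrame({
    "region": ["north", "south", "north", "south", None],
    "sales": [120.0, 95.0, 130.0, None, 80.0],
})

# 2. Preprocessing: filter out incomplete rows
clean = raw.dropna()

# 3. Analyzing and finding insights: describe and aggregate
summary = clean.groupby("region")["sales"].mean()

# 4. Interpreting insights: which region drives revenue?
best_region = summary.idxmax()

# 5. Storytelling: communicate the result in plain language
print(f"Average sales are highest in the {best_region} region.")
```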






The KDD process

The KDD acronym stands for knowledge discovery from data, or Knowledge Discovery in Databases. Many people treat KDD as a synonym for data mining, which refers to the process of discovering interesting patterns in data. The main objective of KDD is to extract or discover hidden, interesting patterns from large databases, data warehouses, and other web and information repositories. The KDD process has seven major phases:

1. Data Cleaning: In this first phase, data is preprocessed. Here, noise is removed, missing values are handled, and outliers are detected.

2. Data Integration: In this phase, data from different sources is combined and integrated together using data migration and ETL tools.

3. Data Selection: In this phase, data relevant to the analysis task is retrieved.

4. Data Transformation: In this phase, data is engineered into the form required for analysis.

5. Data Mining: In this phase, data mining techniques are used to discover useful and unknown patterns.

6. Pattern Evaluation: In this phase, the extracted patterns are evaluated.

7. Knowledge Presentation: After pattern evaluation, the extracted knowledge needs to be visualized and presented to business people for decision-making purposes.
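As a sketch of the data-cleaning phase, the following handles missing values by median imputation and flags outliers with the common 1.5 × IQR rule; the threshold and sample values are illustrative choices, not part of the KDD definition:

```python
# Data cleaning sketch: impute missing values, then flag outliers.
import pandas as pd

s = pd.Series([10.0, 12.0, None, 11.0, 13.0, 250.0])

# Handle missing values by imputing the median of the observed values
filled = s.fillna(s.median())

# Detect outliers with the 1.5 * IQR rule (an illustrative convention)
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
outliers = filled[(filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)]
print(list(outliers))  # the extreme value 250.0 is flagged
```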




SEMMA

The SEMMA acronym stands for Sample, Explore, Modify, Model, and Assess. This sequential data mining process was developed by SAS. The SEMMA process has five major phases:

1. Sample: In this phase, we identify different databases and merge them. After this, we select the data sample that's sufficient for the modeling process.

2. Explore: In this phase, we understand the data, discover the relationships among variables, visualize the data, and get initial interpretations.

3. Modify: In this phase, data is prepared for modeling. This phase involves dealing with missing values, detecting outliers, transforming features, and creating new additional features.

4. Model: In this phase, the main concern is selecting and applying different modeling techniques, such as linear and logistic regression, backpropagation networks, k-nearest neighbors, support vector machines, decision trees, and random forests.

5. Assess: In this last phase, the predictive models that have been developed are evaluated using performance evaluation measures.
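The Model and Assess phases might be sketched with scikit-learn on its bundled Iris dataset; the choice of logistic regression and the 70/30 train/test split are illustrative:

```python
# Model and Assess sketch: fit a classifier, then score it on held-out data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for assessment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Model: fit a logistic regression classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Assess: evaluate predictive performance on the held-out data
acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {acc:.2f}")
```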

The preceding diagram shows the steps involved in the SEMMA process. SEMMA emphasizes model building and assessment. Now, let's discuss the CRISP-DM process.



CRISP-DM

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. CRISP-DM is a well-defined, well-structured, and well-proven process for machine learning, data mining, and business intelligence projects. It is a robust, flexible, cyclic, useful, and practical approach to solving business problems and discovering hidden, valuable information or patterns in databases. The CRISP-DM process has six major phases:

1. Business Understanding: In this first phase, the main objective is to understand

the business scenario and requirements for designing an analytical goal and

initial action plan.

2. Data Understanding: In this phase, the main objective is to understand the data

and its collection process, perform data quality checks, and gain initial insights.

3. Data Preparation: In this phase, the main objective is to prepare analytics-ready data. This involves handling missing values, detecting and handling outliers, normalizing data, and feature engineering. This phase is the most time-consuming for data scientists/analysts.

4. Modeling: This is the most exciting phase of the whole process since this is

where you design the model for prediction purposes. First, the analyst needs to

decide on the modeling technique and develop models based on data.

5. Evaluation: Once the model has been developed, it's time to assess and test the model's performance on validation and test data using model evaluation measures such as MSE, RMSE, and R-squared for regression, and accuracy, precision, recall, and the F1-measure for classification.

6. Deployment: In this final phase, the model that was chosen in the previous step

will be deployed to the production environment. This requires a team effort from

data scientists, software developers, DevOps experts, and business professionals.
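The evaluation measures named in the Evaluation phase can be computed with scikit-learn; the true values and predictions below are invented purely to illustrate the calls:

```python
# Evaluation-phase sketch: regression and classification metrics.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Regression: MSE, RMSE, and R-squared on hypothetical predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Classification: accuracy, precision, recall, and F1 on hypothetical labels
c_true = [1, 0, 1, 1, 0, 1]
c_pred = [1, 0, 0, 1, 0, 1]
acc = accuracy_score(c_true, c_pred)
prec = precision_score(c_true, c_pred)
rec = recall_score(c_true, c_pred)
f1 = f1_score(c_true, c_pred)
```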


The following diagram shows the full cycle of the CRISP-DM process:






The standard process focuses on discovering insights and interpreting them in the form of a story, while KDD focuses on data-driven pattern discovery and its visualization. SEMMA focuses mainly on model-building tasks, while CRISP-DM focuses on business understanding and deployment. Now that we know about some of the processes surrounding data analysis, let's compare data analysis and data science to find out how they are related, as well as what makes them different from one another.


Comparing data analysis and data science

Data analysis is the process of exploring data to discover patterns that help us make business decisions. It is a subdomain of data science. Data analysis methods and tools are widely used across business domains by business analysts, data scientists, and researchers; the main objective is to improve productivity and profits. Data analysis extracts and queries data from different sources, performs exploratory data analysis, visualizes data, prepares reports, and presents them to the decision-making authorities.

Data science, on the other hand, is an interdisciplinary area that uses a scientific approach to extract insights from structured and unstructured data. It is an umbrella term covering data analytics, data mining, machine learning, and related domains. Data science is not limited to exploratory data analysis: it also develops models and prediction algorithms, for example forecasting stock prices, weather, disease, or fraud, and recommending movies, books, or music.



The roles of data analysts and data scientists

A data analyst collects, filters, and processes data and applies the required statistical concepts to capture patterns, trends, and insights from it, preparing reports that support decision-making. The data analyst's main objective is to help companies solve business problems using the discovered patterns and trends. The data analyst also assesses the quality of the data and handles issues concerning data acquisition. A data analyst should be proficient in writing SQL queries, finding patterns, and using visualization and reporting tools such as Microsoft Power BI, IBM Cognos, Tableau, QlikView, Oracle BI, and more.

Data scientists are more technical and mathematical than data analysts. Data scientists are research- and academic-oriented, whereas data analysts are more application-oriented. Data scientists are expected to predict future events, whereas data analysts extract significant insights from existing data. Data scientists develop their own questions, while data analysts answer given questions. Finally, data scientists focus on what is going to happen, whereas data analysts focus on what has happened so far. We can summarize these two roles using the following table.


Feature        Data Scientist                                       Data Analyst
Background     Predicts future events and scenarios based on data   Discovers meaningful insights from the data
Role           Formulates questions that can profit the business    Solves business questions to make decisions
Type of data   Both structured and unstructured data                Structured data only
Programming    Advanced programming                                 Basic programming
Skill set      Statistics, machine learning, NLP, and deep learning Statistics, SQL, and data visualization
Tools          R, Python, SAS, Hadoop, Spark, TensorFlow, and Keras Excel, SQL, R, Tableau, and QlikView


Now that we know what defines a data analyst and data scientist, as well as how they are different from each other, let's have a look at the various skills that you would need to become one of them.



