Fundamentals of Big Data Analytics: Concepts, Technologies, Methods and Business (English Edition)
Intended readers: this book is suitable for researchers in big data analytics.
Contents
Part One Basics and Concepts

Chapter 1 Introduction
  1.1 What Is Big Data Analytics?
    1.1.1 Big Data Analytics Requires Data-Driven Business Culture
    1.1.2 Big Data Analytics Requires High-Performance Analyses
  1.2 Why Big Data Analytics?
    1.2.1 History and Evolution of Big Data Analytics
    1.2.2 The Drivers of Big Data Analytics
    1.2.3 Why Is Big Data Analytics Important?
    1.2.4 The Challenges of Big Data Analytics
    1.2.5 How Big Data Analytics Is Used Today?
  1.3 Big Data Analytics Applications
    1.3.1 Industries Where Big Data Analytics Are Successful
    1.3.2 Four Powerful Big Data Analytics Application Examples
  1.4 The Big Data Analytics Market
  1.5 Big Data Analytics Future Trends
    1.5.1 Predictive Analytics Will Dominate
    1.5.2 Refocusing on the Human Decision-Making
    1.5.3 Market Segmentation in Data Analysis Platforms
    1.5.4 Open Source Software Tools
    1.5.5 Plug-in AI Technologies
  1.6 The Contents of Big Data Analytics
  1.7 References
  1.8 Review Questions and Exercises

Chapter 2 Data and Big Data
  2.1 Data as a Basic Entity in the DIKW Framework
    2.1.1 DIKW Framework
    2.1.2 Data Object, Data Attribute and Data Set
    2.1.3 Data Attribute Types
  2.2 Big Data
    2.2.1 Big Data Definition
    2.2.2 Big Data Types
  2.3 Quality of Data and Big Data
    2.3.1 Definition of Data Quality
    2.3.2 Data Measurement and Data Collection
    2.3.3 Errors in Measurement and Collection
    2.3.4 Data Accuracy
  2.4 Basic Measurement of Dataset
  2.5 Summary
  2.6 References
  2.7 Review Questions

Chapter 3 Big Data Analytics Process
  3.1 The Process of Data Mining and Knowledge Discovery
    3.1.1 CRISP-DM Framework
    3.1.2 KDD Process
  3.2 Process of Big Data Analytics
    3.2.1 Acquisition
    3.2.2 Understanding
    3.2.3 Preprocess
    3.2.4 Analysis
    3.2.5 Reporting
    3.2.6 Action
  3.3 Data Preprocess
    3.3.1 Data Cleaning
    3.3.2 Data Integration
    3.3.3 Data Reduction
    3.3.4 Data Transformation
  3.4 Big Data Analysis
    3.4.1 Analysis
    3.4.2 Types of Big Data Analysis
    3.4.3 Descriptive Analysis
    3.4.4 Explorative Analysis
    3.4.5 Predictive Data Analysis
  3.5 Summary
  3.6 References
  3.7 Questions and Exercises

Part Two Technologies and Tools

Chapter 4 Supporting Infrastructure
  4.1 Cloud Computing
    4.1.1 Essential Characteristics of Cloud Computing
    4.1.2 Services Provided by Cloud Computing
  4.2 Distributed Computing
    4.2.1 Characteristics of Distributed Systems
    4.2.2 Distributed Systems Composition
    4.2.3 Distributed State
    4.2.4 The CAP Theorem
  4.3 Big Data Systems
    4.3.1 Requirements for a Big Data System
    4.3.2 The Problems with Fully Incremental Architectures
    4.3.3 Lambda Architecture
  4.4 Summary
  4.5 References
  4.6 Questions and Exercises

Chapter 5 Hadoop and MapReduce
  5.1 Computer Cluster
    5.1.1 Concept of Computer Cluster
    5.1.2 Attributes of Clusters
  5.2 Apache Hadoop in a Nutshell
    5.2.1 History and Overview of Hadoop
    5.2.2 What Is Hadoop?
    5.2.3 Components of Hadoop
    5.2.4 The Hadoop Ecosystem
    5.2.5 Hadoop Limitations
  5.3 How Do Hadoop and MapReduce Work?
    5.3.1 Big Example (WordCount)
    5.3.2 Scaling WordCount in MapReduce
    5.3.3 The Driver Method
  5.4 MapReduce Data Flow
  5.5 Other Hadoop Usages
    5.5.1 Chaining Jobs
    5.5.2 Listing and Killing Jobs
    5.5.3 Pipes
    5.5.4 Hadoop Streaming
    5.5.5 Example of Hadoop Streaming Using Python
  5.6 Summary
  5.7 References
  5.8 Review Questions and Exercises
  5.9 Practical Tasks (lab tasks)

Chapter 6 Apache Spark
  6.1 Spark in a Nutshell
    6.1.1 Spark’s Stack
    6.1.2 Spark’s Usage
    6.1.3 Spark’s Advantages
    6.1.4 Fast Application Support
  6.2 Spark High-level Architecture
    6.2.1 How Does a Spark Application Work?
    6.2.2 Application Programming Interface (API)
  6.3 Programming with RDDs
    6.3.1 Steps for Program with RDDs
    6.3.2 Spark Shell
    6.3.3 RDD Creation
    6.3.4 RDD Operations
    6.3.5 Actions
    6.3.6 Checking the Output
  6.4 Spark Application Development and Deployment
    6.4.1 Spark Jobs
    6.4.2 Shared Variables
    6.4.3 General Steps for Create a Spark Application
  6.5 Summary
  6.6 References
  6.7 Questions and Exercises
  6.8 Practical Tasks (lab tasks)

Chapter 7 NoSQL and MongoDB
  7.1 NoSQL in a Nutshell
    7.1.1 What Is NoSQL?
    7.1.2 Why NoSQL?
    7.1.3 The CAP Principle
    7.1.4 ACID Rules
    7.1.5 BASE Rules
    7.1.6 Benefits of NoSQL
    7.1.7 Types of NoSQL Databases
  7.2 NoSQL and Hadoop Integration in Big Data Analytics
    7.2.1 OLTP vs OLAP
    7.2.2 Operational vs Analytical View of NoSQL
    7.2.3 NoSQL Integration with Hadoop
  7.3 MongoDB
    7.3.1 MongoDB Basics
    7.3.2 MongoDB Architecture
    7.3.3 MongoDB Data Modelling
    7.3.4 MongoDB Data Representation
    7.3.5 MongoDB CRUD Operations
  7.4 Big Data Analysis with MongoDB
    7.4.1 Aggregation
    7.4.2 MongoDB with MapReduce
    7.4.3 MongoDB with Hadoop
  7.5 Summary
  7.6 References
  7.7 Questions and Exercises
  7.8 Practical Tasks (lab tasks)

Part Three Methods and Algorithms

Chapter 8 Data Preparation
  8.1 What Is Big Data Preparation?
  8.2 Data Cleaning
    8.2.1 Fill in Missing Values
    8.2.2 Identify Outliers and Smooth Out Noisy Data
    8.2.3 Correct Inconsistent Data
  8.3 Data Integration
    8.3.1 Entity Identification Problem
    8.3.2 Redundancy Identification
    8.3.3 Data Deduplication
  8.4 Data Reduction
    8.4.1 Overview of Data Reduction Strategies
    8.4.2 Reducing the Number of Data Records
    8.4.3 Reducing the Number of Attributes
    8.4.4 Reducing the Number of Attribute Values
  8.5 Data Transformation
    8.5.1 Data Transformation Strategies Overview
    8.5.2 Normalisation
    8.5.3 Generalisation
  8.6 Data Discretisation and Binarisation
    8.6.1 Binarisation
    8.6.2 Discretisation
  8.7 Summary
  8.8 References
  8.9 Questions and Exercises

Chapter 9 Descriptive Data Analysis
  9.1 Descriptive Data Analysis
  9.2 Univariate Descriptive Analyses
    9.2.1 Simple Data Summary
    9.2.2 Location Measures
    9.2.3 Percentiles
    9.2.4 Dispersion Measures
    9.2.5 Distribution or Shape Measures
  9.3 Multivariate Descriptive Analyses
    9.3.1 Contingency Table for Categorical Data
    9.3.2 Multivariate Statistics on Categorical and Continuous Variables
    9.3.3 Multivariate Summary on Quantitative Variables
    9.3.4 Covariance and Correlation Matrices
  9.4 Descriptive Analysis between Data Objects
    9.4.1 Definitions of Similarity, Dissimilarity and Proximity
    9.4.2 Proximity between Data Objects with Single Attribute
    9.4.3 Proximity between Data Objects with Multiple Attributes
    9.4.4 Proximity Analyses Issues and Selections
  9.5 Association Analysis
    9.5.1 Problem Definition
    9.5.2 Frequent Itemset Generation
    9.5.3 Association Rules Generation
    9.5.4 Alternative Association Analysis
    9.5.5 Evaluation of Association Patterns
    9.5.6 Applications of Association Analysis
  9.6 Summary
  9.7 References
  9.8 Questions and Exercises

Chapter 10 Explorative Data Analysis
  10.1 Explorative Analysis Approach
    10.1.1 Motivations for EDA
    10.1.2 Definition of Exploratory Data Analysis
  10.2 Univariate Graphical EDA
    10.2.1 Stem and Leaf Plot
    10.2.2 Histograms
    10.2.3 Box Plot
    10.2.4 Pie Chart
    10.2.5 Bar Chart
    10.2.6 Percentile Plots
    10.2.7 Scatter Plots
    10.2.8 Quantile-Normal Plots
  10.3 Multivariate Graphical EDA
    10.3.1 Generic Approaches for Multivariate
    10.3.2 Extending 2-Dimensional and 3-Dimensional Plots
  10.4 Data Visualisation
    10.4.1 Pixel-Oriented Visualisation Techniques
    10.4.2 Geometric Projection Visualisation Techniques
    10.4.3 Icon-Based Visualisation Techniques
    10.4.4 Visualising Spatio-Temporal Data
    10.4.5 Animation
    10.4.6 Do’s and Don’ts of Visualising Data
  10.5 Multidimensional Data Analysis (OLAP)
    10.5.1 Data Cube: A Multidimensional Data Model
    10.5.2 Typical OLAP Operations
    10.5.3 General Procedure Using Data Cubes and OLAP
  10.6 Data Clustering
    10.6.1 What Is Clustering?
    10.6.2 Basic Clustering Techniques
    10.6.3 Partitioning Clustering Methods
    10.6.4 Hierarchical Clustering Methods
    10.6.5 Density-Based Methods
    10.6.6 Clustering with Mixed Methods
    10.6.7 Clustering Evaluation
  10.7 Summary
  10.8 References
  10.9 Questions and Exercises

Chapter 11 Predictive Data Analysis
  11.1 Introduction to Predictive Data Analysis
    11.1.1 What Is Predictive Data Analysis?
    11.1.2 Predictive Data Analysis History and Its Applications
    11.1.3 The Predictive Analytics Process
    11.1.4 Tools and Software
  11.2 Process of Building Predictive Models
  11.3 Predictive Models
    11.3.1 Predictive Model Types
    11.3.2 Regression Models
    11.3.3 Rule Based Models
    11.3.4 Machine Learning Techniques
  11.4 Predictive Models Evaluation
    11.4.1 Confusion Matrix
    11.4.2 Gain and Lift Charts
    11.4.3 K-S Chart
    11.4.4 Area Under the ROC Curve (AUC – ROC)
    11.4.5 Gini Coefficient
    11.4.6 Cross Validation
    11.4.7 Root Mean Squared Error (RMSE)
  11.5 Classification Problem
    11.5.1 Basic Concepts
    11.5.2 Decision Tree Induction
    11.5.3 Overfitting and Tree Pruning
    11.5.4 Evaluating the Performance of a Classifier
    11.5.5 Comparing the Performance of Two Classifiers
  11.6 Recent Applications of Predictive Data Analytics
    11.6.1 Customer Relationship Management (CRM)
    11.6.2 Risk Management and Fraud Detection
    11.6.3 Clinical Decision Support Systems (CDSS)
    11.6.4 Future and High-Level Economy Prediction
  11.7 Summary
  11.8 References
  11.9 Questions and Exercises

Part Four Social, Ethical and Organisational Issues

Chapter 12 Ethics, Governance and Security of Big Data
  12.1 12 V’s of Big Data
  12.2 Ethics of Big Data
    12.2.1 Relevancy of Ethics in a Big Data World
    12.2.2 Big Data Analytics Ethical Awareness Framework
    12.2.3 Big Data Ethics in Practice
  12.3 Governance of Big Data
    12.3.1 The Definition
    12.3.2 Big Data Governance Framework
  12.4 Big Data Privacy and Security
    12.4.1 Big Data Privacy: The Great Fear
    12.4.2 Data Collection: Understanding Privacy’s First Frontier
    12.4.3 Big Data Security: The Foundation of Privacy
  12.5 Case Studies
    12.5.1 Google Street View Wifi: Inadvertent Over-Collection of Data
    12.5.2 iPhone Location Database
    12.5.3 A Chinese Case
  12.6 Summary
  12.7 References
  12.8 Questions and Exercises

Chapter 13 Building Data-Driven Business Organisations
  13.1 What Is a Data-Driven Organisation?
    13.1.1 Definition of Data-Driven Organisation
    13.1.2 Prerequisites of Data-Driven Organisations
    13.1.3 Activities a Data-Driven Organisation Ought to Do
  13.2 Organisational Big Data Analytics Maturity Models
    13.2.1 SAS’ Eight Levels of Analytics Maturity Model
    13.2.2 TDWI Data Governance Maturity Model
    13.2.3 Analytics Business Maturity Model
    13.2.4 DataFlux Data Governance Maturity Model
    13.2.5 Gartner Enterprise Information Management Maturity Model
    13.2.6 IBM Big Data Analytics Maturity Model
  13.3 How to Build a Data-Driven Organisation?
    13.3.1 Understand the Business
    13.3.2 Aligning Big Data Initiatives to Business Goals and Strategy
    13.3.3 Decision Making Based On Data Evidence
    13.3.4 Build the Big Data Team
    13.3.5 Adopt Best Practices with Big Data
    13.3.6 Top 10 Priorities for Big Data Management (Russom 2013)
  13.4 Big Data Analytics Innovation Examples
    13.4.1 DeepGlint
    13.4.2 Essentia Analytics
    13.4.3 Catapult
    13.4.4 Next Big Sound
    13.4.5 Mark43
    13.4.6 Netflix
    13.4.7 Poshly
    13.4.8 Ayasdi
    13.4.9 Frost Data Capital
    13.4.10 Splunk
    13.4.11 Sumall
  13.5 Summary
  13.6 References
  13.7 Questions and Exercises