Data analysis
Directions of analysis:
- Batch processing: analyzing existing, historical data. Data is processed in large batches along the time dimension (for example, one analysis per week or one analysis per day).
- Real-time processing (streaming): analyzing data as it is generated, at a time granularity of milliseconds or microseconds.
- Predictive analytics (machine learning): predicting future events based on historical and real-time data, focusing on the application of mathematical algorithms such as classification, clustering, correlation, and prediction.
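The difference between these directions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EVENTS` list, the `batch_daily_counts` function, and the `StreamingCounter` class are hypothetical names invented here, and the "prediction" is a naive average standing in for a real machine learning model.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical events: (ISO timestamp, page) pairs invented for illustration.
EVENTS = [
    ("2024-01-01T09:00:00", "home"),
    ("2024-01-01T09:05:00", "cart"),
    ("2024-01-02T10:00:00", "home"),
    ("2024-01-02T10:01:00", "home"),
]


def batch_daily_counts(events):
    """Batch: scan the full historical data set once and count events per day."""
    counts = defaultdict(int)
    for ts, _page in events:
        day = datetime.fromisoformat(ts).date().isoformat()
        counts[day] += 1
    return dict(counts)


class StreamingCounter:
    """Streaming: update a running result as each event arrives."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, ts, page):
        self.counts[page] += 1
        return self.counts[page]  # an up-to-date result is available immediately


# Batch: one pass over everything collected so far.
batch_result = batch_daily_counts(EVENTS)

# Streaming: one small update per incoming event.
stream = StreamingCounter()
stream_results = [stream.on_event(ts, page) for ts, page in EVENTS]

# Prediction (naive placeholder, not a real model): average of past daily counts.
naive_forecast = mean(batch_result.values())
```

The batch function only produces an answer after it has read the whole data set, while the streaming counter returns a fresh answer after every event.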
Steps:
- Ask a question
- Obtain data: collect data from scratch, or transfer and ingest existing data (business data, log data, crawler data, open internet data)
- Data Processing: Data Cleaning, Data Transformation, Data Extraction, Data Calculation to get clean and structured data
- Data Analysis: apply an analysis method, for example PEST Analysis (Political, Economic, Social, Technological)
- Data Presentation: Data Visualization
- Report Writing
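As a rough sketch of the data processing step, the four sub-steps (cleaning, transformation, extraction, calculation) can be chained as small functions. Everything here is an assumption made for illustration: the `RAW` sample data, its column names, and the function names are hypothetical and only use the Python standard library.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export; the columns and values are invented for illustration.
RAW = """user,amount,country
alice, 10.5 ,US
bob,,DE
carol,7,us
alice,4.5,US
"""


def clean(rows):
    """Data cleaning: drop rows with a missing amount and strip stray whitespace."""
    for row in rows:
        amount = row["amount"].strip()
        if amount:
            yield {
                "user": row["user"].strip(),
                "amount": amount,
                "country": row["country"].strip(),
            }


def transform(rows):
    """Data transformation: cast types and normalize country codes."""
    for row in rows:
        yield {**row, "amount": float(row["amount"]), "country": row["country"].upper()}


def extract(rows, country):
    """Data extraction: keep only the records relevant to the question."""
    return (row for row in rows if row["country"] == country)


def calculate(rows):
    """Data calculation: total amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


rows = csv.DictReader(io.StringIO(RAW))
totals = calculate(extract(transform(clean(rows)), "US"))
```

The result `totals` is clean, structured data that the later analysis and presentation steps can work with.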
Big data:
A collection of data that cannot be captured, managed, and processed within an acceptable timeframe using conventional software tools; a massive, high-growth, and diverse information asset that requires new processing models to deliver stronger decision-making, insight discovery, and process optimization capabilities.
- Volume: large amounts of data
- Variety: diverse types and sources
- Value: low value density
- Velocity: high speed
- Veracity: high authenticity
Distributed vs. Clustered
- Distributed: multiple machines, each deploying different components (distributed storage, distributed computing)
- Clustered: multiple machines, each deploying the same components
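The distinction above can be made concrete with a toy check. This is a minimal sketch, assuming a deployment is described as a mapping from machine name to the set of components installed on it; the machine names, component names, and the `classify` function are all hypothetical.

```python
# Hypothetical deployments: machine name -> set of components installed on it.
clustered_deployment = {
    "node1": {"web"},
    "node2": {"web"},
    "node3": {"web"},
}
distributed_deployment = {
    "node1": {"storage"},
    "node2": {"compute"},
    "node3": {"scheduler"},
}


def classify(deployment):
    """Return 'clustered' if every machine runs the same components, else 'distributed'."""
    component_sets = {frozenset(components) for components in deployment.values()}
    return "clustered" if len(component_sets) == 1 else "distributed"


clustered_kind = classify(clustered_deployment)
distributed_kind = classify(distributed_deployment)
```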
Data processing architecture
Traditional data processing architecture
- Transaction processing (OLTP) handles small amounts of data per request, with a fast response time.
- Analytical processing (OLAP) handles high volumes of data, with a slower response time.
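The two workload styles can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the `orders` table and its contents are hypothetical, and a real deployment would typically use separate systems for the two workloads rather than one in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP-style work: short transactions that each touch only a few rows.
with conn:
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 20.0))
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("bob", 5.0))
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 15.0))
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (25.0, 1))

# OLTP-style read: a point lookup by primary key.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (1,)
).fetchone()

# OLAP-style work: an aggregate query that scans the whole table.
revenue = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

The point lookup touches a single row, while the aggregate has to read every order, which is why analytical queries over large volumes respond more slowly.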
Streaming architecture
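- Lambda: two systems (a batch layer and a speed layer) that together provide both low latency and accurate results. The results are good, but the architecture is hard to iterate on because the processing logic must be maintained in two places.
- Kappa: a single system that offers both low latency and accurate results. It relies on a new-generation stream processor such as Flink.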
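The structural difference between the two architectures can be sketched with plain Python counters. This is a toy model under stated assumptions, not how Flink or any real framework is used: the event lists and the names `batch_layer`, `speed_layer`, `lambda_query`, and `KappaProcessor` are hypothetical.

```python
from collections import Counter

# Hypothetical event log: each item is a page name.
HISTORICAL = ["home", "cart", "home"]  # already covered by the last batch run
RECENT = ["home", "checkout"]          # arrived after the last batch run


def batch_layer(events):
    """Lambda batch layer: accurate, but recomputed only periodically."""
    return Counter(events)


def speed_layer(events):
    """Lambda speed layer: low latency, covers only recent data.

    The same counting logic is implemented a second time here, which is
    what makes the Lambda architecture hard to iterate on.
    """
    view = Counter()
    for event in events:
        view[event] += 1
    return view


def lambda_query(historical, recent):
    """Lambda serving step: merge the batch view and the speed view."""
    return batch_layer(historical) + speed_layer(recent)


class KappaProcessor:
    """Kappa: one stream processor; reprocessing means replaying the log."""

    def __init__(self):
        self.state = Counter()

    def process(self, event):
        self.state[event] += 1

    def replay(self, log):
        self.state = Counter()
        for event in log:
            self.process(event)
        return self.state


lambda_result = lambda_query(HISTORICAL, RECENT)
kappa_result = KappaProcessor().replay(HISTORICAL + RECENT)
```

Both approaches reach the same counts, but Kappa does so with a single code path, so a change to the logic is made once and applied by replaying the log.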