Data Logistics Flashcards

1
Q

What is the difference between structured, semi-structured, and unstructured data, and how is the abundance of unstructured data being addressed in data science?

A

Structured Data: Highly organized with predefined variables and formats, like spreadsheets and relational databases. Easy to analyze but less common.
Semi-Structured Data: Has some structure but lacks a rigid format. Uses tags to identify variables. Easier to analyze than unstructured data. Examples include HTML, XML, and JSON.
Unstructured Data: Data without any predefined labels or organization, like text, images, videos, or audio. Most abundant but requires the most effort to analyze.
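A minimal sketch of the first two categories using Python's standard library: the same person appears once as a structured CSV row (fixed columns) and once as a semi-structured JSON record, whose tags let it carry nested fields a rigid table cannot (the field names here are illustrative):

```python
import csv
import io
import json

# Structured: every row conforms to the predefined columns.
csv_text = "name,age\nAda,36\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: tags identify each value, and records may vary in shape.
json_text = '{"name": "Ada", "age": 36, "tags": ["math", "computing"]}'
record = json.loads(json_text)

print(rows[0]["name"])   # lookup by predefined column
print(record["tags"])    # nested list only the semi-structured record carries
```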

The challenge of analyzing abundant unstructured data is being addressed through:

Machine Learning: Algorithms are used to identify patterns and structures within unstructured data, making it easier to process.
Deep Learning: A subset of machine learning that uses neural networks to uncover deeper patterns in unstructured data, like recognizing objects in images or understanding natural language.
Manual Labeling: Humans tag and categorize unstructured data to add structure, which can be time-consuming but is often necessary for training machine learning models.

These techniques enable data scientists to extract valuable insights from unstructured data, which was previously a major obstacle in the field.

2
Q

What are the key differences between batch processing and stream processing, and when should each be used?

A

Batch Processing
* Data State: Data at rest (a static, unchanging dataset)
* Analysis Type: Detailed, in-depth, time-consuming models possible
* Goals: In-depth understanding, pattern finding, grouping, prediction
* Infrastructure: Typically stored in databases or files, processed with tools like Hadoop or Spark
* Use Cases: Analyzing customer purchase history, generating financial reports, training machine learning models

Stream Processing
* Data State: Data in motion (continuous arrival, potentially discarding old data)
* Analysis Type: Quick trends, immediate anomaly detection, real-time responsiveness
* Goals: Immediate action, triggering responses based on real-time data
* Infrastructure: Requires specialized stream processing engines like Apache Kafka or Apache Flink
* Use Cases: Fraud detection, system monitoring, real-time recommendations, algorithmic trading
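The contrast above can be sketched in a few lines of plain Python (a toy illustration, not a real engine): the batch side computes an exact aggregate over the full dataset at rest, while the stream side sees events one at a time, keeps only a small sliding window, and reacts immediately when a value looks anomalous. The threshold of twice the recent average is an arbitrary choice for the sketch:

```python
from collections import deque

# Batch: the whole dataset is at rest, so an exact aggregate is possible.
purchases = [120, 80, 200, 95, 310, 150]
batch_mean = sum(purchases) / len(purchases)

# Stream: events arrive one at a time; keep only a sliding window of the
# last three values and flag anything far above the recent average.
window = deque(maxlen=3)
alerts = []
for value in purchases:
    if len(window) == 3 and value > 2 * (sum(window) / len(window)):
        alerts.append(value)  # react immediately, before seeing later data
    window.append(value)

print(batch_mean)  # exact, but only available after the batch completes
print(alerts)      # raised in real time as the stream flows past
```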

3
Q

Why are distributed storage and processing essential for big data, and how have these technologies evolved?

A

Big data often exceeds the capacity of a single computer or system, necessitating distributed solutions that spread storage and processing across multiple machines.

Evolution of Technologies:

Apache Hadoop: Pioneered distributed storage, spreading data across many computers and providing redundancy and fault tolerance. However, its batch-oriented processing model was comparatively slow and inflexible.
Cloud Services: Platforms like AWS, Google Cloud, and Azure emerged to offer both storage and processing, providing scalable and flexible solutions for big data analysis.
Containers (e.g., Docker): Enabled the creation of multi-platform big data applications that can run on various architectures, increasing flexibility and portability.
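The distributed model Hadoop pioneered can be sketched in miniature with plain Python: the data is split into partitions (standing in for separate machines), each partition is mapped independently, and the partial results are reduced into one global answer. A toy map-reduce word count, with lists simulating nodes:

```python
from collections import Counter
from functools import reduce

# Each document stands in for a partition stored on a different machine.
documents = ["big data big storage", "data in motion", "big compute"]

# Map phase: every partition counts its own words independently,
# as if running in parallel on separate nodes.
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce phase: merge the per-partition results into a global count.
total = reduce(lambda a, b: a + b, partial_counts)

print(total["big"], total["data"])
```

Real systems like Hadoop or Spark add what this sketch omits: replicated storage, fault tolerance, and scheduling the map and reduce steps across an actual cluster.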

The Future:

The field of big data storage and processing continues to evolve rapidly, so staying informed about new technologies is crucial to managing and analyzing big data effectively.

4
Q

How can organizations navigate the constantly evolving data landscape to achieve their goals?

A

To thrive in the ever-changing data landscape, organizations should:
Embrace Change: Be open to new technologies, hardware, software, and methods as they emerge.
Prioritize Integration: Focus on integrating new innovations with existing systems and legacy technologies to create a cohesive data environment.
Cultivate Flexibility: Be willing to experiment with new approaches while maintaining valuable components of existing infrastructure.
Focus on Goals: Evaluate new technologies and methods based on their ability to contribute to achieving organizational objectives.
Invest in Continuous Training and Support: Provide ongoing training and support to all members of the organization to ensure they understand and can effectively leverage data.

By following these principles, organizations can navigate the evolving data landscape, derive meaningful insights, and gain a competitive advantage in their field.
