Data Ingestion

Just like we eat food to get energy, computer systems "eat" or "ingest" data to get information. 🍽️💻

A data ingestion system is a framework or set of processes that facilitates the transfer of data from diverse sources to a storage medium where it can be accessed, used, and analyzed.

Data Collection

Data comes from many different places, like databases, websites, or even smart devices. We'd use things called protocol handlers to make sure our system can "talk" to these sources without any mix-ups.

For data that's scheduled, like grabbing daily weather reports, we'd set up a scheduler to do this job. And for real-time stuff, like live tweets or instant updates, tools like stream processors come into play.

Now, imagine everyone trying to enter a room at once; it'd be chaos. The same can happen with data. So, we'd use backpressure strategies to manage when too much data tries to come in at the same time.

We'd also make sure all our data is safe both while it travels and while it sits in storage. That means it's encrypted, or coded, to keep it private and secure, especially when it's on the move.

And finally, we'd keep a close eye on how things are going. Using monitoring tools, we can see if everything's working right or if there are any hiccups.

Protocol Handlers

In computing, protocols define the rules for how data is transmitted and received over a network. Protocol handlers interpret and implement these rules for specific types of communication protocols, such as HTTP for web traffic or FTP for file transfers. They ensure that data is correctly formatted, transmitted, and understood between different systems or software.
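
As a rough illustration, here is a minimal Python sketch of the idea: a small registry maps a URL's scheme to the handler that knows how to speak that protocol. The `HANDLERS` registry and `ingest` function are illustrative names, not a standard API.

```python
# Minimal sketch: pick a handler based on the URL scheme.
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_http(url: str) -> bytes:
    # HTTP/HTTPS handler: urllib applies the protocol rules for us.
    with urlopen(url) as response:
        return response.read()

HANDLERS = {"http": fetch_http, "https": fetch_http}

def ingest(url: str) -> bytes:
    scheme = urlparse(url).scheme
    try:
        handler = HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"No protocol handler registered for '{scheme}'")
    return handler(url)

# Example: ingest("https://example.com/daily-report.json")
```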

Schedulers

  • Time-Based Scheduling: Tasks are run at specific times or intervals.

  • Event-Driven Scheduling: Tasks are initiated in response to specific events or triggers.

  • Priority-Based Scheduling: Tasks are prioritized, and the scheduler picks the highest priority tasks available.

  • Dependency Scheduling: Tasks are scheduled based on the completion or state of one or more preceding tasks.
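
As a small illustration of time-based scheduling, the sketch below uses Python's standard `sched` module to repeat a task at a fixed interval; `pull_weather_report` is just a placeholder for the real job.

```python
# Minimal time-based scheduler sketch using the standard library.
import sched
import time

def pull_weather_report():
    print("Fetching today's weather report...")  # placeholder for the real pull

scheduler = sched.scheduler(time.time, time.sleep)

def run_repeatedly(interval_seconds=24 * 60 * 60):
    pull_weather_report()
    # Re-schedule the same task so it repeats at the chosen interval.
    scheduler.enter(interval_seconds, 1, run_repeatedly, (interval_seconds,))

scheduler.enter(0, 1, run_repeatedly)
scheduler.run()  # blocks, running the task once per interval
```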

Stream Processors

Stream processors continuously handle and analyze data in real-time (or near-real-time) as it arrives.

Typical operations: transformation, aggregation, filtering, enrichment.

Given the continuous nature of streams, data is often divided into "windows" for analysis. This can be based on time (like every 10 seconds) or event count (like every 1,000 events).
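
A tiny sketch of time-based windowing, assuming events arrive as `(timestamp, payload)` pairs: each event is assigned to a fixed 10-second bucket and the buckets are counted.

```python
# Sketch of a tumbling (fixed-size) time window: count events per 10-second bucket.
from collections import defaultdict

WINDOW_SECONDS = 10

def window_counts(events):
    """events: iterable of (timestamp, payload) pairs, roughly in arrival order."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        bucket = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[bucket] += 1
    return dict(counts)

# Example: window_counts([(1, "a"), (4, "b"), (12, "c")]) -> {0: 2, 10: 1}
```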

Backpressure

In the realm of data processing and streaming, backpressure is a mechanism where receiving components can signal to their producers about their current availability, ensuring data isn't sent faster than what the receiver can process. It's crucial in maintaining system stability and preventing resource exhaustion.
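
One common way to get backpressure in-process is a bounded buffer: the sketch below uses a `queue.Queue` with a maximum size, so a fast producer blocks until the slower consumer catches up. The sizes and sleep are arbitrary illustration values.

```python
# Sketch: a bounded queue gives natural backpressure -- the producer blocks
# (or can time out) when the consumer falls behind, instead of exhausting memory.
import queue
import threading
import time

buffer = queue.Queue(maxsize=100)  # the capacity limit is the backpressure signal

def producer():
    for i in range(1000):
        buffer.put(i, timeout=5)   # blocks when the queue is full

def consumer():
    while True:
        item = buffer.get()
        time.sleep(0.01)           # simulate slower downstream processing
        buffer.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
buffer.join()                      # wait until everything has been processed
```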

Data Processing & Transformation

We start by deciding how to handle the data. Sometimes we look at it in big chunks all at once, which we call batch processing. Other times, we examine it immediately as it comes in, known as stream processing.

To manage this incoming data, especially when there's a lot, we use a temporary storage or "waiting line" where data can briefly pause. This mechanism helps, especially when too much data comes rushing in too fast, letting the system ask for a pause or slow-down.

Some tasks require us to remember previous pieces of data, which is called maintaining state. Depending on the situation, our system might use past data (stateful) or treat each piece of data as brand new (stateless). For stateful processing, it's essential to define what state we keep and where it lives, which we cover in more detail below.

As we work through the data, we often need to adjust or change it to be more useful, a process called transformation. Sometimes, this means adding more details to our data, like if we know someone's city, we might add its population – this is termed enrichment. It's also essential to ensure all our data looks and feels consistent, so we standardize or normalize it.
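
A small sketch of what transformation, normalization, and enrichment can look like on a single record; the `CITY_POPULATION` lookup is made-up illustration data.

```python
# Sketch of transformation: normalize field formats, then enrich with extra context.
CITY_POPULATION = {"paris": 2_100_000, "tokyo": 14_000_000}  # illustrative data

def transform(record: dict) -> dict:
    city = record.get("city", "").strip().lower()      # normalize
    return {
        **record,
        "city": city,
        "city_population": CITY_POPULATION.get(city),   # enrich
    }

# transform({"user": "ada", "city": " Paris "}) ->
#   {"user": "ada", "city": "paris", "city_population": 2100000}
```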

As we're working, it's a good habit to periodically save our progress. That way, if something goes wrong, we don't have to redo everything. This saving is called checkpointing.
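
A minimal sketch of checkpointing, assuming records are processed in order by offset: the last processed position is written to a small file so a restart can resume from there. The file path and `handle` function are placeholders.

```python
# Sketch of checkpointing: persist the last processed position so a restart
# resumes from the checkpoint instead of reprocessing everything.
import json
import os

CHECKPOINT_FILE = "checkpoint.json"   # illustrative path

def handle(record):
    pass                              # placeholder for real per-record processing

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["offset"]
    return 0

def save_checkpoint(offset: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"offset": offset}, f)

def process_stream(records):
    start = load_checkpoint()
    for offset, record in enumerate(records):
        if offset < start:
            continue                      # already processed before the restart
        handle(record)
        if offset % 100 == 0:
            save_checkpoint(offset + 1)   # checkpoint every 100 records
```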

And, to ensure smooth operations, we design the system so that even if we accidentally repeat an action, it doesn't cause issues. That's called idempotence.

Lastly, when our system takes action, we make sure it’s either fully completed or, if there's a hiccup, fully reversed, ensuring the data's integrity. That's called atomicity.

Stateful Processing

Stateful processing involves a system's ability to retain historical data for context. It starts by determining what information to remember and where to store it, be it in quick-access in-memory storage like RAM or more durable solutions like databases. Proper state management requires initializing, updating, and occasionally discarding outdated information. With potential concurrent access, strategies like locks ensure data consistency. As systems scale, state might spread across multiple machines, making distributed databases or caches useful. For resilience, the system can save periodic "checkpoints" or maintain data replicas. Performance can be boosted with techniques like caching, while security remains paramount. Regular monitoring ensures the system's reliability.
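
As a toy example of stateful processing, the counter below keeps a running count per user in memory; in a real system that state might live in a database or distributed cache and be checkpointed as described above.

```python
# Sketch of stateful processing: a running count per user kept in memory.
from collections import defaultdict

class ClickCounter:
    def __init__(self):
        self.counts = defaultdict(int)   # the "state"

    def process(self, event: dict) -> int:
        user = event["user"]
        self.counts[user] += 1           # update state with each new event
        return self.counts[user]

counter = ClickCounter()
counter.process({"user": "ada"})   # -> 1
counter.process({"user": "ada"})   # -> 2  (remembers earlier events)
```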

Idempotence

We achieve idempotence by carefully designing our API endpoints and utilizing unique identifiers, ensuring that repeated actions have consistent and predictable outcomes.
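
A rough sketch of idempotence via unique identifiers: each event carries an ID, and replays of an already-seen ID are ignored. The in-memory set stands in for the durable store a production system would use.

```python
# Sketch of idempotence via unique identifiers: replaying the same event ID
# has no additional effect.
processed_ids = set()

def apply_payment(event: dict) -> bool:
    event_id = event["id"]
    if event_id in processed_ids:
        return False                 # duplicate delivery: safely ignored
    processed_ids.add(event_id)
    # ... perform the actual side effect exactly once ...
    return True

apply_payment({"id": "evt-42", "amount": 10})   # applied
apply_payment({"id": "evt-42", "amount": 10})   # ignored, outcome unchanged
```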

Atomicity

  • Transactions: Most databases support transactions, where a series of operations are grouped and executed as one unit. If any operation in the transaction fails, the whole transaction is rolled back to its initial state.

  • Two-Phase Commit: For distributed systems, where multiple databases or services are involved, a two-phase commit ensures atomicity across them all. First, all participants "vote" on whether they can commit. If all vote yes, a commit happens; otherwise, it's rolled back.

  • Compensating Transactions: In long-running processes, immediate rollback might not be feasible. Instead, if something goes wrong, a compensating transaction is executed to counteract the earlier operations and bring the system back to a consistent state.

  • Locks and Concurrency Control: To prevent multiple operations from interfering with each other, especially in multi-user systems, locks or other concurrency control mechanisms are used.
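
The transaction case is easy to show with SQLite from Python's standard library: both updates inside the `with conn:` block commit together, and any failure rolls the whole unit back.

```python
# Sketch of a database transaction: the transfer either fully completes
# or leaves both accounts untouched.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
except sqlite3.Error:
    print("Transfer failed; neither account was changed.")
```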

Data Storage

When it comes to storing our data, we first need to understand its nature and our access patterns. For structured data, like tables with defined fields, we use relational databases such as MySQL or PostgreSQL. If our data is more flexible or hierarchical, NoSQL databases like MongoDB or Cassandra might be more suitable.

The volume of our data also guides our choices. If we're looking at massive datasets that require distributed storage and fast processing, big data solutions like Hadoop's HDFS or Google's Bigtable come into play. For temporary or cache data that needs quick retrieval, in-memory databases like Redis or Memcached are ideal.

When we talk about the longevity of our data, backups become essential. Regularly backing up our data ensures that, in the event of failures or corruption, we don't lose crucial information. These backups can be incremental, where only changes are saved, or full, where the entire dataset is copied.

Data safety is paramount. Hence, we ensure encryption, both at rest and in transit. Encryption scrambles our data, making it unreadable without the right decryption keys, safeguarding it from unauthorized access.

As our data grows or our user base becomes more geographically diverse, we might use sharding or replication. Sharding divides our database into smaller, more manageable pieces, while replication creates copies of our data to improve access speed and reliability.

Backups

Backups are our safety net against data loss. By regularly copying and storing our data, either on-site, off-site, or on cloud platforms, we ensure that we can always recover essential information. Depending on our needs, these backups can be full, where all data is copied, or incremental, capturing only the changes since the last backup. The frequency, method, and storage of these backups are tailored based on the criticality of the data and our recovery objectives.
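
As a very rough sketch of an incremental backup, the snippet below copies only files modified since the previous run; the paths are illustrative, and real backup tooling also handles deletions, retention, and verification.

```python
# Sketch of an incremental backup: only copy files changed since the last run.
import os
import shutil

SOURCE = "data/"                     # illustrative source directory
DEST = "backup/"                     # illustrative backup destination
STAMP = "backup/.last_backup"        # marker file recording the last run

last_run = os.path.getmtime(STAMP) if os.path.exists(STAMP) else 0.0

for root, _dirs, files in os.walk(SOURCE):
    for name in files:
        src = os.path.join(root, name)
        if os.path.getmtime(src) > last_run:          # changed since last backup
            dst = os.path.join(DEST, os.path.relpath(src, SOURCE))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)

os.makedirs(DEST, exist_ok=True)
open(STAMP, "w").close()                              # record this run's time
```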

Sharding and Replication

To handle large volumes of data or to ensure quick access regardless of geography, we use strategies like sharding and replication. Sharding divides our data into chunks and distributes them across multiple servers, keeping our system responsive and scalable. Replication, on the other hand, involves creating copies of our data. These replicas can serve read requests to balance the load, and also act as a fail-safe in case the primary data source encounters issues.
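
A minimal sketch of hash-based shard routing: hashing the key always maps the same key to the same shard. The shard names are illustrative, and real systems typically layer consistent hashing and replication on top.

```python
# Sketch of hash-based shard routing: the key deterministically maps to one shard.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]   # illustrative shard names

def shard_for(key: str) -> str:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

shard_for("user-12345")   # always routes this user's data to the same shard
```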

Case Study: Data ingestion pipeline with Operation Management - Netflix

Data Generation

The Media Algorithm teams at Netflix run algorithms on media files, like movies or shows. These algorithms analyze the content and generate data, such as identifying objects in a video.

Data Storage

This generated data, referred to as annotations, is ingested into the Marken system. Marken stores these annotations in a structured manner using Cassandra, a database system.

Data Updates

As algorithms improve or change, they might generate new or different data for the same media files. This means that the system needs to ingest updated data and manage it in a way that doesn't conflict with previous data. Instead of updating annotations directly, it manages different sets (or versions) of annotations separately. The approach of not directly updating annotations from previous runs ensures efficiency, maintains historical data, and accommodates the unpredictability of algorithm outputs.
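
Purely as a hypothetical sketch of that idea (not Marken's actual API), the snippet below appends a new annotation-set version for each algorithm run and lets readers fetch the latest completed version, so earlier runs remain untouched.

```python
# Hypothetical sketch of "new version per run": each algorithm run writes a fresh
# annotation set, and readers see the latest one while history stays intact.
annotation_sets = {}   # (media_id, algorithm) -> list of versioned runs

def ingest_run(media_id: str, algorithm: str, annotations: list) -> int:
    runs = annotation_sets.setdefault((media_id, algorithm), [])
    version = len(runs) + 1
    runs.append({"version": version, "annotations": annotations})
    return version

def latest(media_id: str, algorithm: str) -> list:
    runs = annotation_sets.get((media_id, algorithm), [])
    return runs[-1]["annotations"] if runs else []

ingest_run("movie-1", "object-detection", [{"label": "car", "ts": 12.0}])        # v1
ingest_run("movie-1", "object-detection", [{"label": "car"}, {"label": "dog"}])  # v2
latest("movie-1", "object-detection")   # readers always get the newest complete run
```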

Data Retrieval

ElasticSearch, another system used alongside Cassandra, helps in quickly searching and retrieving the ingested data. This is crucial for providing real-time recommendations to Netflix users.

Marken Architecture

Marken's architecture is designed to ensure that the teams or systems producing annotations operate independently of the teams or systems using those annotations.

How Marken separates them:

  • Different Data Paths: Producers put data into Marken, and consumers take data out. They use separate tools (APIs) for these tasks, so they don't interfere with each other.

  • Version Handling: Producers can add new data versions without messing up old ones. Consumers always see the latest version without worrying about ongoing updates.

  • Search Offloading: By using ElasticSearch for searches, consumers don't directly query the primary storage (Cassandra). This separation ensures that frequent searches by consumers don't impact the performance or workload of the primary database.

References

ChatGPT 4 with plugins

https://netflixtechblog.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8