When thinking about data structures, think of your car keys. Suppose you lose your car keys at least once a week, and there seems to be no pattern to where they end up. Worse, you’re a bit of a hoarder: you have to slip sideways through doorways because of the piles of magazines, children’s toys, and boxes stacked ceiling-high. How long do you think it would take to find your keys?
But what if your home were organized like Martha Stewart’s, with items categorized and like kept with like, arrayed in pleasing symmetrical order? How easy would it be to find your car keys there?
Why data structure is important
When you’re dealing with terabytes and petabytes of data that you somehow need to analyze and make sense of, the problem is clearly far larger than finding a set of car keys. Basic data structures are used to:
- Input information
- Process information
- Retrieve information
- Maintain information
Data structures are the means by which you can input and organize data in order to do something with it. A program’s algorithm is responsible for the “do something with it” part—it’s the underlying logic that processes and performs operations on your data—so data structures and algorithms are inextricably linked.
When you’re retrieving something (for example, you want to know in real time how your customers are talking about your product across all social networks), the speed and efficiency of your search are of utmost importance. Structuring the data is key to maximizing both so that you can get to the analysis as quickly as possible.
Over time, data structures are also the key to maintaining data. A clear structure is the backbone for adding clear metadata and the basis for applying data cleansing techniques, which keep data useful and relevant.
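To make the point about search speed concrete, here is a minimal Python sketch. The sample posts and the function names are hypothetical, invented only for illustration. It contrasts scanning an unorganized pile of records with looking up the same records once they have been structured as a dictionary keyed by ID.

```python
# Hypothetical sample data: (post_id, text) pairs standing in for social posts.
posts = [
    (101, "love the new release"),
    (102, "shipping was slow"),
    (103, "great support"),
]


def find_unstructured(post_id, records):
    """Check every record in turn; the work grows with the size of the pile."""
    for record_id, text in records:
        if record_id == post_id:
            return text
    return None


# Structure the same data once, as a dictionary keyed by post ID.
posts_by_id = dict(posts)


def find_structured(post_id, index):
    """Go straight to the record by its key instead of scanning."""
    return index.get(post_id)


found_by_scan = find_unstructured(102, posts)
found_by_key = find_structured(102, posts_by_id)
```

Both lookups return the same text. The difference is that the dictionary lookup does not have to walk through every record first, which matters more and more as the data grows.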
Examples of data structures
There are many types of data structures, and knowing which one to use is key to proper organization. To have a well-oiled machine, you don’t always need fancy, advanced data structures. Often, simple solutions are the most effective. Here are a few basic data structures:
- An array is a fixed-length list made up of a collection of objects or values; it’s the most basic data structure. Because elements are stored by position, an array lets you compute the location of each object or value from its index using a simple mathematical formula. Example: Calculating race results in a field of 300 runners.
- Queues structure data in a “first in, first out” order. Just like a real-life queue (or line) of people at a bank, the first one to arrive also leaves first. Example: Serving print requests on a single shared printer.
- Stacks structure data in a “last in, first out” order. Example: Undoing an action in a computer program (one of the more ubiquitous examples is the Back button in your browser).
- Trees are hierarchical data structures that consist of one or more data nodes. The first node is called the root; each node can have zero or more child nodes. While trees can be considered one of the basic data structures, it would be more accurate to call them a fundamental structure, because the tree hierarchy is often used in building some of the most advanced data structures. Example: Storing data that naturally forms a hierarchy, such as an org chart.
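The four structures above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the sample values, the `Node` class, and the `count_nodes` helper are hypothetical. Note also that a Python list can grow and shrink, so it only approximates a true fixed-length array.

```python
from collections import deque

# Array-like: access by position. (A Python list is resizable, unlike a
# true fixed-length array; it is used here only for illustration.)
finish_times = [12.4, 11.9, 13.1]  # hypothetical times, indexed by runner
second_runner_time = finish_times[1]

# Queue: first in, first out, like print jobs on a shared printer.
print_jobs = deque()
print_jobs.append("report.pdf")
print_jobs.append("invoice.pdf")
first_printed = print_jobs.popleft()  # the earliest job leaves first

# Stack: last in, first out, like an undo history.
undo_stack = []
undo_stack.append("type 'a'")
undo_stack.append("type 'b'")
last_undone = undo_stack.pop()  # the most recent action is undone first


# Tree: a root node whose nodes each have zero or more children.
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def count_nodes(node):
    """Count this node plus every node beneath it."""
    return 1 + sum(count_nodes(child) for child in node.children)


org_chart = Node("CEO", [Node("CTO", [Node("Engineer")]), Node("CFO")])
headcount = count_nodes(org_chart)
```

`collections.deque` is used for the queue because removing from the front of a plain list requires shifting every remaining element, while a deque removes from either end efficiently.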
Choosing the best data structure
When choosing your own data structure, it’s important to evaluate the information at hand on its own terms, because no one data structure is proven to be the best for all situations. For example, what is the intended purpose of the data? Are certain data structures better suited to this purpose? How would you like to search your data? Is it more important for you to insert items into your data structure quickly or to access items quickly?
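The insert-versus-access question can be illustrated with a small Python sketch. The two classes below are hypothetical and are only one of many ways to frame the trade-off: `AppendLog` makes inserting cheap but has to scan to find anything, while `SortedStore` does extra work on every insert so that lookups can use binary search.

```python
import bisect


class AppendLog:
    """Cheap inserts: new values simply go on the end. Lookups scan the list."""

    def __init__(self):
        self.items = []

    def insert(self, value):
        self.items.append(value)

    def contains(self, value):
        return value in self.items  # linear scan


class SortedStore:
    """Costlier inserts: the list is kept sorted. Lookups use binary search."""

    def __init__(self):
        self.items = []

    def insert(self, value):
        bisect.insort(self.items, value)  # finds the slot, then shifts elements

    def contains(self, value):
        i = bisect.bisect_left(self.items, value)
        return i < len(self.items) and self.items[i] == value


log = AppendLog()
store = SortedStore()
for value in (5, 1, 3):
    log.insert(value)
    store.insert(value)
```

After these inserts, `log.items` preserves arrival order and `store.items` is sorted. Which one fits better depends on whether the workload is dominated by writes or by searches.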
Often, there is a trade-off between creating a custom data structure that works best for your intended purpose and using a more common structure that will be easier to maintain over time. In that case, it’s best to look at the longevity of the project. Is this something that will be used for many years or does it have a shorter lifespan?
However, in many cases, you have no control over the data structures you’re working with. External data, in particular, can arrive in all sorts of structures and formats; if you want to leverage it, you must work with each structure’s particular challenges and constraints. It’s important to remain flexible and have the right tools in place to deal with all sorts of data structures.
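As a rough sketch of what working with external structures can look like, the Python below takes the same kind of information arriving as CSV and as JSON and normalizes both into one common shape, a list of dictionaries. The sample data, field names, and function names are assumptions made up for this example.

```python
import csv
import io
import json

# Hypothetical external inputs: the same kind of data in two formats.
csv_text = "name,mentions\nAcme,12\nGlobex,7\n"
json_text = '[{"name": "Initech", "mentions": 3}]'


def rows_from_csv(text):
    """Parse CSV text into dicts, converting the count from text to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "mentions": int(row["mentions"])} for row in reader]


def rows_from_json(text):
    """Parse a JSON array of objects into the same dict shape."""
    return [
        {"name": item["name"], "mentions": int(item["mentions"])}
        for item in json.loads(text)
    ]


# Once both sources share one structure, they can be processed together.
records = rows_from_csv(csv_text) + rows_from_json(json_text)
total_mentions = sum(record["mentions"] for record in records)
```

The point of the common shape is that everything downstream (counting, filtering, sorting) only has to be written once, regardless of how the data originally arrived.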
What is data structure’s role in data wrangling?
“Data wrangling” is the process of taking data in its native format and making it usable for analysis. Structuring is only one of the six processes involved in data wrangling, but it is integral to the success of your big data efforts:
“If you talk to any data scientist worth their salt, they will tell you that the first challenge of putting data to work for your business is getting it into a structured format so that you can analyze, interpret and make decisions around your data. This is lovingly referred to as “Data Wrangling” and it’s what sucks up the bulk of the unproductive (wasteful) time (4 out of 5 days, by most accounts). That’s because instead of spending time understanding the data, you’re wasting time trying to pull it all together in a usable format. This is what usually creates the bottleneck in any organization.”
What is data structure? In short, it’s the first, critical step to actually doing something useful with your data.