Read any popular business magazine and you’re sure to find an article about how data science is the wave of the future. Since 2011, after fifty years of wandering through the halls of academia, real-world employment of data scientists has skyrocketed. Last year, the Harvard Business Review declared the job of Data Scientist to be the Sexiest Job of the 21st Century. Unfortunately, there is no consensus on what being a data scientist actually means.
Data scientists come from the ranks of statisticians, mathematicians, accountants, data engineers, programmers, database managers, data miners, business analysts, risk assessors, and specialists in visualization, machine learning, pattern recognition, simulation, predictive modeling, and quality management (e.g., Six Sigma Green and Black Belts). Programmers might specialize in any of a dozen different languages. Even within statistics, there are data scientists from very disparate branches based on domain expertise and methods, such as survey statistics, mathematical statistics, biostatistics, chemometrics, geostatistics, epidemiology, business statistics, psychometrics, and econometrics. No individual data scientist knows and uses more than a small fraction of the hundreds of methods available for managing and analyzing data. Is it any wonder, then, that it’s impossible to define what a data scientist does except in very general terms? Even universities can’t agree on what a curriculum for training data scientists should look like.
So how might a data scientist describe his or her interests and skills succinctly in a field in which practitioners come from so many different backgrounds? This is a problem akin to typologies for personality assessment, like the Myers-Briggs Type Indicator (MBTI). Such typologies don’t cover every possible characteristic, but rather, summarize a few key dimensions.
In this typology of data scientists, there are five sets of descriptors representing spectra of preferences, skills, or predominant activities. Each data scientist chooses the categories that best describe his or her skills and preferences. An individual data scientist might have many skills and preferences, but only certain of them would predominate. They might also change over time. A typology of data scientists would be a simple way to identify key characteristics that others can quickly understand and use to facilitate working together.
So, here’s what a typology of data scientists might look like.
Every data scientist has a set of methods he or she is familiar with, usually based on training and reinforced by experience. The methods a data scientist uses could be said to fall into two categories — organization and analysis — although there is overlap. Data scientists tend to use methods that are predominately one type or the other. Organizers favor methods involving programming, data warehousing, database management, data parsing, merging, extracting, sorting, filtering, clustering and classification. They tend to be computer scientists, programmers, database managers, data miners, and mathematicians. They often work with Big Data that is compiled, validated, and processed in real time. Analyzers favor methods involving data description, hypothesis testing, and predictive modeling. They tend to be statisticians, business analysts, quality managers, risk assessors, and predictive modelers. They usually work with static datasets that have been extensively scrubbed in preparation for analysis.
There are many different methods a data scientist can rely on, be they programming languages or analysis techniques. In practice, each data scientist tends to have a set of core methods that he or she uses routinely. Usually, the methods are what they learned in school or have found to be successful in their work. Sometimes, the methods are research favorites or specialties they offer for an advantage over business competitors. That leads to two types of data scientist on a spectrum of work practice — generalists and specialists. Generalists will use a variety of methods and software, even going so far as to learn new analysis techniques or programming languages that might be applicable to a given dataset. Specialists will rely on techniques they know well and have used extensively in the past, modifying design elements and method specifications to find the best result for a dataset.
Data scientists also have a tendency to focus on either the domain of the data or a method’s fundamental characteristics. Domain experts honor the sources, meanings, and limitations of the data elements they are studying. They tend to be goal oriented and methodologically flexible. They are often willing to “bend the rules” a bit in order to conduct an analysis. They will use data transformations and other model optimization techniques. They will examine violations of assumptions to assess the severity of impact and possible corrective measures before forgoing a planned analysis. They’ll even consider using unconventional and controversial approaches if they believe the action is warranted. Methods experts understand the mathematical foundation of their analysis technique and how it is implemented by software. They often write their own code, even for routine tasks. They tend to follow rigorous plans and procedures. They “play by the rules.” They avoid deleting outliers and using transformations and stepwise techniques that might capitalize on chance. They will switch to alternative analytical methods upon violation of a method’s assumptions.
Credentials are embodied in education and experience, the more the better, at least in general. Beyond that, it’s impossible to quantify credentials. Education stresses theory; experience stresses application. A good education involves brief exposure to a wide variety of ideas; experience involves a much longer exposure to fewer ideas. A degree represents a package of learning that may or may not be relevant to a job. On-the-job experience is always relevant but may represent either a continually advancing set of skills or the same set of skills repeated again and again.
Data scientists almost always have a relevant degree to begin in the profession. After that, each additional year in school is probably worth two to four years of experience; some say more and some say less. Experience has to be progressive. The first five years are often spent learning about the working world, such as what to do when the boss tells you to make a pie chart. The next five to fifteen years are spent learning how data scientists solve problems. From fifteen to thirty years, you lead projects using data science to solve problems. You also get to bedevil recent graduates by telling them to summarize their regression-on-principal-components model in a single PowerPoint slide. After thirty years, you just tell stories about how you used to do ANOVAs with a pencil and paper.
So, characterize credentials with the combination of degree+experience. Recognize, though, that credentials are difficult to express in a word like the other characteristics of data scientists. Furthermore, degrees and experience are different for every type of data scientist. A BS+1 programmer or mathematician might be on the verge of a major breakthrough. Bill Gates, for example, didn’t have any credentials when he started Microsoft. In contrast, an MS+5 applied statistician is just starting out.
The final characteristic of a data scientist is how they communicate, or at least prefer to communicate, not in terms of media, but rather, in terms of audience and content. Think of the audience as either:
- Inward. Communications inward involves your peers in school, at work, and in the data science profession. These are people who you could show computer code or matrix formulas to and expect that they would be interested.
- Outward. Communications outward involves people who aren’t data scientists but may be interested in what you do, though not in your code or formulas. They may be co-workers, a class you teach, an invisible audience on the internet, or the general public.
Content can be categorized as:
- Top-down. High level, top-down communications that usually involve ideas, concepts, trends, patterns, summaries, mathematical laws, and other general information.
- Bottom-up. Communications involving specific methods, formulas, code, data structures, programming practices, data elements, and other details of data science.
These distinctions form four categories.

| Audience | Communications involving Overviews | Communications involving Details |
|---|---|---|
| Inward | These communicators are often experts, visionaries, and leaders in the data science profession. They can also be people selling software and services to data scientists. | These communicators are often journal authors, specification writers, and others who provide documentation for program code, database structures, and analysis methods and results. |
| Outward | These communicators are often bloggers, reporters, columnists, teachers, and others who present information to the public. They can also be individuals who present data science information to decision makers who are not data scientists. | These communicators are most often college professors and expert witnesses because there aren’t many audiences that consist of individuals who are not data scientists but who want to hear about some details of data science. |
Pick the category of communications that you do the most, are best at, or are most comfortable with. That’s your data scientist communication style. And remember the following:
The ability to visualize and communicate data is critical, because even with good data and rigorous statistical techniques, if the results of an analysis are poorly visualized, they will not convince: whether it’s an academic discovery or a business proposal.
The Three Sexy Skills of Data Geeks, by mike, May 27th, 2009, http://www.dataspora.com/2009/05/sexy-data-geeks/
So, whether you’re a recent graduate data engineer (e.g., a BS+0 specialist organizer with a method focus and a bottom-up, inward communication style), or you are an experienced applied statistician (e.g., an MS+35 generalist analyzer with a domain focus and a top-down, outward communication style), you can express what kind of data scientist you are in just a few words. But more importantly, you can also appreciate how many different types of data scientist there are and where you fit into the profession.
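For readers who think in code, the five descriptors can be treated as fields of a small record. The sketch below is only an illustration, not part of the typology itself: the class and field names (`DataScientistProfile`, `Practice`, `Method`, `Focus`, `Content`, `Audience`) are hypothetical, and the exact wording of the summary string is an assumption modeled on the two examples above.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    ORGANIZER = "organizer"
    ANALYZER = "analyzer"


class Practice(Enum):
    GENERALIST = "generalist"
    SPECIALIST = "specialist"


class Focus(Enum):
    DOMAIN = "domain"
    METHOD = "method"


class Content(Enum):
    TOP_DOWN = "top-down"
    BOTTOM_UP = "bottom-up"


class Audience(Enum):
    INWARD = "inward"
    OUTWARD = "outward"


@dataclass(frozen=True)
class DataScientistProfile:
    """Hypothetical record holding one choice per descriptor."""
    degree: str             # highest relevant degree, e.g. "BS" or "MS"
    years_experience: int   # years of work experience after the degree
    practice: Practice
    method: Method
    focus: Focus
    content: Content
    audience: Audience

    def describe(self) -> str:
        # Credentials are written as degree+experience, as in the article.
        return (
            f"{self.degree}+{self.years_experience} "
            f"{self.practice.value} {self.method.value} "
            f"with a {self.focus.value} focus and a "
            f"{self.content.value}, {self.audience.value} communication style"
        )


# The two examples from the paragraph above.
new_grad = DataScientistProfile(
    "BS", 0, Practice.SPECIALIST, Method.ORGANIZER,
    Focus.METHOD, Content.BOTTOM_UP, Audience.INWARD,
)
veteran = DataScientistProfile(
    "MS", 35, Practice.GENERALIST, Method.ANALYZER,
    Focus.DOMAIN, Content.TOP_DOWN, Audience.OUTWARD,
)

print(new_grad.describe())
print(veteran.describe())
```

Because each descriptor is a spectrum rather than a strict either/or, a two-valued field is a simplification; it records only which end predominates for a given person at a given time.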
Where do spreadsheet specialists fit into this equation? Are we data scientists? I can serve up a pivot table, merge/split/sort data and do a simple regression in R. I’ve used Weka/Rapid Miner for Clustering, Neural Networks, Bayesian Networks as well. I’ve done some visualization in R/Tableau/Excel/<>. I’ve gotten past the stage of just querying and putting information together and have used visualization to find insights. I’ve watched YouTube videos on Time-Series analysis, and mentioned to the project manager at work that we should use a burn-down chart. Am I a data scientist or just a wanna-be hack?
Data scientist for sure and a very broad based one at that. You don’t have to be on the cutting edge of sophistication to be a data scientist. You don’t have to be a recognized expert in both programming and statistics to be a data scientist. It’s like the medical profession. Not every doctor is a brain surgeon or a cancer researcher. Most of the beneficial medicine done in the world is done by general practitioners, even if they don’t get the recognition of the high-profile cardiologists. The difference between a professional and a hack is how well you do the job.
Cool. I feel like I belong now 🙂
Unfortunately no word about a stage or domain or else where/when/why cats’ involvement becomes inevitable ;(
Very comprehensive! Well done!
Very well thought out article. Thank you! I always thought cats make very good points.