A Data Scientist’s Guide to Harnessing Distributed Compute in Python
Data scientists are charged with making data useful. To deliver value to an organization, they need the right combination of hardware and software to provide the computing power required to prepare, analyze, and make use of that data.
With more data being generated every day and an ever-expanding ecosystem of software and tools available to analyze it, data scientists face both an opportunity and a challenge: they can use data to solve problems, improve experiences, and build innovative products, but they must first identify the best combination of processes and technology to realize those benefits.
Distributed computing has emerged as a virtual necessity for data science practitioners working on projects that involve massive datasets that cannot be processed within the confines of a single machine. It lets them draw on computing resources spread across a network of machines that coordinate with one another.
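As a minimal sketch of what this looks like in practice, the snippet below uses Dask, one common choice for distributed computing in Python, to spread a dataframe computation across workers; the file path and column names are purely illustrative.

```python
# Minimal sketch, assuming the Dask library; paths and column names are hypothetical.
import dask.dataframe as dd
from dask.distributed import Client

# Start (or connect to) a cluster. With no arguments this creates a local
# cluster; passing a scheduler address would connect to machines on a network.
client = Client()

# Read a dataset far larger than one machine's memory as many partitions.
df = dd.read_parquet("s3://example-bucket/events/*.parquet")

# Each partition is aggregated on whichever worker holds it; only the small
# combined result is pulled back to the local session by .compute().
daily_totals = df.groupby("date")["amount"].sum().compute()
print(daily_totals.head())
```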
In this white paper, we’ll explain the growing demand for distributed compute. We’ll look at typical data science working environments, explain why iteration and speed are critical, and walk through examples of workflow challenges.
We’ll also explore:
- Hyperscaling and the lifecycle of a data science project
- Options for overcoming infrastructure limitations
- 5 typical execution stages of a data science project iteration
- 4 key characteristics of a data scientist’s work and workflow
- How parallelism and distributed systems support data science at scale
- Practical options for distributed computing, as well as challenges to expect