High-Performance Data Science with Modern C++

This is the home of an executable book project about using Modern C++ for high-performance data science.

It’s a companion to a series of talks by Armin Sobhani for the Compute Ontario Colloquia.

It’ll be updated as more talks in the series are delivered.

Try in a Container 🛠️

docker run -p 8888:8888 -it --rm asobhani/high-performance-data-science-with-modern-cpp

Or this one with CUDA support:

docker run --gpus=all -p 8888:8888 -it --rm asobhani/high-performance-data-science-with-modern-cpp:latest-cuda

Or with Apptainer:

apptainer run docker://asobhani/high-performance-data-science-with-modern-cpp:latest

Or this one with CUDA support:

apptainer run --nv docker://asobhani/high-performance-data-science-with-modern-cpp:latest-cuda

C++ vs. Python for Data Science

😌 Ease of Use

📚 Community and Libraries

C++'s ecosystem is not as extensive as Python’s for data science 👎
Python has extensive libraries such as NumPy, Pandas, and Matplotlib, plus a large and active community 👍

🏃 Performance

🔀 Concurrency

C++ has built-in support for concurrency (C++11) and parallel algorithms (C++17) 👍
Python’s global interpreter lock can be a limitation for multi-threaded applications 👎
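
As a rough illustration of the C++11 concurrency support mentioned above, the sketch below sums a vector by handing contiguous chunks to `std::async` tasks. The function name `parallel_sum` and the default of four tasks are assumptions made for this example and do not come from the book. It is a minimal sketch, not a tuned implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the book): sums `data` by splitting it into
// at most `n_tasks` contiguous chunks and reducing each chunk in its own task.
double parallel_sum(const std::vector<double>& data, std::size_t n_tasks = 4)
{
    assert(n_tasks > 0);

    // Chunk length, rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + n_tasks - 1) / n_tasks;

    std::vector<std::future<double>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, data.size());
        // std::launch::async requests that the task run on its own thread.
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end),
                0.0);
        }));
    }

    // Combine the partial sums; get() blocks until each task has finished.
    double total = 0.0;
    for (auto& part : parts)
        total += part.get();
    return total;
}
```

Calling `parallel_sum(values)` on a `std::vector<double>` returns the same total as a sequential `std::accumulate`, up to floating-point rounding differences caused by the changed summation order. Because the tasks are ordinary native threads, they are not serialized by an interpreter lock.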

💼 Memory Management

C++ offers fine-grained control over memory management, which can be crucial for large-scale data processing 👍
Python manages memory automatically and offers less control than C++ 👎
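
One small example of the fine-grained control described above is reserving a buffer up front, so that filling a large container triggers a single allocation instead of repeated reallocations. The helper name `squares` is hypothetical and chosen only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the book): returns 0^2, 1^2, ..., (n-1)^2,
// allocating the output buffer exactly once.
std::vector<double> squares(std::size_t n)
{
    std::vector<double> out;
    out.reserve(n);  // one up-front allocation for all n elements

    // Remember where the buffer lives so we can check that it never moves.
    const double* const storage = out.data();

    for (std::size_t i = 0; i < n; ++i)
    {
        const double x = static_cast<double>(i);
        out.push_back(x * x);  // stays within capacity, so no reallocation
    }

    assert(out.data() == storage);  // the buffer was not reallocated
    return out;
}
```

For example, `squares(5)` yields a vector holding 0, 1, 4, 9, and 16. For large inputs, the same idea avoids repeated copying of data as the container grows.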

💫 Rapid Prototyping

C++'s compiled nature makes it a lackluster choice 👎
Python’s interpreted nature combined with Project Jupyter makes it a perfect match for the job 👍