This page is a practical setup guide: how to choose and install
a sane Python environment for doing real work in Machine
Learning, Computer Vision, Sound Processing and Natural Language
Processing (AI / ML / CV / NLP). It is aimed at students,
researchers, engineers and practitioners, and it favours open-source
tools that are widely used, well maintained and useful in
practice. It reflects what I rely on day to day; it is not an
exhaustive benchmark of every framework.
Two hardware paths matter today:
CUDA on
NVIDIA GPUs — the
reference setup for serious training. Research scientists can
apply for a free GPU through the
NVIDIA
Academic Grant Program.
Apple Silicon (M-series
chips) — much better than it used to be for local work.
PyTorch can use the
Metal Performance Shaders
backend through the mps
device, which makes prototyping and small experiments
pleasant on a laptop. Apple also develops
MLX,
a NumPy-like array framework with autodiff and unified-memory
tensors, designed for ML research on Apple Silicon (see also
mlx-examples).
Both Apple options are useful, especially for local development, but CUDA on
NVIDIA GPUs remains the reference for heavy training.
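The two hardware paths above can be handled by one small helper. This is a minimal sketch, assuming PyTorch is the framework in use; the function name `pick_device` is hypothetical, and the import guard only keeps the snippet from crashing on a machine where PyTorch is not installed yet.

```python
def pick_device() -> str:
    """Return "cuda", "mps" or "cpu", in that order of preference."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed: nothing to accelerate, so report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in old PyTorch builds, hence the getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed directly to `tensor.to(device)` or `model.to(device)`, so the same script runs unchanged on an NVIDIA workstation and on an Apple Silicon laptop.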
This page has been used in the
MAP5 lab
in Applied Mathematics to conduct research in AI, ML, CV and
NLP in Python. Please feel free to
contact Warith Harchaoui
for improvements and suggestions.
2. Programming in AI
Over the decades, the ML community has moved from
Java to
MATLAB to
Python. Trillion-dollar companies have chosen Python,
and the research community has followed with funding and tooling.
We recommend Python for your ML projects to match the scientific and
industrial trend — not as a personal taste but as a pragmatic default.
Teams on R or Java can still ship through
ONNX-R
and
DL4J;
Rust is gaining ground — see
"Are we learning yet?".
Recommended toolboxes
Ordered chronologically by the first release of the lead tool —
you can read the list as a short history of the field.
Computer Vision tools — libraries, models, annotation, hosted platform— since 2000.
Five complementary pieces. Pick the one that matches the
layer you are working at:
Classical CV —
OpenCV
(2000). Image processing, geometric transforms, camera
calibration, tracking. C++ core with Python bindings; still
the right answer for everything that does not need a
learned representation.
Object detection / instance segmentation —
Detectron2
(Meta FAIR, 2019). Modular PyTorch toolkit for detection,
segmentation, keypoint and panoptic tasks. Mature recipes,
model zoo, and a straightforward training loop — the
default serious-CV stack when you want full control over
the architecture and training recipe. For a faster
practical path, the
Ultralytics
YOLO line
(YOLOv8,
YOLO11, YOLO12 — Python package
pip install ultralytics)
is the dominant ready-to-use detection / segmentation /
pose / oriented-bounding-box stack; AGPL-3.0 with a
commercial license for closed-source use.
Annotation platform —
CVAT
(open source, MIT licence) is the reference tool for
labelling image and video datasets: classification,
detection, temporal labelling, segmentation, keypoints and OCR. The Community version is free. Paired with
an AI coding
agent, it lets you build sophisticated annotation workflows without writing the plumbing
yourself.
Hosted CV platform + model hub —
Roboflow
(2020). Today's Roboflow is end-to-end: train, deploy and
serve CV models, plus
Roboflow Universe
as a public hub of pre-trained models and datasets. They
still ship Roboflow Annotate, but in 2026 their headline
value is the training / deployment / inference stack, not
annotation alone.
Self-supervised feature backbones —
DINOv3
(Meta FAIR, 2025). Strong general-purpose image embeddings
trained without labels; use as a frozen backbone for
downstream tasks (classification, retrieval, segmentation)
when you have few labels or want a robust feature space
without retraining.
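The "frozen backbone" pattern from the last item can be sketched without any deep-learning dependency. The snippet assumes the embeddings have already been computed by a frozen model; the 3-dimensional vectors, the labels and the helper names are made up for illustration, and a nearest-centroid head is only one of several cheap classifiers you could put on top.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class; the backbone itself is never trained."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy stand-ins for backbone outputs (hypothetical values).
train_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(train_vecs, train_labels)
prediction = predict(centroids, [0.7, 0.3, 0.1])
print(prediction)
```

With a handful of labelled images per class, this kind of head is often a reasonable first baseline before fine-tuning anything.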
Classical ML— since 2007 —
scikit-learn.
The canonical Python library for classification, regression,
clustering and model selection. Stable API, good defaults,
excellent docs — start here unless your problem clearly needs
deep learning.
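As an illustration of "start here", a cross-validated baseline can fit in a few lines. This is a sketch, assuming scikit-learn is installed; the Iris dataset, the logistic-regression model and the function name `cross_validated_accuracy` are choices made for the example, not recommendations from the text. The function returns `None` when scikit-learn is missing so the snippet stays runnable anywhere.

```python
def cross_validated_accuracy(n_splits: int = 5):
    """Mean cross-validated accuracy of a scaled logistic regression on Iris."""
    try:
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
    except ImportError:
        return None  # scikit-learn is not installed in this environment

    X, y = load_iris(return_X_y=True)
    # Scaling lives inside the pipeline so it is re-fitted on each training fold.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=n_splits)
    return float(scores.mean())


accuracy = cross_validated_accuracy()
print(accuracy)
```

Keeping preprocessing inside the pipeline avoids leaking statistics from the validation folds into training.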
Deep learning— since 2015 —
today I would usually start with
PyTorch
(2016): widely used in research and practice, large ecosystem,
most new architectures land here first.
TensorFlow
(2015) is still important when you need its ecosystem, its
production tooling or its mobile / edge deployment path
(TF Lite / LiteRT).
Higher-level deep-learning wrappers— since 2015 —
Keras 3
(2015; now multi-backend — runs on TensorFlow, JAX or PyTorch,
with OpenVINO for inference) is the high-level API to consider
when you want one Python interface across backends.
PyTorch Lightning
(2019) removes the training-loop boilerplate and gives you
distributed training and checkpointing for free.
NLP + LLM tools — classical NLP, transformer models, local inference, retrieval, evaluation— since 2015.
Five complementary pieces. Pick the one that matches the layer
you are working at:
Classical NLP + word embeddings —
spaCy
(2015) for industrial-strength tokenization, NER,
dependency parsing and POS tagging.
fastText
(Meta, 2016) for very fast word embeddings and
multilingual text classification — C++ core, Python
bindings, still a strong cheap baseline against
transformer models.
Transformer models + embeddings —
Hugging Face Transformers
(2018) for pre-trained encoders, decoders and seq2seq;
sentence-transformers
(2019) for dense semantic embeddings and retrieval. See
also the standalone Hugging Face bullet below.
Local LLM inference —
Ollama
(2023) is the easiest local-first daemon — it wraps
llama.cpp
for quantised inference on CPU, Apple Silicon (MPS / MLX)
or CUDA. For production-scale serving with paged
attention, continuous batching and OpenAI-API
compatibility, use
vLLM
(2023).
Retrieval + RAG orchestration —
LangChain
or
LlamaIndex
(both 2022) to compose retrievers, prompts and tool
calls. Pair with a vector store —
FAISS
(Meta) for the canonical in-process ANN index, or
turbovec
(2026; Rust core with Python bindings, built on Google
Research's TurboQuant) when memory matters — it
compresses float32 embeddings to 2- or 4-bit and fits
~10 M vectors in ~4 GB while keeping competitive recall,
with no separate training step. My preferred
ANN index for local RAG. Use Qdrant
or
Chroma
when you need a separate service.
LLM evaluation —
DeepEval
(2023) for assertion-style tests with LLM-as-judge in a
Pytest-shaped runner;
Ragas
(2023) for RAG-specific metrics (context relevance,
faithfulness, answer correctness).
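The memory figure quoted for compressed vector stores can be checked with back-of-the-envelope arithmetic. The sketch below assumes 768-dimensional embeddings (an assumption, the text gives no dimension) and ignores any index overhead. The quantiser is plain uniform scalar quantisation, shown only to illustrate the idea; it is not the TurboQuant algorithm and not turbovec's API, and all function names are hypothetical.

```python
def index_size_gb(n_vectors: int, dim: int, bits_per_value: int) -> float:
    """Raw storage for the vectors alone, in GB (1 GB = 1e9 bytes)."""
    return n_vectors * dim * bits_per_value / 8 / 1e9


def quantize_4bit(vec):
    """Map each float to one of 16 evenly spaced levels between min and max."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant vector
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]


float32_gb = index_size_gb(10_000_000, 768, 32)
four_bit_gb = index_size_gb(10_000_000, 768, 4)
print(f"float32: {float32_gb:.2f} GB, 4-bit: {four_bit_gb:.2f} GB")

vector = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize_4bit(vector)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

Under these assumptions, 10 M vectors drop from about 30.7 GB in float32 to about 3.8 GB at 4 bits, which is the order of magnitude quoted above.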
Fairness + explainability + causality — bias metrics, local explanations, counterfactuals, causal effects, audit— since 2016.
Six complementary pieces. Run them before the model
ships, not after a regulator asks:
Fairness metrics + mitigation —
Fairlearn
(Microsoft, 2018) ships demographic parity, equalized
odds and a catalogue of mitigation algorithms
(post-processing, reductions).
Local explanations —
SHAP
(2017) unifies Shapley-value feature attributions across
model families and is the de-facto answer for tabular
and tree models.
Shapash
(MAIF, 2020) sits on top of SHAP and LIME with a
business-readable layer — feature labels, a built-in
webapp for non-ML stakeholders, and one-call export of
individual-prediction reports. Pick it when the
audience for the explanation is not the ML team.
LIME
(2016) is the older perturbation-based approach — still
useful as a sanity check.
Counterfactual explanations —
FACET
(BCG X, 2020) for supervised-learning explainability
plus counterfactual simulation on top of scikit-learn.
DiCE
(Microsoft, 2020) generates diverse counterfactuals —
"what minimal change would flip this prediction?"
Bias auditing pipeline —
Aequitas
(CMU DSSG, 2018) wraps the metric zoo into an end-to-end
audit with reports per protected group — useful when you
need to show your work to a non-ML reviewer.
Deep-net attribution + interactive what-if —
Captum
(Meta / PyTorch, 2019) for integrated gradients and other
attribution methods on neural nets.
What-If Tool
(Google PAIR, 2018) is the TensorBoard / Colab plugin for
tweaking inputs interactively.
Causal inference —
PyWhy
(open-source ecosystem, 2022; spun out of Microsoft
Research). Its core library
DoWhy
ships a four-step pipeline (model → identify → estimate →
refute) for answering "does X cause Y?" rather than "can
we predict Y from X?".
EconML
handles heterogeneous treatment effects when the answer
depends on the subgroup. Causal counterfactuals are the
sibling of the FACET / DiCE counterfactuals above —
instead of "what input change would flip the prediction?"
they ask "what intervention would change the outcome?"
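To make the first item concrete, demographic parity compares the rate of positive predictions across protected groups. The sketch below computes it by hand in plain Python rather than calling Fairlearn; the predictions and group labels are made-up toy data, and the helper names are chosen for this example.

```python
def selection_rates(y_pred, groups):
    """Fraction of positive predictions (label 1) within each group."""
    totals, positives = {}, {}
    for pred, group in zip(y_pred, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest selection rate; 0.0 means parity."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy predictions for two groups of four people each (hypothetical data).
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(y_pred, groups)
gap = demographic_parity_difference(y_pred, groups)
print(rates, gap)
```

Here group "a" is selected 75% of the time and group "b" 25% of the time, so the gap is 0.5. Whether such a gap is acceptable is a policy question, not something the metric decides.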
Tabular-data preprocessing— since 2017 —
skrub
(originally dirty_cat,
renamed in 2024; from the
Probabl /
Inria scikit-learn family). Turns messy real-world tables into
scikit-learn-ready features: fuzzy joins on dirty string keys,
high-cardinality categorical encoders
(GapEncoder,
MinHashEncoder,
TableVectorizer),
and a one-call tabular_learner()
baseline. Use it whenever your CSV needs cleaning before it can
reach a scikit-learn pipeline.
Pre-trained models for CV / NLP / Speech— since 2018 —
Hugging Face.
Model hub plus the transformers,
diffusers
and datasets
libraries — the de-facto distribution layer for open-weights
models. A one-line
from_pretrained()
is the modern equivalent of pip install.
AI testing tools — model scans, LLM unit tests, RAG eval, red-team, data validation— since 2022.
Five complementary pieces. Testing an AI system needs more
than a held-out accuracy number — pick the layer that
matches the risk you actually run:
ML model scans —
Giskard
(2022; open source on
GitHub
plus a hosted hub) scans tabular, NLP and LLM models for
robustness, performance issues, hallucination, prompt
injection and bias — the closest thing to a one-call
"what's wrong with this model?" check.
LLM unit testing —
DeepEval
(2023, open source on
GitHub)
wraps LLM-as-judge metrics in a Pytest-shaped runner —
assertion-style tests for hallucination, answer
relevancy, summarisation quality. Drop it in alongside your
regular test suite.
RAG-specific metrics —
Ragas
(2023) for context relevance, faithfulness and answer
correctness on retrieval-augmented pipelines. Pair with
DeepEval when the system uses both retrieval and free
generation.
LLM red-team + safety —
PyRIT
(Microsoft, 2024) is the Python Risk Identification Tool
for automated red-teaming against generative AI.
garak
(NVIDIA, 2023) is the LLM vulnerability scanner — prompt
injection, jailbreaks, data leakage, hallucination
probes.
Data + schema validation —
Great Expectations
(2017) and
Pandera
(2019) check the data before it reaches the
model. You cannot test a model whose inputs you cannot
audit — these are the upstream fence.
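The "upstream fence" idea from the last item can be illustrated without any library. This is a hand-rolled sketch, not the Great Expectations or Pandera API: the schema format (a type plus a predicate per column), the column names and the function `validate_rows` are all assumptions made for the example.

```python
def validate_rows(rows, schema):
    """Return a list of human-readable problems; an empty list means the data passed."""
    errors = []
    for i, row in enumerate(rows):
        for column, (expected_type, check) in schema.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
                continue
            value = row[column]
            if not isinstance(value, expected_type):
                errors.append(f"row {i}: '{column}' should be {expected_type.__name__}")
            elif not check(value):
                errors.append(f"row {i}: '{column}' value {value!r} is out of range")
    return errors


# Hypothetical schema: each column maps to (type, range check).
schema = {
    "age": (int, lambda v: 0 <= v <= 120),
    "income": (float, lambda v: v >= 0.0),
}
rows = [
    {"age": 34, "income": 52000.0},
    {"age": -3, "income": 41000.0},  # negative age
    {"age": 51},                     # income is missing
]

errors = validate_rows(rows, schema)
for message in errors:
    print(message)
```

Running a check like this before training or inference turns silent data drift into an explicit failure you can act on.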
AI-first code editors— since 2023 —
Cursor
(Anysphere, 2023) is the original AI-first VS Code fork; deep
inline completions, an agent mode that runs commands, and the
editor most ML teams compare everything else against.
Google Antigravity
(2025) is Google's agent-first VS Code fork — up to five
parallel agents in a Manager view, multi-model (Gemini 3,
Claude Opus 4.6, Claude Sonnet, GPT-OSS-120B), with a built-in
Chrome view for visual verification of UI work. Open-source
equivalent:
Cline
(Apache-2 VS Code extension; bring-your-own-key for any
provider, including local Ollama; the most widely used OSS
agentic-coding option and top of the open-source SWE-bench
leaderboard).
Reactive Python notebooks— since 2023 —
Marimo.
A reactive notebook stored as a plain
.py
file (diff-friendly, no JSON, version-control friendly). Cells
form a dataflow graph: change a value upstream and every
downstream cell re-runs automatically — no hidden state, no
"out-of-order Jupyter" debugging sessions. Built-in UI widgets
(mo.ui.slider,
mo.ui.dropdown,
mo.ui.table)
turn a notebook into a small reactive app with
marimo run notebook.py.
Use it as a modern replacement for Jupyter when you want
reproducibility, code review and lightweight ML demos in the
same artefact.
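The dataflow idea behind reactive notebooks can be shown with a toy model. The class below is not how Marimo is implemented and does not use its API; it is a minimal sketch, assuming cells are declared in dependency order, of what "change a value upstream and every downstream cell re-runs" means.

```python
class ReactiveGraph:
    """Toy dataflow graph: each cell is a function of named upstream values."""

    def __init__(self):
        self.values = {}
        self.cells = {}  # name -> (function, input names), in definition order

    def cell(self, name, inputs, func):
        """Register a cell and compute it once from its current inputs."""
        self.cells[name] = (func, tuple(inputs))
        self._run(name)

    def set(self, name, value):
        """Change an upstream value and re-run everything that depends on it."""
        self.values[name] = value
        changed = {name}
        # Definition order is assumed to be a valid topological order.
        for cell_name, (_, inputs) in self.cells.items():
            if changed.intersection(inputs):
                self._run(cell_name)
                changed.add(cell_name)

    def _run(self, name):
        func, inputs = self.cells[name]
        self.values[name] = func(*(self.values[i] for i in inputs))


graph = ReactiveGraph()
graph.set("x", 2)
graph.cell("squared", ["x"], lambda x: x * x)
graph.cell("report", ["squared"], lambda s: f"x squared is {s}")

graph.set("x", 3)  # both downstream cells are recomputed automatically
print(graph.values["report"])
```

Because every value is derived from its declared inputs, there is no hidden state left over from an earlier run.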
AI coding agents— since 2024 —
Claude Code
(Anthropic's official CLI agent — reads your repo, runs
commands, edits files, follows project-specific skills under
~/.claude/skills/)
and
OpenCode
(open-source terminal agent that runs Claude, GPT or local
models behind the same UX, with the same skill format under
~/.opencode/skills/).
See
section 4
below for why this matters.
Machine Learning project tracking— since 2024 —
Skore
(by Probabl,
the scikit-learn company). Sits on top of scikit-learn to give
you opinionated evaluation, model comparison and a project-level
dashboard. Use it the moment you have more than one model worth
comparing.
Practical Python utility libraries the author maintains on GitHub —
the full index lives at
harchaoui.org/warith/ai-helpers.
Each one wraps a specific corner of the AI / media stack so you
do not re-implement the same plumbing in every project.
os-helper —
utility functions for working across operating systems (paths,
environments, shell quirks). The dependency-less base layer
most of the other helpers rely on.
audio-helper —
loading audio, converting formats, separating sources,
splitting / trimming. Wraps the messy parts of ffmpeg, librosa
and friends behind a small Python API.
video-helper —
load, convert and frame-extract video files; work with subtitle
formats. The video-side counterpart to audio-helper, sharing
conventions and CLI shape.
yt-helper —
download videos, audio and thumbnails from YouTube, Vimeo,
Dailymotion (via yt-dlp).
The "give me a clip" layer above the raw downloader.
sftp-helper —
utility functions for SFTP servers (upload, download, walk,
mirror). Useful when your dataset or model artefacts live on
an SFTP share rather than S3.
Model serving and deployment
Model serving hosts ML models (cloud or on-premises) and exposes
them via an API so that applications can integrate AI into their
flow. For cross-language compatibility,
ONNX
is an open model file format with a companion inference engine,
ONNX Runtime, backed by open-source contributions plus Microsoft
and the Linux Foundation AI. Prototype in Python, export to the ONNX format and
deploy with ONNX Runtime from C / C++, C#, Java, JavaScript or Objective-C.
For performance-critical code, the well-trodden path is still
C / C++ called from a higher-level language. For Python, I have
had the best experience with
pybind11
(Meta used it for
fastText).
Cython
is proven in production: scikit-learn relies on it for its compiled routines;
Numba
is a pleasant just-in-time alternative for hot loops. Apple's
coremltools
lets you tap the Neural Engine from Python at inference time.
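As an illustration of the just-in-time route for hot loops, the sketch below decorates a purely numeric loop with Numba's `njit`. It assumes Numba is installed for the speed-up; the fallback decorator is an addition for this example so the same code still runs, only slower, in an environment without Numba. The Leibniz series is just a convenient stand-in for a real hot loop.

```python
import math

try:
    from numba import njit
except ImportError:
    def njit(func):
        # Numba is not installed: run the plain Python function unchanged.
        return func


@njit
def leibniz_pi(n_terms):
    """Approximate pi with the Leibniz series 4 * (1 - 1/3 + 1/5 - ...)."""
    total = 0.0
    sign = 1.0
    for k in range(n_terms):
        total += sign / (2.0 * k + 1.0)
        sign = -sign
    return 4.0 * total


estimate = leibniz_pi(1_000_000)
print(abs(estimate - math.pi))
```

The first call includes compilation time when Numba is present; later calls reuse the compiled machine code.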
Historical note.
Older frameworks and GPU abstraction layers such as
Apache MXNet,
PlaidML and DeepCL played a role in the evolution of deep
learning tooling, but I would not recommend starting a new
project with them today — MXNet in particular has been retired
into the Apache Attic.
3. Installation
For optimal performance, we recommend Ubuntu with NVIDIA GPU acceleration,
optionally controlled remotely from a Mac. For prototyping on small
datasets and for iteration, we recommend
Google Antigravity
for programming on any platform. The environment name is env4ml;
Python version is 3.12
(PyTorch and most maintained packages now support 3.12 by default).
NVIDIA driver and CUDA — follow the
Google Cloud instructions
on your local machine. Mandatory for NVIDIA acceleration.
Activate the environment for each session:
ENV=env4ml
conda activate $ENV
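A quick way to confirm that a session really runs inside the intended environment is to ask Python itself. This is a small sketch, assuming conda is the environment manager: it reads the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, and compares against the environment name and Python version stated above. The function name `environment_report` is hypothetical.

```python
import os
import sys

EXPECTED_ENV = "env4ml"     # environment name used on this page
EXPECTED_PYTHON = (3, 12)   # Python version recommended on this page


def environment_report():
    """Summarise whether the active conda environment and Python match expectations."""
    active_env = os.environ.get("CONDA_DEFAULT_ENV")  # None outside conda
    return {
        "active_env": active_env,
        "env_ok": active_env == EXPECTED_ENV,
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_ok": sys.version_info[:2] == EXPECTED_PYTHON,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `env_ok` is false, the usual cause is a new terminal in which the activation step was skipped.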
macOS (Sonoma 14 or higher, Apple Silicon)
On Apple Silicon Macs the situation is much better than it
used to be. PyTorch can use Apple's
Metal
Performance Shaders
backend through the
mps
device, which makes local experimentation and prototyping
pleasant. Apple also develops
MLX,
a NumPy-like array framework with autodiff designed for ML
research on Apple Silicon. These are useful, especially for
local work; for heavy training, CUDA on NVIDIA GPUs is still
the reference.
Jupyter —
local in-browser environment for demos, hands-on workshops and
quick prototyping. pip install jupyter, then jupyter notebook.
Google Colab —
Jupyter-style, running on Google's GPUs / TPUs.
Streamlit,
Gradio
and
Taipy
— Python-first auto-form generators for ML demos and small
data apps. Excellent for the 80% case; opinionated about
visual style. If you need the result to look like
your app instead of a Streamlit / Gradio / Taipy
app, see
section 5
below.
4. AI coding agents — a new abstraction level
Programming has always moved by adding abstraction layers. Machine code
gave way to assembly. Assembly gave way to C. C gave way to Python,
JavaScript, Rust. Each new level let programmers express more intent
and delegate more mechanism to a tool: the compiler.
Large language models add the next layer — natural language as the
source artifact, with code as a compiled output.
“The hottest new programming language is English.” (Andrej Karpathy)
That is the optimist's framing. The same observation, from the
other side of the abstraction stack, comes from the creator of
Linux:
“AI is a great new tool, but it's a tool, and when I see people saying,
‘Hey, 99% of our code is written by AI,’ I literally get angry, because
those same people — I can pretty much guarantee — that 100% of their
code is written by compilers.”
Both quotes are right at the same time. Karpathy is naming the new
floor — most people will reach a useful program by writing English
instead of Python. Torvalds is naming the ceiling — the engineer who
understands what the compiler (or the model) produced is still the one
who can debug, optimize and ship it. For an ML student today, the
practical synthesis is: treat AI coding agents as
the next compiler, write your intent clearly, and read the
output critically.
Claude Code:
Anthropic's official CLI agent. Reads your repo, runs commands,
edits files and follows project-specific skills declared in
~/.claude/skills/.
Strong at code review, refactors, end-to-end debugging, and
multi-file edits.
OpenCode:
Open-source terminal coding agent that runs Claude, GPT or local
models behind the same UX. Reads the same skill format from
~/.opencode/skills/.
Pick this when you want to avoid vendor lock-in or use a local
model for sensitive code.
5. Front — vanilla JS + Tailwind, with a CLI → GUI flagship
Most ML students and small lab teams hit the same wall: you have a
working Python CLI, your supervisor wants a web UI to demo it, and
nobody on the team wants to learn React. The default answers
(Gradio,
Streamlit,
Taipy,
Tauri)
each work for a slice of the problem but force their look and their
runtime on the result.
Front
is an open-source Claude / OpenCode skill that
constrains the agent to one frontend stack — vanilla JavaScript,
Tailwind CSS, Montserrat (or Inter) — and gives it a curated design
system. Asking the agent to "wrap this CLI in a GUI" produces a
single-page index.html
+ app.js
+ Tailwind config that maps each argparse / click flag to the right
form control and streams the CLI's output to a log panel. No framework
lock-in, no Python runtime, no "Gradio look" — plain HTML you can edit.
Your CLI keeps owning execution. Front emits the UI shell
and a tiny HTTP+SSE adapter to your existing CLI. You don't rewrite your
training loop, your inference script, or your evaluation harness.
The output looks like your project, not the toolkit's.
Streamlit and Gradio apps all look alike because their CSS is hard to
override cleanly. Front emits plain Tailwind that you read, audit and
tweak.
It composes with the rest of your stack.
Drop the emitted HTML into Tauri
for a desktop app, into FastAPI
for a hosted demo, or behind any reverse proxy.
Built-in accessibility and i18n. Focus rings,
dark-mode peers, reduced-motion guards, alt-text drafting and a
contrast audit are part of the skill, not a follow-up sprint.
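The idea of mapping each CLI flag to a form control can be sketched in a few lines. This is not Front's actual implementation, which the text does not show; it is one possible mapping, assuming an argparse-based CLI. It reads the parser's private `_actions` list, and the function names, the example flags and the control vocabulary ("checkbox", "select", "number", "text") are all hypothetical.

```python
import argparse


def control_for(action):
    """Pick a form control type from what argparse knows about the flag."""
    if action.nargs == 0:            # store_true style flags take no value
        return "checkbox"
    if action.choices is not None:   # a fixed set of allowed values
        return "select"
    if action.type in (int, float):
        return "number"
    return "text"


def form_spec(parser):
    """Describe every user-facing flag as a dict a front end could render."""
    spec = []
    for action in parser._actions:   # private attribute, used here for brevity
        if action.dest == "help":
            continue
        spec.append({
            "name": action.dest,
            "control": control_for(action),
            "default": action.default,
        })
    return spec


# Example CLI (made up for illustration).
parser = argparse.ArgumentParser(prog="train")
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--model", choices=["small", "large"], default="small")
parser.add_argument("--verbose", action="store_true")
parser.add_argument("--data", type=str)

spec = form_spec(parser)
for item in spec:
    print(item)
```

A generated UI would then render one input per entry and pass the collected values back to the unchanged CLI.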
6. Conclusion
This page outlines installation procedures for Artificial Intelligence
development in Python and explains where AI coding agents and front-end
skills fit on top of the stack. We recommend beginners explore the
free Kaggle courses
for hands-on practice.
Our choice of conda + pip is pragmatic:
conda manages isolated environments and Python versions.
pip covers the broader Python package universe.
Create the environment with conda, install with both as needed. To
export and recreate: