The Colossal-AI system is designed for fast training and inference of AI models on diverse hardware. We aim to narrow the gap between fast-growing model sizes and limited hardware capacity. For efficient memory management, we support heterogeneous training with CPU and NVMe offloading. To save activation memory during training, we implement activation checkpointing strategies that recompute some inexpensive activations during the backward pass instead of storing them. With these techniques, we can pre-train a 3-billion-parameter (3B) transformer-based model on 4 A100 40GB GPUs and an 8B model on 4 A100 80GB GPUs, which are 5.9x and 10.3x the model sizes that can be supported without our strategies. To optimize both speed and memory usage, we combine N-dimensional parallelism with the Zero Redundancy Optimizer (ZeRO) and mixed-precision training. The N-dimensional parallelism comprises tensor, pipeline, sequence, and data parallelism. These strategies are designed so that they can be integrated with one another to speed up model training, overcome the memory bottleneck, and increase model performance. When they are combined, we can use longer sequences as inputs, and we achieve up to 7.73x faster single-server training and 1.42x faster single-GPU inference. For more recent large language models such as LLaMA, we obtain a 38% training speedup over other state-of-the-art deep learning systems. Colossal-AI also performs well on large-scale model inference, using dynamic axial parallelism and other techniques. With Colossal-AI, you can predict 3D structures from DNA sequences of length 2-3K with an inference speedup of up to 11.6x. More information about Colossal-AI is available at https://github.com/hpcaitech/ColossalAI.
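To make the memory bottleneck mentioned above concrete, the following back-of-the-envelope sketch estimates the model-state memory for the two configurations quoted (3B on 40GB GPUs, 8B on 80GB GPUs). It rests on assumptions that are not stated in the text: mixed-precision Adam training at 16 bytes per parameter, and model states split evenly across devices in the style of full ZeRO partitioning. The helper `model_state_gb` is a hypothetical name, not part of Colossal-AI. Activations, buffers, and fragmentation are ignored, so real usage is higher, and the sketch does not attempt to reproduce the 5.9x and 10.3x figures.

```python
# Rough model-state memory for mixed-precision Adam training.
# Assumption (not from the text): 16 bytes per parameter, made up of
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights,
# momentum, and variance (4 + 4 + 4).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params: float, shards: int = 1) -> float:
    """Per-device model-state memory in GB (10^9 bytes).

    Assumes the states are split evenly over `shards` devices,
    as full ZeRO-style partitioning would do. Activations are ignored.
    """
    return num_params * BYTES_PER_PARAM / shards / 1e9


# (label, parameter count, per-GPU memory in GB) from the text above.
configs = [("3B", 3e9, 40), ("8B", 8e9, 80)]

for label, params, gpu_gb in configs:
    unsharded = model_state_gb(params)
    sharded = model_state_gb(params, shards=4)
    print(
        f"{label}: {unsharded:.0f} GB unsharded vs {gpu_gb} GB per GPU; "
        f"{sharded:.0f} GB per GPU when split over 4 devices"
    )
```

Under these assumptions the model states alone exceed the memory of a single GPU in both configurations, which is why partitioning, offloading to CPU or NVMe, and activation checkpointing matter even before activation memory is counted.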