Join Cohere's Inference team to build the systems that serve our language models to enterprise customers at scale. You'll work on optimizing model serving latency and throughput, building efficient batching systems, and developing the infrastructure that powers Cohere's API.
You'll use cutting-edge hardware (H100s, A100s) and software (TensorRT, vLLM, custom CUDA kernels) to push the boundaries of inference efficiency.