TII’s internal benchmarks (included as benchmarks/inference_results.csv) show that Falcon 40B achieves 42 tokens/second on a single A100-80GB when using 4-bit quantization, which is fast enough for real-time chat applications.
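As a rough sanity check on what that throughput means in practice, the sketch below converts tokens/second into per-token latency and estimates the memory taken by the weights at 4 bits each. The 40e9 parameter count is a nominal figure taken from the model name, and the 200-token reply length is a hypothetical chosen for illustration. The memory estimate covers weights only and ignores quantization overhead, activations, and the KV cache.

```python
# Back-of-the-envelope arithmetic only; this is not a benchmark.

TOKENS_PER_SECOND = 42.0   # throughput quoted in the text
NOMINAL_PARAMS = 40e9      # assumption: nominal count implied by the "40B" name
BITS_PER_WEIGHT = 4        # 4-bit quantization
REPLY_TOKENS = 200         # hypothetical chat reply length


def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time to emit one token, in milliseconds."""
    return 1000.0 / tokens_per_second


def reply_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of num_tokens tokens, in seconds."""
    return num_tokens / tokens_per_second


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


print(f"per-token latency: {per_token_latency_ms(TOKENS_PER_SECOND):.1f} ms")
print(f"{REPLY_TOKENS}-token reply: {reply_time_s(REPLY_TOKENS, TOKENS_PER_SECOND):.1f} s")
print(f"weights at {BITS_PER_WEIGHT}-bit: {weight_memory_gb(NOMINAL_PARAMS, BITS_PER_WEIGHT):.0f} GB")
```

Under these assumptions each token arrives in under 25 ms, and the quantized weights occupy roughly a quarter of the card's 80 GB, leaving headroom for the KV cache.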
Falcon does not use learned positional embeddings (like GPT-2) or ALiBi.
Falcon 40B’s performance hinges on a design choice: