SheikhLM Architecture Specifications
This document outlines the architectural decisions and technical specifications for the SheikhLM family of models.
Design Philosophy
SheikhLM is designed for efficiency, speed, and deployment in resource-constrained environments. The architecture incorporates modern best practices from the Llama and Mistral families while maintaining a compact footprint.
Core Architectural Components
- Tokenizer: Byte-Pair Encoding (BPE) with a target vocabulary size of 32,000.
- Activation Function: SwiGLU (a SiLU/Swish-gated linear unit in the feed-forward block). Gated variants such as SwiGLU have been shown to outperform standard GELU feed-forward layers on language-modeling benchmarks (Shazeer, 2020, "GLU Variants Improve Transformer").
- Normalization: RMSNorm (Root Mean Square Layer Normalization), applied before each sub-layer (pre-normalization). RMSNorm is computationally cheaper than standard LayerNorm because it skips mean-centering and carries no bias term.
- Positional Embeddings: RoPE (Rotary Positional Embeddings). RoPE encodes relative positions, supports common context-length extension techniques, and is widely adopted in recent open models, including Llama and Mistral.
- Attention: Standard Multi-Head Attention (MHA) for all variants.
- Embeddings: Tied embeddings (weight tying between the input embedding and the output projection) to reduce the total parameter count, which is particularly beneficial for smaller models. A minimal sketch of these components follows this list.
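As a concrete illustration, here is a minimal PyTorch sketch of the RMSNorm, SwiGLU, and RoPE components described above. This is an independent sketch under stated assumptions (bias-free projections, `eps=1e-6`, RoPE base `theta=10000`), not the actual SheikhLM implementation; all module and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescales features by their RMS.

    Cheaper than LayerNorm because it skips mean-centering and has no bias,
    leaving a single learned scale vector. eps=1e-6 is an assumed default.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated linear unit, three projections."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # down( SiLU(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotate query/key features by position; x is (batch, seq, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = theta ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]    # paired even/odd feature channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The bias-free projections and single learned scale per RMSNorm assumed here are consistent with the parameter totals verified at the end of this document.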
Model Variants
| Feature | SheikhLM-135M | SheikhLM-360M | SheikhLM-1.7B |
|---|---|---|---|
| Parameters | ~135M | ~360M | ~1.7B |
| Hidden Size | 768 | 1024 | 2048 |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 16 |
| Intermediate Size | 2944 | 3072 | 8384 |
| Vocab Size | 32,000 | 32,000 | 32,000 |
| Max Context (tokens) | 2048 | 2048 | 2048 |
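For reference, the variants in the table can be encoded as a small configuration object. The dataclass and field names below are illustrative assumptions, not taken from the SheikhLM codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SheikhLMConfig:
    """Hypothetical config mirroring the table above; shared defaults last."""
    hidden_size: int
    num_layers: int
    num_heads: int
    intermediate_size: int
    vocab_size: int = 32_000
    max_context: int = 2_048

SHEIKHLM_135M = SheikhLMConfig(hidden_size=768, num_layers=12, num_heads=12, intermediate_size=2944)
SHEIKHLM_360M = SheikhLMConfig(hidden_size=1024, num_layers=24, num_heads=16, intermediate_size=3072)
SHEIKHLM_1_7B = SheikhLMConfig(hidden_size=2048, num_layers=24, num_heads=16, intermediate_size=8384)
```

Note that hidden size divides evenly by the head count in every variant, giving head dimensions of 64, 64, and 128 respectively.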
Parameter Calculation Verification
Parameter counts include the tied input/output embedding (counted once), all transformer layers (attention + SwiGLU MLP + two RMSNorms each), and the final RMSNorm before the tied output head; linear layers carry no bias terms.
- SheikhLM-135M: 134,302,464 parameters
- SheikhLM-360M: 359,973,888 parameters
- SheikhLM-1.7B: 1,704,560,640 parameters
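The totals above can be reproduced with a short script. The exact breakdown below (tied embedding counted once, four bias-free attention projections, three SwiGLU projections, two RMSNorm scale vectors per layer, plus a final RMSNorm) is an assumption about how the figures were computed, but it reproduces all three published totals exactly.

```python
VOCAB_SIZE = 32_000

def count_parameters(hidden: int, layers: int, intermediate: int) -> int:
    """Count parameters under the assumptions stated above."""
    embedding = VOCAB_SIZE * hidden        # tied input/output embedding, counted once
    attention = 4 * hidden * hidden        # Q, K, V, and output projections (no bias)
    mlp = 3 * hidden * intermediate        # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                     # pre-attention and pre-MLP RMSNorm scales
    final_norm = hidden                    # final RMSNorm before the tied output head
    return embedding + layers * (attention + mlp + norms) + final_norm

variants = {
    "SheikhLM-135M": (768, 12, 2944),
    "SheikhLM-360M": (1024, 24, 3072),
    "SheikhLM-1.7B": (2048, 24, 8384),
}
for name, (h, n_layers, inter) in variants.items():
    print(f"{name}: {count_parameters(h, n_layers, inter):,}")
# SheikhLM-135M: 134,302,464
# SheikhLM-360M: 359,973,888
# SheikhLM-1.7B: 1,704,560,640
```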