SheikhLM Architecture Specifications
This document outlines the architectural decisions and technical specifications for the SheikhLM family of models.
Design Philosophy
SheikhLM is designed for efficiency, speed, and deployment in resource-constrained environments. The architecture incorporates modern best practices from the Llama and Mistral families while maintaining a compact footprint.
Core Architectural Components
- Tokenizer: Byte-Pair Encoding (BPE) with a target vocabulary size of 32,000.
- Activation Function: SwiGLU (a SiLU/Swish-gated linear unit in the feed-forward block). Gated variants such as SwiGLU have been shown to outperform standard GELU feed-forward layers on language-modeling benchmarks (Shazeer, 2020, "GLU Variants Improve Transformer").
- Normalization: RMSNorm (Root Mean Square Layer Normalization), applied before each sub-layer (pre-normalization). RMSNorm is computationally cheaper than standard LayerNorm because it skips mean-centering and carries no bias term.
- Positional Embeddings: RoPE (Rotary Positional Embeddings). RoPE encodes relative positions, supports common context-length extension techniques, and is widely adopted in recent open models, including Llama and Mistral.
- Attention: Standard Multi-Head Attention (MHA) for all variants.
- Embeddings: Tied embeddings (weight tying between the input embedding and the output projection) to reduce the total parameter count, which is particularly beneficial for smaller models. A minimal sketch of these components follows this list.
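As a concrete illustration, here is a minimal PyTorch sketch of the RMSNorm, SwiGLU, and RoPE components described above. This is an independent sketch under stated assumptions (bias-free projections, `eps=1e-6`, RoPE base `theta=10000`), not the actual SheikhLM implementation; all module and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescales features by their RMS.

    Cheaper than LayerNorm because it skips mean-centering and has no bias,
    leaving a single learned scale vector. eps=1e-6 is an assumed default.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated linear unit, three projections."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # down( SiLU(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotate query/key features by position; x is (batch, seq, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = theta ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]    # paired even/odd feature channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The bias-free projections and single learned scale per RMSNorm assumed here are consistent with the parameter totals verified at the end of this document.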
Model Variants
| Feature | SheikhLM-135M | SheikhLM-360M | SheikhLM-1.7B |
|---|---|---|---|
| Parameters | ~135M | ~360M | ~1.7B |
| Hidden Size | 768 | 1024 | 2048 |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 16 |
| Intermediate Size | 2944 | 3072 | 8384 |
| Vocab Size | 32,000 | 32,000 | 32,000 |
| Max Context (tokens) | 2048 | 2048 | 2048 |
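For reference, the variants in the table can be encoded as a small configuration object. The dataclass and field names below are illustrative assumptions, not taken from the SheikhLM codebase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SheikhLMConfig:
    """Hypothetical config mirroring the table above; shared defaults last."""
    hidden_size: int
    num_layers: int
    num_heads: int
    intermediate_size: int
    vocab_size: int = 32_000
    max_context: int = 2_048

SHEIKHLM_135M = SheikhLMConfig(hidden_size=768, num_layers=12, num_heads=12, intermediate_size=2944)
SHEIKHLM_360M = SheikhLMConfig(hidden_size=1024, num_layers=24, num_heads=16, intermediate_size=3072)
SHEIKHLM_1_7B = SheikhLMConfig(hidden_size=2048, num_layers=24, num_heads=16, intermediate_size=8384)
```

Note that hidden size divides evenly by the head count in every variant, giving head dimensions of 64, 64, and 128 respectively.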
Parameter Calculation Verification
Parameter counts include the tied input/output embedding (counted once), all transformer layers (attention + SwiGLU MLP + two RMSNorms each), and the final RMSNorm before the tied output head; linear layers carry no bias terms.
- SheikhLM-135M: 134,302,464 parameters
- SheikhLM-360M: 359,973,888 parameters
- SheikhLM-1.7B: 1,704,560,640 parameters
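The totals above can be reproduced with a short script. The exact breakdown below (tied embedding counted once, four bias-free attention projections, three SwiGLU projections, two RMSNorm scale vectors per layer, plus a final RMSNorm) is an assumption about how the figures were computed, but it reproduces all three published totals exactly.

```python
VOCAB_SIZE = 32_000

def count_parameters(hidden: int, layers: int, intermediate: int) -> int:
    """Count parameters under the assumptions stated above."""
    embedding = VOCAB_SIZE * hidden        # tied input/output embedding, counted once
    attention = 4 * hidden * hidden        # Q, K, V, and output projections (no bias)
    mlp = 3 * hidden * intermediate        # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                     # pre-attention and pre-MLP RMSNorm scales
    final_norm = hidden                    # final RMSNorm before the tied output head
    return embedding + layers * (attention + mlp + norms) + final_norm

variants = {
    "SheikhLM-135M": (768, 12, 2944),
    "SheikhLM-360M": (1024, 24, 3072),
    "SheikhLM-1.7B": (2048, 24, 8384),
}
for name, (h, n_layers, inter) in variants.items():
    print(f"{name}: {count_parameters(h, n_layers, inter):,}")
# SheikhLM-135M: 134,302,464
# SheikhLM-360M: 359,973,888
# SheikhLM-1.7B: 1,704,560,640
```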