- About the Project
- Results
- Tech Stack
- File Structure
- Dataset
- Model Architecture
- Installation and Setup
- Future Scope
- Acknowledgements
- Contributors
This project develops a lip-reading system that interprets spoken words from sequences of images. Faces are located with Haar Cascade classifiers and the lip region is extracted with dlib's facial landmark detector, and a train-test split ensures robust model evaluation. The core of the project is a hybrid model combining 3D CNNs, which capture spatial features, with LSTMs, which model temporal dynamics. Extensive hyperparameter tuning improves the model's accuracy. The system has been tested on online videos for accuracy and reliability and includes a live detection feature that showcases its real-time capabilities.
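To illustrate the preprocessing step, here is a minimal sketch of face detection with a Haar Cascade followed by a lip crop taken from dlib's 68 facial landmarks (points 48–67). The model files correspond to those in `Dataset Preprocessing/xml files`, but the relative paths, crop margin, and function name are assumptions for illustration, not the project's exact code.

```python
import cv2
import dlib
import numpy as np

# Model files shipped in "Dataset Preprocessing/xml files" (relative paths assumed)
face_cascade = cv2.CascadeClassifier("xml files/haarcascade_frontalface_default.xml")
landmark_predictor = dlib.shape_predictor("xml files/shape_predictor_68_face_landmarks.dat")

def extract_lips(image_path, margin=10):
    """Detect the face, locate the 68 landmarks, and crop the mouth region (points 48-67)."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None

    x, y, w, h = faces[0]
    # Run the landmark predictor inside the detected face rectangle
    shape = landmark_predictor(gray, dlib.rectangle(int(x), int(y), int(x + w), int(y + h)))
    mouth_points = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])

    # Bounding box around the lips with a small margin, clamped to the image
    x_min, y_min = np.maximum(mouth_points.min(axis=0) - margin, 0)
    x_max, y_max = mouth_points.max(axis=0) + margin
    return image[y_min:y_max, x_min:x_max]
```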
Demo video: lip_reading.mp4
Result plots: For Words | For Phrases (see the images in the repository)
Category | Technologies |
---|---|
Programming Languages | Python |
Frameworks | TensorFlow / Keras |
Libraries | OpenCV, dlib |
Deep Learning Models | 3D CNN, LSTM, Bi-LSTM |
Dataset | MIRACL-VC1 |
Tools | Jupyter Notebook, Git |
Visualization & Analysis | |
├── Dataset Preprocessing
│   ├── xml files
│   │   ├── haarcascade_frontalface_default.xml
│   │   ├── haarcascade_mcs_mouth.xml
│   │   └── shape_predictor_68_face_landmarks.dat
│   ├── 01_Face_Extraction.ipynb
│   ├── 02_Lip_Extraction.ipynb
│   └── 03_Train_Test_Split.ipynb
├── Hyperparameter Tuning
│   ├── Grid Search.ipynb
│   └── Random Search.ipynb
├── Mini Projects
│   ├── Cat_Dog_Classifier_CNN.ipynb
│   ├── Human_Action_Recognition_LSTM.ipynb
│   ├── Next_Word_Predictor_LSTM.ipynb
│   └── Video_Anomaly_Detection_CNN_LSTM.ipynb
├── Model Architecture
│   ├── Saved Model
│   │   └── 3D_CNN_Bi-LSTM.h5
│   ├── 3D_CNN.ipynb
│   ├── 3D_CNN_Bi-LSTM.ipynb
│   ├── 3D_CNN_From_Scratch.ipynb
│   ├── 3D_CNN_LSTM.ipynb
│   ├── Adam.ipynb
│   ├── CategoricalCrossentropy.ipynb
│   ├── Data_Augmentation.ipynb
│   ├── Dropout.ipynb
│   ├── EarlyStopping.ipynb
│   ├── L1_Regularization.ipynb
│   ├── L2_Regularization_1.ipynb
│   ├── L2_Regularization_2.ipynb
│   └── RMSprop.ipynb
├── Model Evaluation
│   ├── Accuracy.ipynb
│   ├── Live_Detection.ipynb
│   ├── Onlne_Testing.ipynb
│   ├── Precision.ipynb
│   └── Recall.ipynb
├── Notes
│   ├── LSTM
│   ├── OpenCV
│   ├── Om Mukherjee
│   └── Sourish Phate
└── README.md
The MIRACL-VC1 dataset is structured to facilitate research in visual speech recognition, particularly lip reading. Here's a breakdown of its structure and contents:
- Video Clips: The dataset contains short video clips of multiple speakers reciting specific phrases. Each clip captures the upper body, focusing mainly on the face and mouth area.
- Speakers: It features several speakers from diverse backgrounds, which helps models generalize across different individuals and speaking styles.
- Languages: The dataset is in English, though speakers vary in accent and pronunciation.
- Phrases: Each video clip corresponds to one of a predefined set of phrases, which are recited by the speakers. The phrases are usually short and may cover simple daily expressions or numbers.
Download the MIRACL-VC1 dataset on Kaggle
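As a point of reference, the sketch below iterates one commonly used layout of MIRACL-VC1 (speaker / words-or-phrases / class / instance / colour frames) and collects each utterance's frames into a sequence. The directory layout, folder names, and frame-file pattern are assumptions about the Kaggle download rather than something defined in this repository, so adjust them to match your copy of the dataset.

```python
import glob
import os
import cv2

DATASET_DIR = "MIRACL-VC1"  # assumed local path after downloading from Kaggle

def load_utterances(kind="words"):
    """Yield (speaker, class_id, instance_id, frames) for every utterance.

    Assumes a layout like MIRACL-VC1/<speaker>/<words|phrases>/<class>/<instance>/color_*.jpg;
    change the glob patterns if your download is organised differently.
    """
    for speaker_dir in sorted(glob.glob(os.path.join(DATASET_DIR, "*"))):
        speaker = os.path.basename(speaker_dir)
        for instance_dir in sorted(glob.glob(os.path.join(speaker_dir, kind, "*", "*"))):
            class_id = os.path.basename(os.path.dirname(instance_dir))
            instance_id = os.path.basename(instance_dir)
            frame_paths = sorted(glob.glob(os.path.join(instance_dir, "color_*.jpg")))
            frames = [cv2.imread(p) for p in frame_paths]
            yield speaker, class_id, instance_id, frames

# Example: inspect the first utterance
speaker, class_id, instance_id, frames = next(load_utterances("words"))
print(speaker, class_id, instance_id, len(frames), "frames")
```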
- 3D Convolutional Neural Network (3D CNN): Several convolutional layers are used, each followed by activation functions and pooling layers to reduce dimensionality while preserving essential features.
- Reshape Layer: The tensor dimensions are adjusted to flatten the spatial data into a format that the LSTM can process.
- Long Short-Term Memory (LSTM): One or more LSTM layers process the sequential data, enabling the model to retain information over time and improve prediction accuracy.
- Flatten Layer: This flattens the data without altering its values, preparing it for the next stage.
- Dropout Layer: A dropout rate (e.g., 0.5) controls the fraction of neurons dropped, which prevents overfitting.
- Dense Layers: One or more dense layers with activation functions (e.g., softmax for multi-class classification) output the prediction probabilities.
By combining these components, the model effectively learns to interpret lip movements, translating them into accurate predictions of spoken words.
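A minimal Keras sketch of such a stack is shown below, assuming TensorFlow/Keras as the framework. The input shape (frames, height, width, channels), filter counts, LSTM width, and number of classes are illustrative values, not the exact configuration saved in 3D_CNN_Bi-LSTM.h5.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 10                # e.g. 10 word classes -- assumed value
INPUT_SHAPE = (22, 64, 64, 1)   # (frames, height, width, channels) -- assumed values

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),

    # 3D convolutions capture spatial features across the stacked lip frames
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),

    # Collapse each frame's feature map into a vector so the LSTM sees a sequence
    layers.Reshape((INPUT_SHAPE[0], -1)),

    # Bidirectional LSTM models the temporal dynamics of the lip movement
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),

    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```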
Follow these steps to set up the project environment and install necessary dependencies.
Ensure you have the following software installed:
- Python 3
- Git
Clone the project repository from GitHub:
git clone https://github.com/sourishphate/Project-X-Lip-Reading.git
cd Project-X-Lip-Reading
Install the required Python packages:
pip install -r requirements.txt
Start the live detection application using the following command:
python '.\Model Evaluation\Live_detection.py'
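For context, a heavily simplified sketch of what a real-time loop of this kind involves (webcam capture, a lip crop, frame buffering, prediction) is shown below. The frame size, clip length, placeholder labels, and the crude lower-face crop are illustrative assumptions and do not reproduce the contents of Live_detection.py, which is based on the dlib landmark pipeline described earlier.

```python
import collections
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Assumed paths and settings; the real script may differ
model = load_model("Model Architecture/Saved Model/3D_CNN_Bi-LSTM.h5")
LABELS = ["word_0", "word_1", "word_2"]   # placeholder labels, not the project's list
FRAMES_PER_CLIP = 22                      # assumed sequence length
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

buffer = collections.deque(maxlen=FRAMES_PER_CLIP)
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        # Crude lip region: lower third of the face box (the project uses dlib landmarks)
        lips = gray[y + 2 * h // 3 : y + h, x : x + w]
        lips = cv2.resize(lips, (64, 64)) / 255.0
        buffer.append(lips[..., np.newaxis])

    if len(buffer) == FRAMES_PER_CLIP:
        clip = np.expand_dims(np.stack(buffer), axis=0)   # (1, frames, 64, 64, 1)
        probs = model.predict(clip, verbose=0)[0]
        cv2.putText(frame, LABELS[int(np.argmax(probs))], (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    cv2.imshow("Lip Reading", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```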
If you encounter issues or want to suggest improvements, please raise an issue on GitHub.
- Multilingual Model: Extend the current lip-reading model to support multiple languages, making it adaptable for a global audience and capable of handling diverse linguistic inputs.
- User Interface Development: Design a user-friendly interface that allows real-time interaction with the lip-reading model, improving accessibility and practical usability.
- Sentence-Level Lip Reading: Upgrade the model to read and interpret entire sentences, moving beyond word-level predictions to understand more complex speech patterns.
- Large-Scale Model with Bigger Datasets: Train on much larger datasets to improve the model's ability to generalize across varied lip movements and achieve greater accuracy.
We would like to express our gratitude for all the tools, resources, and courses that helped in the successful completion of this project.
Research Papers
- https://cs229.stanford.edu/proj2019aut/data/assignment_308832_raw/26646023.pdf
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://towardsdatascience.com/automated-lip-reading-simplified-c01789469dd8
Courses
A special thanks to our project mentor Veeransh Shah and to the entire Project X community for their unwavering support and guidance throughout this journey.