Bangladesh Premier League Analysis
AIM
The main goal of the project is to analyze the performance of players in the Bangladesh Premier League and to identify the top 5 in different fields such as batting, bowling, toss wins, highest run scorer, and man of the match.
DATASET LINK
https://www.kaggle.com/abdunnoor11/bpl-data
MY NOTEBOOK LINK
https://colab.research.google.com/drive/1equud2jwKnmE1qbbTJLsi2BbjuA7B1Si?usp=sharing
DESCRIPTION
- What is the requirement of the project?
  - This project aims to analyze player performance data from the Bangladesh Premier League (BPL) to classify players into categories such as best, good, average, and poor based on their performance.
  - The analysis provides insights for players and coaches, highlighting who needs more training and who requires less, which can aid in strategic planning for future matches.
- Why is it necessary?
  - Analyzing player performance helps in understanding strengths and weaknesses, which can help improve the chances of winning future matches.
  - It aids in making informed decisions about team selection and match strategies.
- How is it beneficial and used?
  - For Players: Provides feedback on their performance, helping them to improve specific aspects of their game.
  - For Coaches: Helps in identifying areas where players need improvement, which can be focused on during training sessions.
  - For Team Management: Assists in strategic decision-making regarding player selection and match planning.
  - For Fans and Analysts: Offers insights into player performances and trends across the league, enhancing the understanding and enjoyment of the game.
- How did you start approaching this project? (Initial thoughts and planning)
  - Performed initial data exploration to understand the structure and contents of the dataset.
  - Learned about the topic by searching for related content, such as what a league is, the Bangladesh Premier League, its players, and more.
  - Learned about the features in detail by searching on Google and Quora.
- Additional resources used (blogs, books, chapters, articles, research papers, etc.)
  - Articles on cricket analytics from websites such as ESPNcricinfo and Cricbuzz.
  - https://www.linkedin.com/pulse/premier-league-202223-data-analysis-part-i-ayomide-aremu-cole-iwn4e/
  - https://analyisport.com/insights/how-is-data-used-in-the-premier-league/
EXPLANATION
DETAILS OF THE DIFFERENT FEATURES
The project uses three datasets:
- Batsman Dataset
- Bowler Dataset
- BPL (Bangladesh Premier League) Dataset
There are 12 features in Batsman Dataset
Feature Name | Description |
---|---|
id | Unique id of the match |
season | Season |
match_no | Match number |
date | Date of play |
player_name | Player name |
comment | How the batsman got out |
R | Runs scored by the batsman |
B | Balls faced by the batsman |
M | Innings duration in minutes |
fours | Fours |
sixs | Sixes |
SR | Strike rate |
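
As a rough illustration of how the top 5 run scorers could be derived from these columns, the sketch below groups by `player_name`, sums `R` and `B`, and recomputes the strike rate on the totals. The sample rows and the helper name `top_run_scorers` are made up for illustration; they are not taken from the BPL data or from the notebook.

```python
import pandas as pd

# Hypothetical sample rows using the Batsman Dataset column names (not real BPL data).
batsman = pd.DataFrame({
    "player_name": ["A", "B", "A", "C", "B", "D", "E", "F"],
    "R": [45, 30, 60, 12, 25, 80, 5, 33],
    "B": [30, 25, 40, 15, 20, 50, 10, 30],
})


def top_run_scorers(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n players with the most total runs, with an aggregate strike rate."""
    totals = df.groupby("player_name")[["R", "B"]].sum()
    # Strike rate = runs scored per 100 balls faced, computed on the summed totals.
    totals["SR"] = (100 * totals["R"] / totals["B"]).round(2)
    return totals.sort_values("R", ascending=False).head(n)


top5 = top_run_scorers(batsman)
print(top5)
```

Computing the strike rate from summed runs and balls, rather than averaging the per-innings `SR` column, avoids giving very short innings the same weight as long ones.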
There are 12 features in Bowler Dataset
Feature Name | Description |
---|---|
id | Unique id of the match |
season | Season |
match_no | Match number |
date | Date of Play |
player_name | Player Name |
O | Overs bowled |
M | Maiden overs |
R | Runs conceded |
W | Wickets taken |
ECON | Economy rate: average number of runs conceded per over bowled |
WD | Wide balls |
NB | No balls |
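
The economy rate described above can be recomputed from `R` and `O`. The sketch below assumes that `O` follows the usual scorecard notation, where `3.2` means 3 overs and 2 balls rather than 3.2 decimal overs; that is an assumption about the dataset, and the function names are hypothetical.

```python
def overs_to_balls(overs) -> int:
    """Convert scorecard overs such as '3.2' (3 overs, 2 balls) to legal balls bowled."""
    whole, _, part = str(overs).partition(".")
    balls = int(part) if part else 0
    if not 0 <= balls <= 5:
        raise ValueError(f"invalid overs value: {overs!r}")
    return int(whole) * 6 + balls


def economy(runs: int, overs) -> float:
    """Runs conceded per six-ball over, rounded to two decimals."""
    balls = overs_to_balls(overs)
    if balls == 0:
        raise ValueError("economy is undefined when no balls were bowled")
    return round(runs * 6 / balls, 2)


print(economy(24, "4"))    # 24 runs off 24 balls
print(economy(20, "3.2"))  # 20 runs off 20 balls
```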
There are 19 features in BPL Dataset
Feature Name | Description |
---|---|
id | Unique id of the match |
season | Season |
match_no | Match number |
date | Date of Play |
team_1 | First Team |
team_1_score | First Team Score |
team_2 | Second Team |
team_2_score | Second Team Score |
player_of_match | Player of the match |
toss_winner | Team that won the toss |
toss_decision | Decision made by the toss-winning team |
winner | Match Winner |
venue | Venue |
city | City |
win_by_wickets | Winning margin in wickets |
win_by_runs | Winning margin in runs |
result | Result of the match |
umpire_1 | First Umpire Name |
umpire_2 | Second Umpire Name |
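
Two of the match-level questions from the aim (toss winners and man of the match) can be answered directly from these columns. The following is a minimal sketch on invented rows, not on the real BPL file; only the column names come from the table above.

```python
import pandas as pd

# Hypothetical rows using the BPL Dataset column names (not real match data).
matches = pd.DataFrame({
    "toss_winner": ["X", "Y", "X", "Z"],
    "winner": ["X", "X", "Z", "Z"],
    "player_of_match": ["P1", "P2", "P3", "P1"],
})

# Share of matches in which the team that won the toss also won the match.
toss_win_rate = (matches["toss_winner"] == matches["winner"]).mean()

# Players ranked by number of player-of-the-match awards (top 5).
top_awards = matches["player_of_match"].value_counts().head(5)

print(f"Toss winner also won the match in {toss_win_rate:.0%} of matches")
print(top_awards)
```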
WHAT I HAVE DONE
- Performed exploratory data analysis (EDA) on the data.
- Created data visualisations to understand the data better.
- Identified strong relationships between the independent features and the dependent feature using correlation.
- Handled missing values using strongly correlated features and dropped unnecessary features.
- Used regression techniques such as Linear Regression, Ridge Regression, Lasso Regression, and deep neural networks to predict the dependent feature.
- Compared the models and used the best-performing one to make predictions.
- Used Mean Squared Error (MSE) and R2 score to evaluate model performance.
- Visualized the best model's performance using the matplotlib and seaborn libraries.
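
The model comparison step can be sketched as below. This is not the notebook's code: the data is synthetic, the hyperparameters (`alpha`, the 75/25 split, the random seed) are arbitrary choices for illustration, and only three of the regressors are shown.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the engineered BPL features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.01),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

# Pick the model with the lowest test MSE.
best = min(results, key=lambda name: results[name]["MSE"])
for name, scores in results.items():
    print(f"{name}: MSE={scores['MSE']:.4f}, R2={scores['R2']:.4f}")
print("Best by MSE:", best)
```

Scoring every model on the same held-out split keeps the MSE and R2 values comparable across models.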
PROJECT TRADE-OFFS AND SOLUTIONS
- Trade-off 1: Handling missing and inconsistent data entries.
  - Solution:
    - Data Imputation: For missing numerical values, I used techniques such as mean, median, or mode imputation based on the distribution of the data.
    - Data Cleaning: For inconsistent entries, I standardized the data by removing duplicates, correcting typos, and ensuring uniform formatting.
    - Dropping Irrelevant Data: In cases where the missing data was extensive and could not be reliably imputed, I decided to drop those rows/columns to maintain data integrity.
- Trade-off 2: Extracting target variables from the dataset.
  - Solution:
    - Feature Engineering: Created new features that could serve as target variables, such as aggregating player statistics to determine top performers.
    - Domain Knowledge: Utilized cricket domain knowledge to identify relevant metrics (e.g., strike rate, economy rate) and used them to define target variables.
    - Label Encoding: For categorical target variables (e.g., player categories like best, good, average, poor), I used label encoding techniques to convert them into numerical format for analysis.
- Trade-off 3: Creating clear and informative visualizations that effectively communicate the findings.
  - Solution:
    - Tool Selection: Used powerful visualization tools like Matplotlib and Seaborn in Python, which provide a wide range of customization options.
    - Visualization Best Practices: Followed best practices such as using appropriate chart types (e.g., bar charts for categorical data, scatter plots for correlations), adding labels and titles, and ensuring readability.
    - Iterative Refinement: Iteratively refined visualizations based on feedback and self-review to enhance clarity and informativeness.
- Trade-off 4: Correctly interpreting the results to provide actionable insights.
  - Solution:
    - Cross-validation: Used cross-validation techniques to ensure the reliability and accuracy of the analysis results.
    - Collaboration with Experts: Engaged with cricket experts and enthusiasts to validate the findings and gain additional perspectives.
    - Contextual Understanding: Interpreted results within the context of the game, considering factors such as player roles, match conditions, and historical performance to provide meaningful and actionable insights.
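
The imputation and target-extraction steps from Trade-offs 1 and 2 might look roughly like this. The strike-rate thresholds used for the best/good/average/poor categories are assumptions chosen for the example, not values from the project, and pandas ordered categorical codes stand in for a label encoder so that the category order is preserved.

```python
import pandas as pd

# Hypothetical per-player strike rates with one missing value (not real BPL data).
players = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E"],
    "SR": [150.0, None, 95.0, 120.0, 60.0],
})

# Median imputation for the missing numerical value.
players["SR"] = players["SR"].fillna(players["SR"].median())

# Assumed thresholds for turning a metric into player categories.
bins = [0, 80, 110, 140, float("inf")]
labels = ["poor", "average", "good", "best"]
players["category"] = pd.cut(players["SR"], bins=bins, labels=labels)

# Ordered categorical codes: poor=0, average=1, good=2, best=3.
players["category_code"] = players["category"].cat.codes

print(players)
```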
LIBRARIES NEEDED
- matplotlib
- pandas
- scikit-learn (sklearn)
- seaborn
- numpy
- scipy
- xgboost
- tensorflow
- keras
SCREENSHOTS
MODELS USED AND THEIR ACCURACIES
Model | MSE | R2 |
---|---|---|
Random Forest Regression | 19.355984 | 0.371316 |
Gradient Boosting Regression | 19.420494 | 0.369221 |
XG Boost Regression | 21.349168 | 0.306577 |
Ridge Regression | 26.813981 | 0.129080 |
Linear Regression | 26.916888 | 0.125737 |
Deep Neural Network | 27.758216 | 0.098411 |
Decision Tree Regression | 29.044533 | 0.056631 |
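
For reading the table: a lower MSE is better, and an R2 closer to 1 is better (an R2 of about 0.37 means the model explains roughly 37% of the variance in the target). The sketch below shows how the two metrics are defined, using made-up numbers rather than the project's predictions.

```python
def mse(y_true, y_pred) -> float:
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred) -> float:
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

print(mse(y_true, y_pred))
print(r2(y_true, y_pred))
```

A model that always predicts the mean of the target gets an R2 of 0, which is a useful baseline when comparing the scores above.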
MODELS COMPARISON GRAPHS
CONCLUSION
- The R2 score and Mean Squared Error are best for Random Forest Regression (MSE 19.36, R2 0.37).
- The deep neural network did not achieve the lowest Mean Squared Error; it ranks sixth of the seven models (MSE 27.76).
- Random Forest Regression therefore gives the most accurate results of the models compared for predicting the Bangladesh Premier League winning team.
WHAT YOU HAVE LEARNED
- Insights gained from the data:
  - Identified key performance indicators for players in the Bangladesh Premier League, such as top scorers, best bowlers, and players with the most man of the match awards.
  - Discovered trends and patterns in player performances that could inform future strategies and training programs.
  - Gained a deeper understanding of the distribution of player performances across different matches and seasons.
- Improvements in understanding machine learning concepts:
  - Enhanced knowledge of data cleaning and preprocessing techniques to handle real-world datasets.
  - Improved skills in exploratory data analysis (EDA) to extract meaningful insights from raw data.
  - Learned how to use visualization tools to effectively communicate data-driven findings.
USE CASES OF THIS MODEL
- Application 1: Team Selection and Strategy Planning
  - Explanation: Coaches and team managers can use the model to analyze player performance data and make informed decisions about team selection and match strategies. By identifying top performers and areas for improvement, the model can help optimize team composition and tactics for future matches.
- Application 2: Player Performance Monitoring and Training
  - Explanation: The model can be used to track player performance over time and identify trends in their performance. This information can be used by coaches to tailor training programs to address specific weaknesses and enhance overall player development. By monitoring performance metrics, the model can help ensure that players are continuously improving.
HOW TO INTEGRATE THIS MODEL IN REAL WORLD
- Prepare the data pipeline
- Deploy the model using appropriate tools (e.g., Flask, Docker)
- Monitor and maintain the model in production
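
One way to approach the deployment step is to keep the request handling separate from the web framework, so the same function can be wrapped by a Flask route and tested on its own. The sketch below is framework-agnostic and uses only the standard library; the feature list `FEATURES`, the helper `make_handler`, and the stand-in model are all hypothetical.

```python
import json
from typing import Callable, Sequence, Tuple

# Hypothetical feature order expected by the trained model.
FEATURES = ("R", "B", "fours", "sixs")


def make_handler(predict: Callable[[Sequence[float]], float]):
    """Build a function that maps a JSON request body to (status code, JSON response)."""

    def handle(body: str) -> Tuple[int, str]:
        try:
            payload = json.loads(body)
            row = [float(payload[name]) for name in FEATURES]
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON, missing features, or non-numeric values.
            return 400, json.dumps({"error": f"bad request: {exc}"})
        return 200, json.dumps({"prediction": predict(row)})

    return handle


# Stand-in for a trained model: simply averages the inputs.
handle = make_handler(lambda row: sum(row) / len(row))

status, response = handle('{"R": 40, "B": 30, "fours": 4, "sixs": 2}')
print(status, response)
```

In a real service, `predict` would call the trained model's prediction method, and a web framework route would pass the raw request body to `handle` and return its status and response.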
FEATURES PLANNED BUT NOT IMPLEMENTED
- Feature 1: Real-time Performance Tracking
  - Description: Implementing a real-time tracking system to update player performance metrics during live matches.
  - Reason it couldn't be implemented: Lack of access to live data streams and the complexity of integrating real-time data processing.
- Feature 2: Advanced Predictive Analytics
  - Description: Using advanced machine learning algorithms to predict future player performances and match outcomes.
  - Reason it couldn't be implemented: Constraints in computational resources and the need for more sophisticated modeling techniques that were beyond the current scope of the project.
YOUR NAME
Avdhesh Varshney