I made a post in the Spaces section before the Show and Tell section existed. Happy to be the first to post here hahaha.
I wanted to share my first Hugging Face Space with the community! Quick background on me: I’m learning data science and machine learning on my own after being in physics research for the past couple of years. I’m kinda doing a self-directed master’s degree in machine learning, reading books on my own and stuff. This is my first project where I’ve actually deployed a model for people to see rather than a Jupyter notebook on my GitHub. I’d appreciate any feedback if anyone has something to say!
Anyway, I just completed a little project where I use historical air quality data in Chicago to train an XGBoost classifier and predict the air quality index for the next three days at an hourly resolution. The data is from OpenWeather. I set up GitHub Actions to automatically run my scripts to download new data and retrain the model. My Hugging Face repo should update when those scripts run, but we shall see.
I actually teetered between using the XGBoost classifier and the regressor. The target is integer-valued on a 1-5 scale. I found that the regressor tended to predict close to the mean of all the AQIs. I suppose I could have mapped the range of the predictions back onto the 1-5 target, but using a classifier made things easier.
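For what it's worth, the mapping I'm describing (the road not taken) would be something like rounding the regressor's raw output and clipping it into the valid range. The `raw` values here are made up for illustration:

```python
import numpy as np

# Hypothetical raw outputs from a regressor trained on the 1-5 AQI target.
raw = np.array([2.7, 3.1, 0.4, 5.9, 3.5])

# Round to the nearest integer, then clip into the valid 1-5 range.
mapped = np.clip(np.rint(raw), 1, 5).astype(int)
print(mapped)  # -> [3 3 1 5 4]
```

Note that `np.rint` uses banker's rounding, so 3.5 rounds to 4 here; any rounding convention would do for this purpose.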
The model itself is not that great. I found that it didn’t perform much better than randomly guessing the air quality index (needed a log-loss better than 0.68, got 1.07 on the validation set on average). I suspect my feature engineering is pretty sub-par. I made some features like a few lagged air quality indices and the mean and standard deviation of the indices over a small window in the past, but that didn’t seem to help! Time series are no doubt more complicated than predicting a single value from a tabular data set haha.
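In case anyone wants to sanity-check my feature engineering, the lag and rolling-window features I mean look roughly like this in pandas (toy series, illustrative column names). The `shift(1)` before the rolling window is there so the rolling stats only see past hours, not the current one, to avoid leakage:

```python
import pandas as pd

# Toy hourly AQI series standing in for the Chicago history.
df = pd.Series([2, 3, 3, 4, 2, 1, 2, 3], name="aqi").to_frame()

# Lagged AQI values: what the index was 1, 2, and 3 hours ago.
for h in (1, 2, 3):
    df[f"aqi_lag_{h}"] = df["aqi"].shift(h)

# Rolling mean and std over a small trailing window, shifted by one hour
# so the features use only past information (no leakage of the target).
window = df["aqi"].shift(1).rolling(3)
df["aqi_roll_mean"] = window.mean()
df["aqi_roll_std"] = window.std()

print(df.dropna())  # rows with a full feature set
```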