ml-monitoring
Notes from the Coursera [[ml-ops]] course
The idea is to find metrics/statistics to monitor in real time that indicate something is going wrong in the pipeline. Automate by setting thresholds that flag issues.
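A minimal sketch of what threshold alerting could look like; the metric names and threshold values here are made up for illustration:

```python
# Hypothetical metric names and thresholds, for illustration only.
THRESHOLDS = {
    "latency_ms": 200.0,        # alert if latency exceeds this
    "missing_value_frac": 0.05, # alert if too many inputs have missing fields
    "output_nan_count": 0,      # alert on any NaN in model outputs
}

def check_metrics(metrics: dict) -> list[str]:
    """Return an alert string for each metric that breaches its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts

# Example: one batch of pipeline statistics
print(check_metrics({"latency_ms": 350.0, "missing_value_frac": 0.01}))
```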
Start with redundant metrics and gradually whittle them down to the necessary ones. #idea train a random forest on the logged metrics to find the important ones (sketch below). These metrics provide the feedback in the [[ml-deployment]] loop: deployment → traffic → performance analysis. By analogy, ML goes data → experiment → analysis.
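A hedged sketch of that #idea, assuming you have logged metric vectors labeled with whether an incident occurred; the data here is synthetic and the metric names are placeholders (uses scikit-learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
# Columns stand in for: latency, input_len, missing_frac, nan_count
metrics = rng.normal(size=(n, 4))
# Synthetic labels: pretend only latency predicts incidents
incident = (metrics[:, 0] > 1.0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(metrics, incident)

# Rank metrics by importance; low-importance ones are candidates to drop
names = ["latency", "input_len", "missing_frac", "nan_count"]
for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```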
Examples of metrics (computing a few of these is sketched below):
- Software: memory, compute, latency, throughput, server load
- Input: average input length, noise RMS, number of missing values
- Output: number of NaNs, number of times a user redoes a search
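Computing a few of the input/output metrics might look like this; the field names (`text`, `user_id`) are hypothetical stand-ins for whatever the pipeline actually logs:

```python
import math

def input_metrics(batch: list[dict]) -> dict:
    """Summarize a batch of input records."""
    lengths = [len(ex["text"]) for ex in batch]
    missing = sum(1 for ex in batch if ex.get("user_id") is None)
    return {
        "avg_input_length": sum(lengths) / len(lengths),
        "num_missing_values": missing,
    }

def output_metrics(scores: list[float]) -> dict:
    """Count NaNs in model output scores."""
    return {"num_nans": sum(1 for s in scores if math.isnan(s))}

batch = [{"text": "hello", "user_id": 1}, {"text": "world!!", "user_id": None}]
print(input_metrics(batch))                 # {'avg_input_length': 6.0, 'num_missing_values': 1}
print(output_metrics([0.3, float("nan")]))  # {'num_nans': 1}
```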
The types of metrics depend on the task: user data has more inertia (drifts more slowly) than, say, trading data.
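One way to quantify how fast a metric's distribution is moving is a two-sample test between a reference window and the live window. This sketch uses SciPy's KS test, which is my own choice rather than something from the course; high-inertia data should rarely trigger it, fast-moving data often will:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=5000)  # metric values at deployment time
live = rng.normal(0.3, 1.0, size=5000)       # shifted distribution in production

res = ks_2samp(reference, live)
if res.pvalue < 0.01:
    print(f"drift detected: KS={res.statistic:.3f}, p={res.pvalue:.2e}")
```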