## About

I am a PhD student at the Courant Institute of Mathematical Sciences; my advisor is Yann LeCun. Entering my 6th year, it's probably a good idea to recollect some pieces of my memory. Some recent news: after my PhD I went to France to work with Stephane Mallat as a postdoc, and I have now come back to China as a research associate at Peking University (starting July 2018).

## Research

NIPS 2011 Optimization Challenge: This was my first attempt to study the optimization methods used to train Neural Networks. It was the time of the stochastic gradient descent (SGD) method. We asked whether it's possible to incorporate 2nd-order methods like Newton or (L-)BFGS to speed up the training. It turns out that the curvature information is very non-trivial to estimate, due to the stochastic nature of the gradients and their high dimensionality. We also played with the Averaging SGD idea, which helps to stabilize SGD. However, its benefit is mostly asymptotic. It is still not clear to me when to start the averaging at all, due to the non-stationarity of the process and the non-convexity of the objective function. When it's coupled with the choice of the learning rates, it becomes quite a headache. What turned out to be quite effective in the simulations is the idea of the moving average. It's a simple idea that has always been on my mind, and years later it motivated the Elastic Averaging SGD method. However, before going there, it was a nice detour to study the choice of the learning rates when Tom Schaul was around.

No more pesky learning rates: That's a big dream. People have tried very hard to close the control loop, i.e. to make the learning rates or gradient directions adaptive to the feedback. Very limited theory is known, due to the complexity of the Neural Network models that we are training. Even our starting point, the one-dimensional quadratic function, is quite hard to deal with. It'd be nice to better understand its stability properties under certain large deviations of the stochastic gradients. The dream: if only the machine could run forever.

Regularization of Neural Networks using DropConnect: It was the time when Geoffrey Hinton proposed a new regularization method called Dropout. I was taking a neural science class that semester with Professor Charles S. Peskin. During the course, I learned that the synapses in the nervous system behave in a random and quantized way. It's quite a puzzle to me why that happens. So I talked with Wan Li, who was playing with Dropout at the time, and suggested that maybe it makes more sense to have the connections, instead of the neurons, behave randomly. So we tried, and got fairly good empirical results. However, the regularization of neural networks, or in general the regularization of machine learning models, is still an art to me. What would the world be like if there were no ethics to regularize our own behavior? Philosophically, when we run the learning algorithm, the information seems to get lost over time. It's the introduction of extra information in the form of regularization that preserves the total amount of information in the world. Can we expect and formalize this form of information conservation, like that of energy in the physical world?
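To make the "random connections instead of random neurons" idea concrete, here is a minimal sketch of a single linear layer with a random mask on each weight. It is only an illustration under my own assumptions: the function name `dropconnect_forward`, the parameter `keep_prob`, and the inference rule (using the expected mask value instead of a sampled one) are simplifications chosen for this example and are not taken from the paper's code.

```python
import random


def dropconnect_forward(weights, x, keep_prob, rng, training=True):
    """Compute y = (M * W) x for one linear layer, where M masks connections.

    weights: list of rows (one row per output unit), x: list of inputs.
    Hypothetical helper for illustration only.
    """
    out = []
    for row in weights:
        total = 0.0
        for w, xi in zip(row, x):
            if training:
                # Each connection (weight) is kept independently with
                # probability keep_prob. Dropout would instead draw one
                # mask value per unit, not per connection.
                m = 1.0 if rng.random() < keep_prob else 0.0
            else:
                # Assumed simplification: use the expected mask at inference.
                m = keep_prob
            total += m * w * xi
        out.append(total)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[1.0, 2.0], [3.0, 4.0]]
    print(dropconnect_forward(W, [1.0, 1.0], keep_prob=0.5, rng=rng))
```

The only difference from a Dropout-style layer in this sketch is where the random mask is drawn: inside the inner loop over weights rather than once per output unit.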

Deep learning with Elastic Averaging SGD: The idea of the elastic averaging is an ideal. Technically, if the Markov process associated with the stochastic gradient descent method is ergodic, it makes sense to estimate the spatial average instead of the time average. So the hope is that the deep learning problem would be 'simple' enough that we would not see different ergodic components. The name comes from a discussion with my advisor Yann, during which he proposed the word Elastic, and I said the word Averaging. However, the paper was rejected many times, until Anna Choromanska came and questioned: why do you guys give a new name to this ADMM-variant? It took us another year to find an answer. In the paper, we theoretically study the stability of the ADMM and EASGD methods in the round-robin case, and point out conditions, numerically quite invisible, under which ADMM behaves unstably. One illustrative example is available in MATLAB. Recently, we opened the source code of the EASGD/EAMSGD methods on GitHub. It is the authors' hope that the knowledge be shared freely, and that the research be further advanced.
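As a rough illustration of the elastic coupling, the sketch below runs a synchronous variant on a one-dimensional quadratic f(x) = x²/2 with noisy gradients. Everything here is an assumption made for the example: the function name `easgd_quadratic`, the parameter values, the common starting point, and the choice of the synchronous update (each worker is pulled toward a center variable, and the center moves toward the workers), not the round-robin scheme analyzed in the paper or the released source code.

```python
import random


def easgd_quadratic(num_workers=4, steps=200, eta=0.1, alpha=0.05,
                    noise=0.1, x0=1.0, seed=0):
    """Synchronous elastic-averaging sketch on f(x) = x^2 / 2.

    eta is the learning rate, alpha the elastic coupling strength.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    workers = [x0] * num_workers  # local variables, one per worker
    center = x0                   # shared center variable
    for _ in range(steps):
        # Noisy gradient of x^2 / 2 at each worker's own position.
        grads = [x + noise * rng.gauss(0.0, 1.0) for x in workers]
        # Elastic differences, all computed from the same (old) center.
        diffs = [x - center for x in workers]
        # Worker step: gradient descent plus a pull toward the center.
        workers = [x - eta * g - alpha * d
                   for x, g, d in zip(workers, grads, diffs)]
        # Center step: moves toward the workers by the same elastic force.
        center = center + alpha * sum(diffs)
    return workers, center


if __name__ == "__main__":
    ws, c = easgd_quadratic()
    print("workers:", ws)
    print("center:", c)
```

With `noise=0.0` the iteration is deterministic and both the workers and the center contract toward the minimum at 0 for these parameter values; with noise, the center behaves like a spatial average over the workers.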

## Study

I obtained my Bachelor in Mathematics from Shanghai University. However, it was a time when I was interested in Computer Science. I had fun participating in the ACM programming contest, as well as the MCM mathematical modeling contest. My bachelor thesis turned out to be a study of the face detection problem using the AdaBoost method.

Meanwhile, I was studying French and had planned further study in France. I don't remember exactly why I went to Paris VI to study mathematics. It was really a tough year as a new arrival in a territory without knowing much of the language. Luckily, the language of mathematics is quite universal. In the second semester of the year, I conducted a research study on the classical maximum likelihood estimator, guided by Professor Lucien Birge. He introduced me to the world of statistics, which gave me a good foundation to further pursue my interest in Machine Learning. So in the second year, I went to ENS Cachan, where I had the chance to study machine learning in depth. However, I was still thinking of another option, mathematical finance. So I did an internship at CREDIT AGRICOLE Cheuvreux, working with two researchers, Charles-Albert Lehalle and Romain Burgot. Mathematically, we had fun modeling the impact of news on the financial market using a Kalman filter. But I realized that I didn't really understand the world of finance, and that my interest was still in machine learning.

I was lucky to get an offer from the NYU Courant Institute to further study this. Back then, I naively thought that a machine learning problem is simply an optimization problem. So I proposed to my PhD advisor Yann to study various optimization methods as the main focus of my thesis. It's a nice focus. However, this time, being a computer science major, I started to get more interested in mathematics. It's my honor to be able to engage with the broad range of mathematical classes and seminars at the Courant Institute. I also had the chance to study photography at Tisch with undergraduate students, and to explore its connection with mathematics, computer science and machine learning: the positive and negative side of the world, or maybe Yin and Yang.

## Mail Address

715 Broadway, Room 1208, New York, NY 10003, U.S.A.

zsx@cims.nyu.edu