The extension of propositional logic with quantifiers over the propositional variables leads to the logic of quantified Boolean formulas (QBFs). From a theoretical point of view, the decision problem of QBF, called QSAT, is PSPACE-complete. Nevertheless, QBFs have been used successfully to encode and solve problems from many different fields. To handle larger and more complex problem instances, considerable effort has been put into the development of efficient solvers. A recent trend is to improve QBF solvers by (deep) machine learning. One such approach, presented by Xu and Lieberherr, uses the well-known AlphaZero framework to solve QSAT problems.
AlphaZero is an algorithm that combines symbolic reasoning in the form of a Monte Carlo tree search (MCTS) with a deep reinforcement learning technique called self-play reinforcement learning to learn fully observable, symmetric 2-player games. It has proven its strength by defeating world-champion computer programs in the games of chess, shogi, and Go.
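To make the search component concrete, the following is a minimal, self-contained sketch of UCT-style MCTS (selection, expansion, simulation, backpropagation), not the thesis implementation. It is applied to a toy take-away game in which players alternately remove 1 or 2 stones from a pile and the player who takes the last stone wins; all names (`Node`, `mcts`, `rollout`) and the choice of toy game are our own illustrative assumptions.

```python
import math
import random

def legal_moves(n):
    # A move removes 1 or 2 stones, but never more than remain.
    return [m for m in (1, 2) if m <= n]

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player   # player: whose turn it is here
        self.parent, self.move = parent, move
        self.children = []
        self.untried = legal_moves(stones)
        self.visits, self.wins = 0, 0.0             # wins from the parent mover's view

def uct_select(node, c=1.4):
    # UCB1: exploit high win rates, explore rarely visited children.
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(stones, player):
    # Play random moves until the pile is empty; the last mover wins.
    while stones > 0:
        stones -= random.choice(legal_moves(stones))
        player = 1 - player
    return 1 - player

def mcts(stones, player=0, iters=3000):
    root = Node(stones, player)
    for _ in range(iters):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCT.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: add one untried child, if any.
        if node.untried:
            m = node.untried.pop()
            node = Node(node.stones - m, 1 - node.player, node, m)
            node.parent.children.append(node)
        # 3. Simulation: random playout (or exact result at a terminal node).
        if node.stones == 0:
            winner = 1 - node.player   # the mover who emptied the pile won
        else:
            winner = rollout(node.stones, node.player)
        # 4. Backpropagation: credit each node from its parent mover's view.
        while node.parent is not None:
            node.visits += 1
            node.wins += 1.0 if winner == node.parent.player else 0.0
            node = node.parent
        node.visits += 1  # root
    # Recommend the most visited move (the robust-child rule).
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts(5))  # optimal play from 5 stones: take 2, leaving a losing pile of 3
```

In this game the losing piles are the multiples of 3, so from 5 stones the search should converge on taking 2; AlphaZero replaces the random rollout with a learned value network and biases selection with a learned policy.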
The key to using AlphaZero for QSAT is to view QBF solving as a 2-player game: one player tries to make the formula true by assigning truth values to the existentially quantified variables, while the other player tries to make it false by assigning truth values to the universally quantified ones. The main idea is then to use the AlphaZero framework to learn two (ideally perfect) players for this game and to use them to determine the truth value of QBFs.
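This game view of QBF can be sketched directly, assuming a formula in prenex form. The code below is a minimal illustration of the game semantics, not the system studied in the thesis: the quantifier prefix is played out move by move, the existential player needs one good move (`any`), and the universal player needs every move to work (`all`). The function name `qbf_game_value` and the encoding of the prefix as (quantifier, variable) pairs are our own hypothetical choices.

```python
def qbf_game_value(prefix, matrix, assignment=None):
    """Return True iff the existential player has a winning strategy,
    i.e. iff the QBF is true.

    prefix: list of (quantifier, variable) pairs, quantifier in {"E", "A"},
            in the order the variables are quantified.
    matrix: a function mapping a complete assignment (dict) to a bool.
    """
    assignment = dict(assignment or {})
    if not prefix:
        # All variables are assigned; the game is over.
        return matrix(assignment)
    (q, var), rest = prefix[0], prefix[1:]
    outcomes = (qbf_game_value(rest, matrix, {**assignment, var: val})
                for val in (False, True))
    # Existential player: one winning move suffices.
    # Universal player: every move must keep the formula true.
    return any(outcomes) if q == "E" else all(outcomes)

# Worked example with the matrix x <-> y:
iff = lambda a: a["x"] == a["y"]
print(qbf_game_value([("A", "x"), ("E", "y")], iff))  # True:  y can copy x
print(qbf_game_value([("E", "y"), ("A", "x")], iff))  # False: no y fits both x
```

The example shows why the quantifier order matters: forall x exists y (x <-> y) is true, while exists y forall x (x <-> y) is false. AlphaZero-style learning replaces this exhaustive recursion with guided search over the same game tree.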
In this thesis, we reproduce and analyze the results communicated by Xu and Lieberherr. In particular, we train two AlphaZero players for the game of QBF solving and compare them to several variants of standard MCTS and to a state-of-the-art QBF solver. We describe how to adapt the standard AlphaZero algorithm to the task of QBF solving. Finally, we discuss how MCTS can be used as a preprocessing tool before applying a standard QBF solver.