{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Scale Scikit-Learn for Small Data Problems\n", "==========================================\n", "\n", "This example demonstrates how Dask can scale scikit-learn to a cluster of machines for a CPU-bound problem.\n", "We'll fit a large model, a grid-search over many hyper-parameters, on a small dataset.\n", "\n", "This video demonstrates the same example on a larger cluster." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:40.954216Z", "iopub.status.busy": "2022-07-27T19:22:40.953750Z", "iopub.status.idle": "2022-07-27T19:22:41.018303Z", "shell.execute_reply": "2022-07-27T19:22:41.017611Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "\n", "YouTubeVideo(\"5Zf6DQaf7jk\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:41.021809Z", "iopub.status.busy": "2022-07-27T19:22:41.021259Z", "iopub.status.idle": "2022-07-27T19:22:44.310730Z", "shell.execute_reply": "2022-07-27T19:22:44.310048Z" } }, "outputs": [ { "data": { "text/html": [ "
Client: Client-7bcf1e8b-0de1-11ed-a455-000d3a8f7959
Connection method: Cluster object; Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4 (1 thread and 1.86 GiB memory each); Total threads: 4; Total memory: 7.45 GiB
Status: running; Using processes: True
" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask.distributed import Client, progress\n", "client = Client(n_workers=4, threads_per_worker=1, memory_limit='2GB')\n", "client" ] }, { "cell_type": "markdown", "metadata": { "keep_output": true }, "source": [ "## Distributed Training\n", "\n", "Scikit-learn uses [joblib](http://joblib.readthedocs.io/) for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation.\n", "\n", "Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your *cluster* without significantly changing your code.\n", "\n", "This is most useful for training large models on medium-sized datasets. You may have a large model when searching over many hyper-parameters, or when using an ensemble method with many individual estimators. For datasets that are too small, training times are typically short enough that cluster-wide parallelism isn't helpful. For datasets that are too large (larger than a single machine's memory), the scikit-learn estimators may not be able to cope (though Dask-ML provides other ways of working with larger-than-memory datasets)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Scikit-Learn Pipeline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:44.314191Z", "iopub.status.busy": "2022-07-27T19:22:44.313670Z", "iopub.status.idle": "2022-07-27T19:22:44.773748Z", "shell.execute_reply": "2022-07-27T19:22:44.773086Z" } }, "outputs": [], "source": [ "from pprint import pprint\n", "from time import time\n", "import logging\n", "\n", "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import HashingVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:44.777601Z", "iopub.status.busy": "2022-07-27T19:22:44.776949Z", "iopub.status.idle": "2022-07-27T19:22:53.129035Z", "shell.execute_reply": "2022-07-27T19:22:53.128390Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading 20 newsgroups dataset for categories:\n", "['alt.atheism', 'talk.religion.misc']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "857 documents\n", "2 categories\n", "\n" ] } ], "source": [ "# Scale Up: set categories=None to use all the categories\n", "categories = [\n", " 'alt.atheism',\n", " 'talk.religion.misc',\n", "]\n", "\n", "print(\"Loading 20 newsgroups dataset for categories:\")\n", "print(categories)\n", "\n", "data = fetch_20newsgroups(subset='train', categories=categories)\n", "print(\"%d documents\" % len(data.filenames))\n", "print(\"%d categories\" % len(data.target_names))\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll define a small pipeline that combines text feature extraction with a simple classifier." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:53.132634Z", "iopub.status.busy": "2022-07-27T19:22:53.132192Z", "iopub.status.idle": "2022-07-27T19:22:53.135995Z", "shell.execute_reply": "2022-07-27T19:22:53.135342Z" } }, "outputs": [], "source": [ "pipeline = Pipeline([\n", "    ('vect', HashingVectorizer()),\n", "    ('tfidf', TfidfTransformer()),\n", "    ('clf', SGDClassifier(max_iter=1000)),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define Grid for Parameter Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grid search over some parameters." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:53.140031Z", "iopub.status.busy": "2022-07-27T19:22:53.139614Z", "iopub.status.idle": "2022-07-27T19:22:53.143379Z", "shell.execute_reply": "2022-07-27T19:22:53.142908Z" } }, "outputs": [], "source": [ "parameters = {\n", "    'tfidf__use_idf': (True, False),\n", "    'tfidf__norm': ('l1', 'l2'),\n", "    'clf__alpha': (0.00001, 0.000001),\n", "    # 'clf__penalty': ('l2', 'elasticnet'),\n", "    # 'clf__max_iter': (10, 50, 80),\n", "}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:53.147810Z", "iopub.status.busy": "2022-07-27T19:22:53.145712Z", "iopub.status.idle": "2022-07-27T19:22:53.151018Z", "shell.execute_reply": "2022-07-27T19:22:53.150392Z" } }, "outputs": [], "source": [ "grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=3, refit=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fit this normally, we would write\n", "\n", "\n", "```python\n", "grid_search.fit(data.data, data.target)\n", "```\n", "\n", "That would use the default joblib backend (multiple processes) for parallelism.\n", "To use the Dask distributed backend, which will use a cluster of machines to train the model, 
perform the fit in a `parallel_backend` context." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:22:53.154020Z", "iopub.status.busy": "2022-07-27T19:22:53.153618Z", "iopub.status.idle": "2022-07-27T19:23:00.999089Z", "shell.execute_reply": "2022-07-27T19:23:00.998395Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 8 candidates, totalling 24 fits\n" ] } ], "source": [ "import joblib\n", "\n", "with joblib.parallel_backend('dask'):\n", "    grid_search.fit(data.data, data.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you had your distributed dashboard open during that fit, you would have seen each worker perform some of the fit tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallel, Distributed Prediction\n", "\n", "Sometimes, you'll train on a small dataset, but need to predict for a much larger batch of data.\n", "In this case, you'd like your estimator to handle NumPy arrays and pandas DataFrames for training, and Dask arrays or DataFrames for prediction. [`dask_ml.wrappers.ParallelPostFit`](http://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html#dask_ml.wrappers.ParallelPostFit) provides exactly that. It's a meta-estimator. It does nothing special during training; the underlying estimator (typically a scikit-learn estimator) is fit in memory on a single machine. But tasks like `predict`, `score`, etc. are parallelized and distributed.\n", "\n", "Most of the time, using `ParallelPostFit` is as simple as wrapping the original estimator.\n", "The only complication comes when using `ParallelPostFit` with another meta-estimator like `GridSearchCV`. In this case, as with any meta-estimator, you'll need to update the keys of the parameter grid by prefixing your parameter names with `estimator__`." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:01.002484Z", "iopub.status.busy": "2022-07-27T19:23:01.002047Z", "iopub.status.idle": "2022-07-27T19:23:01.227161Z", "shell.execute_reply": "2022-07-27T19:23:01.226471Z" } }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "from sklearn.svm import SVC\n", "from dask_ml.wrappers import ParallelPostFit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll load the small NumPy arrays for training." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:01.230642Z", "iopub.status.busy": "2022-07-27T19:23:01.230198Z", "iopub.status.idle": "2022-07-27T19:23:01.289344Z", "shell.execute_reply": "2022-07-27T19:23:01.288604Z" } }, "outputs": [ { "data": { "text/plain": [ "(1797, 64)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = load_digits(return_X_y=True)\n", "X.shape" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:01.292868Z", "iopub.status.busy": "2022-07-27T19:23:01.292448Z", "iopub.status.idle": "2022-07-27T19:23:01.297885Z", "shell.execute_reply": "2022-07-27T19:23:01.297267Z" } }, "outputs": [], "source": [ "svc = ParallelPostFit(SVC(random_state=0, gamma='scale'))\n", "\n", "param_grid = {\n", " # use estimator__param instead of param\n", " 'estimator__C': [0.01, 1.0, 10],\n", "}\n", "\n", "grid_search = GridSearchCV(svc, param_grid, cv=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And fit as usual." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:01.300843Z", "iopub.status.busy": "2022-07-27T19:23:01.300263Z", "iopub.status.idle": "2022-07-27T19:23:02.566998Z", "shell.execute_reply": "2022-07-27T19:23:02.565866Z" } }, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=3, estimator=ParallelPostFit(estimator=SVC(random_state=0)),\n", " param_grid={'estimator__C': [0.01, 1.0, 10]})" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll simulate a large dask array by replicating the training data a few times.\n", "In reality, you would load this from your file system." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:02.570655Z", "iopub.status.busy": "2022-07-27T19:23:02.570436Z", "iopub.status.idle": "2022-07-27T19:23:02.575081Z", "shell.execute_reply": "2022-07-27T19:23:02.574223Z" } }, "outputs": [], "source": [ "import dask.array as da" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:02.578295Z", "iopub.status.busy": "2022-07-27T19:23:02.577878Z", "iopub.status.idle": "2022-07-27T19:23:02.608811Z", "shell.execute_reply": "2022-07-27T19:23:02.608081Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
Array: Bytes 8.77 MiB, Shape (17970, 64), Count 11 Tasks, Type float64
Chunk: Bytes 898.50 kiB, Shape (1797, 64), Count 10 Chunks, Type numpy.ndarray
" ], "text/plain": [ "dask.array" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "big_X = da.concatenate([\n", "    da.from_array(X, chunks=X.shape)\n", "    for _ in range(10)\n", "])\n", "big_X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations like `predict` or `predict_proba` return Dask arrays rather than NumPy arrays.\n", "When you compute the result, the work is done in parallel, either out of core or distributed on the cluster." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-07-27T19:23:02.613227Z", "iopub.status.busy": "2022-07-27T19:23:02.612777Z", "iopub.status.idle": "2022-07-27T19:23:02.630320Z", "shell.execute_reply": "2022-07-27T19:23:02.629613Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "
Array: Bytes 140.39 kiB, Shape (17970,), Count 21 Tasks, Type int64
Chunk: Bytes 14.04 kiB, Shape (1797,), Count 10 Chunks, Type numpy.ndarray
" ], "text/plain": [ "dask.array<_predict, shape=(17970,), dtype=int64, chunksize=(1797,), chunktype=numpy.ndarray>" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predicted = grid_search.predict(big_X)\n", "predicted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, `predicted` could be written to disk or aggregated before being returned to the client." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }