From 34d563c1fc73b8db9f9cd32fa375b5a871951044 Mon Sep 17 00:00:00 2001 From: Mallori Harrell <6825104+mallorih@users.noreply.github.com> Date: Wed, 17 May 2023 15:42:15 -0500 Subject: [PATCH] feat: Create spacy notebook example (#593) * add new notebook for spacy --- example-docs/fake-memo.pdf | Bin 0 -> 13374 bytes examples/spacy/README.md | 17 ++ examples/spacy/load-into-spacy.ipynb | 249 ++++++++++++++++++ examples/spacy/requirements.txt | 2 + .../test-ingest-against-api.sh | 4 +- 5 files changed, 270 insertions(+), 2 deletions(-) create mode 100644 example-docs/fake-memo.pdf create mode 100644 examples/spacy/README.md create mode 100644 examples/spacy/load-into-spacy.ipynb create mode 100644 examples/spacy/requirements.txt diff --git a/example-docs/fake-memo.pdf b/example-docs/fake-memo.pdf new file mode 100644 index 0000000000000000000000000000000000000000..98db73b2390b26581f3249bb6c8abc9767cdd6a8 GIT binary patch literal 13374 zcmdUWbwE^G*Egkf2nf<664FBq10vnsT@EmGcS<)1D2=qz9g3uMh=eoP%>i8OTmS(93@2wtD8vTC9Wmjh zvf>PoFnIo&X2Wv28bXc$Aw)W8?>i^mC8KER~~+mT!u{ zQevBx5AXxGqNgBRcv7Sz?=7Vc;#gvjGgZ9l7agfp51y!Y=5yKvxW;PK-88Il<;E|D zIw&wo((AdV>>o=c~QBosBn|hxbTp5snq7U&3T^`!lEV=-#`7O|3Lo z8|<{c7}wFtNAa>yisr&hj%m4<1Ezb_j}2pzyTW?oArsSBGdvSFl*h$_j~}ghU_fn6 z{xB5m3T|EY->j_WVGlE_62$EG=m@oS25`V`s{q(kpiXu!j>b?Y0M}1~sGY4d?7kBK z?j^7e@=y~Ch=`p#KovYHw@fE!N6U{h0r$@$geU$S_A zlcfODkWCowy|ysd1+a-hT`i2EDv~09`Y#7u+kflfzk79hqMMrP6jA746S!#zpBg}W zOzMda_`)Z3A4NG_+D{27B#6#c?2ZKCJq8UEwU3CBq149WgqVQ=Mic0VcuHNZ%3|Ux zvr$&)d!?YGxzy`!?fI#-_Y+fQdp;BO2ufE0C~eMKh<;qzT7t$O$bpE^#F9^fJ3 zXd!-FFf~PyJQG72y=vl5Nhy%(sy^9e^)qRhVQx{Pa~i+#jiUyO(%ePB5Fg`y`kan< z1u0&&x|4+C2Bq&ffq^50or5SFJ*R|vsiWNbB!88up*J~+ZaJG&dw`^oD{!I2$Bf+Gwsu0yv|?pi2VZ6e*!8u9 z@Sdi8+P&1^%Oe)_e&%bD2f^#= z%IopyLE+{N!B73QVp&Zvf_%pod>?Srqmn#Nw<0umF+}hQ=2fbbM(wfVM-Y94Ok*qP ziR81Ti^q=9)7)rY8dvTV!{Pa3?ScwZFT!>~PMBJC^#m)%#E~|KnIv~U$hs8?KNi(F z3xms_6^O+A44d4K(;j&afDq&-g@>Go%qoRsA%s#TEHH_HA6p?W}G95GzQDG$>LgY@- 
z@L0T;Lgzzt1rHbli?T(P_>Uhslh&i@2VP{M4gt9CUL*Ph-AhIqfnZM}yz`r?;3G$q zZ2R<0%>DjrRN{{_3o>;?0yyv6h!@EA5txGw+g_mw^`IdSOK72y9f*mD$z-Mrm?7g;#{qAvDA)Gz=*<{%X>e)cSP%bWIOm z^*%9?bbO#KWT03^lL;`S#eFHL6#tlxP*H|lTx3dtSLTC|hmeQpTNUgA+fl8e zJeh`011TiIZ!Sk(ZFC+}tv4N`LusM9E00&mS4dZ?`&90>0iT_5s*t>mdKJ|W6?4S3 z`eATsaE80w!r-QnA>o|SmeD(bI>9snF(IZ%PTg*da4fyZrbuxVXh~-Yv7EBJ9Ahnx z9?Tzf%NR^I9V>$-nuEG?lcOl!R2BG>NMVM3QQo+MBwPy2jOA zW3Gc)i=wllvsP;WOi0=7S2Lxpj^xcQE!)Xw85VdmTg+c-Tu`lLma9-bK%oDURTnJ$ zUhloCn~0(;n--g4x$<7NfO2MO=9?DD7G2-Ik3)dzXOY5btp(W;*)5O^NbZ91qoTOt zVa8mYT)r+I<97}b2LemFbQR(0(l4c>d*xLpN9WX@)eqJTVp*-4({UAp8oF~TzG`ci zNQ`Ikwm*AL<_3gmXIS?E1JN$Y<={ z*7`!-z(Ld!D!mtd<%o3te7g1k!vOX`E+!W7KJh5a1IJ;9g*tM_I93zZC_{Jq>{0cZ zge9G=i+=fu=rz|4n~vE@{;BJ_OD0@d+&Ia|?)dyyYovpi24P9%nsJ)t?_Zu8?OE>S z3p^1B7l;-hXm)Kb_F+4NTq9k(Ud$d%?5|x+Tq>gQpp>AmW4u9;K+!-Iz~n&3zvF_^ z-ip}T5J1KD#gImCln6=WO4K#@DLM#C|Ne#316F3P61r`VI(7Hw?rtb7M&e?k9Su@3 zJzshxBcq9AiQFW5B|F8(B<01u#FNBjleCz%tCuCBbfN%x{6th__q}Lb9Scvt%*uu? zZ7*Uh?Xq&}C01`wJJzlgei|%`HEh*(V5B6YqL|0K2`6YPT#+j~HvHhWf->={A@uU) zS{Kt=^nv~v^ePS|DX1!FSS+bSKipWlO!~g`5|W~D?KhVy!$be&!uZ(uTn5g<;lj@r zK{aPJBOkt@w&<5#l}irYcGus&i+T|hZzyr*emFzndB}UP*nLK|mx1+k$|J(3 zXoLPb^W@!-PJ~YO_}B69cW<~YyyU#*H~dD@DF!H51k-N3kE2dfQmgySWC!C0?!Qjs z&2)QyDAE;t5E)J!`b5)NpuT4LOL|#F+1w}pq;i9U4~~U1O_x{dNyTPmy!w1T={`+6 zA=lC4lnoh)hTG3AHfr~5nl|jKuM?L_Uo_wGS-a4`WScvw_gQ&ylTL;ldG}(Mwb|>& z;4A-z&)oGeY9X$HV818NwfVL2_EHW#h9YA|kzncd*x7_>Vbk$;*2J^H>M*mIkUCMe zn}qXn#r8Xcg@a5IH4>YWrILeD=TWnsmB*d?pQ4gRmV#Rte!M$P7%t9y?snN^X4KoU zZ#BG?A}BLg(9C-=eqys`vc5dxX0ex-A2CEt$3!GkQpuL=8b-J z=uT&2x;lRvbaeU8SQ{S9!ed}q+$$|ADgtqWngHOjstQ2=SD4KH&v;zb+0n(=Sk2C&Hj*x@m@3Owoh6~w~{zoP4(QT)$Q;?__b0Glc-?6os?v4IElY)XJz8L*gL z78ZcRLSHy5_Mc%qER(x!-A3IQ?6*-iEHM9F61@NaC`sv7#7_;`l*|C|2f7u;{ue&K zXP5t0pl8}@eE^?x@VJX89sYN8Nlh_#*B&z#TcSF$G$akP@QHMP6Vw%`qz zo5Fk$3gZRyrN4M1#vi`)FIVLGL&z@;0Bl;?Ixre69fTeK^z(mC(`|mors`tke4B-; z!36%a1N?LH^QJKTCBh^HwRVL%TNpzWpfIIw6GJhmld+?Py|bMo20L5>n8S(N!geVv 
zY+(;44>4A;vw_(Diy~p+=;SPF4siruu*WWIBw;@xx=q`K(L;0ec^X`IbobQ zIbjqa?92}Oc?Wh42Em^WPW_Ab{4(snJ|J+T{A=QW8bSeL12f5gFqAdK47Onan{oI^ z5!h}4hAs;)h#kPf#lsF@=j8#zRyHsgwh0G@i};44*0VR0K^Xd*+mM#1Lx{$ zc57<@7Z;2!4{`t5f=v_%<5ol1rWce0rWZ_wzbv-DZ!i4Z1A}P~WCy}T{jntm;(>X{ zf8A2}tm9>`r|BWMwyWAuJXUsc{ru}}({!k7tW9PYCbwCqU6G>;-mV({wcX|$*6-7PmvNh2Q>t5=piA0;6>P4832%h8j;>%?JN0rG z!H?&nZiY`s8tr&OwjL2HG}ZD7G=yxelPT)+6brBoUuSr+&Ur~UDO_U?B<*^POsyVy zlT4tFd3pVC+~>EZT9pXRy22*N-BEV0$Mv*bV58F7BK^Ve-b#w~q`#SOqoZs`_ok1{ zjO<=JaV-ZcUn>9OJ@a>EX4*q%7GGPepUjtp0KzSeGz2BWMF91e;GJt?kWjs&7o zVwp~@IjnvBEI%U`nJMHNd9XfemvO3K%TB<$Ph_lck#5}GI$e&-QJrN>v>7nbj%Ec40zw6YlM&x{J#?@QM zBEa1u_r9w`%a61p`9UB#)2sD{M_@N+4~&-`Qure`0rB#?{-a-s2nTfl8QT6eM=k-Q z7dn6)Px;vC>6ddTg%lh!XkT_UL)EVC`@OqI>2*Jdtr*Sl`TZaY@Wbt^F|>whM^P#8 z8y5yNF9Z6Zrt&+Gc`K30`NU(oAmrF7MKshIw3s{G+B4?<`zyL~iPxFjS6d>K)5yFx z=u8e^k)$*VT6{-lWE~_A;fJvWHGg6@^R2XvssBawJ%?y; zZb9(I)S5kocufPSEJvI7aiXf)1P4V1w_h&hXl`!jV>1HIM=d-2PZBV#8VWIA%YUeP zKVu7=MVnMuHL_iW7_1sU|B6uoz^nuYco1r`6PZ!VR?x^U$RM$E2WDPQ=6WY*t_l*m za*?Ida}8RRoBQT%jqqoIH!hFufO;G{QMj%sd~;Kk#-0rud-B*$ndg4z^US=TWYf3O z_bL-hadmHoFRIIal;LJc`X@YCV?#G=ZN%PYULg+ZKw1>+F&B;cig>?lQTVgskYxjg zeb`oA7v6Dx;=N)_L`$wQ%~b3smP70;AoBwxiB`-f_C30S8%&4E{Hqn++gH6^6Yhnh zJoJrrrw$9fv#I+~^VRk5;3~F?NTK&`y8c?o;mol?{#dbHN|PF+MwM<8fIvoT!^%a* z52XRE>T8d#*6}8^`A=H?BZ*mB{Uqn@1XcKCeME^ib=#3d#(alfTM zUl%9G5J=e)ypOUVJER)!;}VIDKup4gV9^-cC99=RhWpXrK}|2AO@c$l-C3L*omGJ$ z(JXo)MgI&$e%wtxiit{Jgd_ggiinJ1tIcVmYJ>J@J9ehahf-SNrxee;v$zWGrrs&_ z8wlJ*04xdLT|9GamXUMcymG1Me}xCr`eT zVV&T_ytwchV8+H}OOJbFe02FKPRirviha?nZ`j=DD65vZ>PFc&T8;Af;zKm^hqmdHgl`4x0Dh}?eGathLqjJgZ(7a zk#)(cnc60wA^6k1K}<6;lz0ZQhCon=VCVngFau=e!~96_{$5!W2C8?KKl!wc`MeQt z)+pu`I!JrWKgwZcypE$QfxRrs0UE_*tVgoc_dv7jHxknmw5<1)Rv#M)3?z$oJC13_ z=R17Tu-jS-eeiHCg}g(VBT>dHgvEjNoK?u!ANp9X!w%wUHrt0Z^{wncIRPZix+LnVbO?4GW*(V zl%<K0sFv6HVplNT>c;3Vpe_$sHG$c$wHOWZp(qPh>WET5=2v7NWv3NY&b= z_+U)4eT2!@x)ukCe-Pt>wNj0N9?^fNrbjWDPNieZ4fv?<$XV5EBun1RLrs)a^$BG-{bE31Hz@mW4DATDwxK_{Il8ozhnl8_$85(8WeRrrTx)% 
z-KHxrb+%CGks;{Ze6Xjth)U+$Dp;laqYFGluTB{{bOYU|Gr1w}7Kt$is9)HaD+W*H zUz8!%gKC_uly>umm5@INXb7zdp9WGQ@}pm%tTV3P!8^AlHOpMY*a*PK{j$zEyU7_X zFw&=p?Q15?ly#JKDIDXMf#QSYeE%3J%*gn()`sUziVM><{+9!an#I@jcG2x6<{}PR zCSAv2;6SqimoWO{uisJvGDI~#OR_mG=kT+pygVdT9%;TjUZYNka=Q^(YVZ~}C_u@l zmr2Mm-=c`+_r3aT589d+?3=&T4+JL_jk#OAXCB9qBA_n`I5IEw{s`sRINKGyf#NK8 zE=WtYX`fug1Pk8#f&QeqB04~BOn5TS3(rhc%e8hg}I zn=DJ^F*7VFV6HzBv@+E;8J_*je*Hz|2h1x$E7`z>>;_6+=ke=T+3%JeRrYa3gK^tWo;_vKoz``VyBIrG*vwy? z!b$*Fw`{7a-TCR1+znVXTkcjC8xvH^Wt3A8772Q*qV`sbG=SxDm{P@BTgA8tg(^-* zfLb^KpV=^AUV8r?x&U>nB+XeEjQ;HrzPDIFj1-L6HS{`HJ)8twB7k#a=Mh5Xt@j-v zl`H~PQESQH;p6t$>v((+b-!V+MCkZ)faw)csRABW9@yE1=6NY?i0skCs@kdLRgMbp8yuVUg`qaLxd*oQy}%GrbD8cj2s z1|{S;N<0!F(!Ok(d~&C2_Z`UwLww3fA1U`|l&O-2TC^5(fz9=?bqvVwsaFkl7C+>< z>XMh{7GNi2BCD$VN_FEb>9ionCL4OqR$*(;Jb9c>uzn%i${40)Cd0Tby{x;*&28_Y7UzdGRYQ-&> z{Y!wNv0|+uu}Y$P8QMEUJd}E;pbsn>VlCow+=1+LGgfbDSfXF{PZ}_xkNTyzy9UOQ zMvYcP(@8Ia7LRYz`vgiz8+|1G7igu)7+*Z|Ka6)z1(Xo^TtW zqq-QrOU9-cFd4nS$?e(U44$g_c+yJe2?;fb&ntLyXfMcS8~e$_AMgBWx^99RW#jz< zW!>WO_W`Gc${#}?f3p%tq|&0w@x#4si4yKV#5lvdfuPm|p z<4)m=e$dn3e3u_my>YKs81k%-V1wJ#R8f;Lf6Y-PF^%j|VT$Ebjk!!2InBl4L&9^) zbG#c~Zz+>gyY`o(M*cty2a6i@t}ZVyj9e-& zl6>7_RVd~qvT>g^Lk`f-wtdQDXY3F%^9AqnjEZGTMHJQP5xqoPTpVA>z*j4Mj$Nu9 z61hp6UeD&IT@#0wj%C|lr&?Q8M+&3)Lg{l3O+->lyFL}kE4{90qa&{fA1v_jduM{? zPAsOR@nT4c{Q1>I=|Q0ZI9rpzleW1wQpo%64oR{IQSeLI18>XEJ0C^#%y$F2U)t+` zWF9Ti7p;hmPyrOk$~lTtRMomWF(^Coo6BOEosuH^s#T^mI+cIsvaJlZLucbQI1nT6 zDluY5bu485^5iPKV&tQMTty(?yULWewwycpPA$l>|N1=L1I^A|$5R#=H*2-V!ay z8DK6e?|x+3GKF6$H^_Rh)5SHPkrwU+nnu>yIa_b#y}yPue(2f}2sxRu=`gPK8(^}Z zZG2Z~G{lH9BEqlU>1p1%G_AZj{uGN3SAS)p+T+ZcQXnmKa20-8Lb{#Vc1|*Mlw~@b z8ah~{4G5_`l4)U|v?X+p&Dm6&Kpogl)kdueE0A2-Jfc?z`03& zBB5$KXwYJLGKgWh!&588CPGLaA;85t=^et~a7uC5>VT~)`mkHGwXva7Ot(L;4dAb| znrAj2O;i4~VcaN(#Z(pYtg7HpyfJ@QV^~9Z5C`p*0s(bI(}=CALg8YT6;n^>OkrNL zYMr$13QnVhcg@C8X_lO&rdzVk-7Ih{*#j{T>q;g%dQoY&?!|s{F-J0DpQc%*7?wBd*A<5Yp1D-ae;-->rjeG@GM&t)p1-CeUK&b?sp+1 zdK&ux-)l!p(x#AXr&l`ZDN5-nI%6s51Lgs&@&yBO|HEI6L@xP?YR-cdoyrs|@x!-? 
zI(FM>;P;s76VHc|$RavJ9Gl0_-JQ3WrF>kNNhx$oV_uU-I%WfB_D)X+i7=|gTUvN@ z3(_;>^UV||EX_HkTu*Ql-3p96tUv6T)Gp@9dfi~x+)oWktZmDaZZjBW`Vb+xg z6PJvVn`F9DS@R#V;E7zc)qKX6hV>K>*_W%Cssl0KbNs z8Bdn>`zQH(d1^s-h#>yrTF>qgY5ki^Wo@eDwC@sk&4|Mm4}mMwj@3qt#@}chiD;R0 zmTQa5Tv|hm><3+$xN&rt?o3OXuYNn}&Xld$eG;F56C;z|CdKviE3tzFZ736k8)jsx z`;HK+fcm}5wZyfHrry^hhVhM58#76}yE0>#y+4u(WZ5Nqc8*;(_+GOcLb#@&#KkJU zAS1U<*sWUtyLdGs#&SXkYY9A8&yr|XI*V1nBW78l_&FZU%<&BkI zkxBumF789(mBbD2j|PAc@Su0*d5m;s@;d!jWvcGw0T;#n8E0f^}s8Y{<_l_x0L;YD$W^71Q1&?=U71Q(B zm*J}5a~akjhjThvuFMDo)_SU22+yky@(YlWntZ~FXsBEz&Q2dv*_;>EqV6OrzZn1G z^8n&C*wH$Ysei7&P+7Y+G>>#sJ!|;Q$2*#+IrP-<{dIC>467mjuE8wcocCcN0a=?z zRBb>JdC>CrA9sD+Kn84#$@NpBL(s(KJhpDv>8+1+5?)8`;@D;U*q;@9J*TlZgz(-O zdmg*we29b~o)Ik*W4f|4Pj*&3BW@4S$AULXVJD*8<9k8lk1cV1#3^h_7LODj%+vSf z>g8HfIXY=d_1|r1({SopCg$rXpFKBNs9aJl^w&z#aFa6I_eezXr;|9)<}LF@Fnn#W zlXZhKf?(X|yXp3vO~ZO~;>y4}zbDo`Rz3$QE#xaF8V2gVuhx5r*&)zXj(GB(cej>U zaC6Yk>-$&TNCI06-%7~DLMfZ79G=isahmNl?`^+5GE<;<5_wZ^e1jYiky-gqQ1qXu z)c-9KB>G>FAPzWQ0z-n-?9^>7ZcA2S_|7fmH~a?%BK(Q}z`&5dz(0QhKyJtWg3tcg z=ZAwT|8<*RKeoxX1ArIgm0|Qr)&a>dIyNU#!7+}ivlsc9maf!6zpiuB$sj~-fvx58s&OFNn_$4>hrnom`i*}2chRQ zpd`hOhyQ|SQFP2;>Je({!m$TMDu*?L&q6@FFuL)_CABiIcN?9oXjR#3bHswn*Lnpl z9o-ptGPxqof(d%Zb#z;?ET13Os7TD)N&wD%#}T7O;u@VLCX{8S5^r1>bz0oZm$_HJ z&6-!!e_WJA!4O`^gg5A#b{w z(Gf&Qg)J!T@eg{fHW=={*4X3fX6go&c4x6j&@iKjmWv@X~Tv!j$yJL~$<@$2- zCIvm^qphkQk|oK6JY^ibFOye3pNO^|lK4_SRqc~jd2vL_Fpc6EXor0B%$>qoxIR0( zjpaSkn;n_bN4=T-pYnKsWeNqREKDgWC3FlCuQ|7tinIr1BJv(GeS3Jv=(#tl6J9d1 zXjtfL##aUY_^FWn)QTX)&JwZ^lW#Qd=*zUC^iUQ0oH~BI?&Mfv#Dy+KbWk|ZmLY^D zitpHS2o|pTi7iO!{K$!~N_%Yc-H#fK&N7N1IzOlGJi?1!Sip8n zesIa?Eir@fa=?p3poP#A+w^h7-T|MfC3^TGW8*OGF-uvnCK@(feh%9knqBJ;Ss!o& z-OhTuDFm!P>Q=PkV)HwKd|orKzkQU~F&PSnLT0 zkb@HlV&`TDbMSBhb$~#6*dMH)yq(Ga+~v2@dUr>tDGbZ!;KBg@`2ui*K%5|eDc~0k z1cRq6lAbj0|zy8$^2rrQT8;z3-hU5Q@#>D}z27$l+918=*!T*rK z1$)@P^#cOI@Gp+Pj0JLn;NQ7_qj7QngC8e1$G@TR!eIKpjRnK2EB;2~;ez?V-)OwB zYKXtlxPah)=*Pv)1*^^Y>sT&ccyZQWXzc8qT>qp2VT=FY`*8#RLF0f|cKq4T*%1P( 
zNpQUVNsg+8ClqFL*l#o7RbjBwhTF*Iwr+<5?$)rtM_LS4#`2UG!o$PCW6BFtjh7b) w0dkv|@_=}`O^n!qP!p&LmjK3p@A5Mea&m@w_^r3GbAUNP7__wFiV_(A2XLn4VgLXD literal 0 HcmV?d00001 diff --git a/examples/spacy/README.md b/examples/spacy/README.md new file mode 100644 index 000000000..d3c3f7ae3 --- /dev/null +++ b/examples/spacy/README.md @@ -0,0 +1,17 @@ +# Loading `unstructured` outputs into Spacy + +The following example shows how to load `unstructured` outputs into Spacy. +This lets you use NLP to find important data in the outputs that the `unstructured` +library has extracted. +Follow the instructions [here](https://spacy.io/usage) +to install Spacy on your system. + +Once you have installed Spacy, download the `en_core_web_sm` English model, +which the example notebook loads, with the command +`python -m spacy download en_core_web_sm`. + +## Running the example + +1. Run `pip install -r requirements.txt` to install the Python dependencies. +1. Run `jupyter-notebook` to start. +1. Run the `load-into-spacy.ipynb` notebook. diff --git a/examples/spacy/load-into-spacy.ipynb b/examples/spacy/load-into-spacy.ipynb new file mode 100644 index 000000000..d47b90ec2 --- /dev/null +++ b/examples/spacy/load-into-spacy.ipynb @@ -0,0 +1,249 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2fac3543", + "metadata": {}, + "source": [ + "# Loading Data into Spacy" + ] + }, + { + "cell_type": "markdown", + "id": "30bc0a1b", + "metadata": {}, + "source": [ + "The goal of this notebook is to show you how to start a Spacy project with `unstructured` elements. This allows you to build your own NLP projects on top of the extracted text.\n", + "\n", + "Make sure you have Spacy installed on your local computer before running this notebook. If not, you can find the instructions for installation [here](https://spacy.io/usage)."
+ ] + }, + { + "cell_type": "markdown", + "id": "ac83c096", + "metadata": {}, + "source": [ + "# Preprocess Documents with Unstructured" + ] + }, + { + "cell_type": "markdown", + "id": "a29ef57d", + "metadata": {}, + "source": [ + "First, we'll pre-process a document using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects whose text we can pass to Spacy." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "adb6b8f7", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "from unstructured.partition.auto import partition" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "8464299b", + "metadata": {}, + "outputs": [], + "source": [ + "# NOTE: Update this directory if you are running the notebook\n", + "# from somewhere other than the examples/spacy folder in the\n", + "# unstructured repo\n", + "EXAMPLE_DOCS_FOLDER = \"../../example-docs/\"" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "2fd24424", + "metadata": {}, + "outputs": [], + "source": [ + "document_to_process = \"fake-memo.pdf\"\n", + "filename = os.path.join(EXAMPLE_DOCS_FOLDER, document_to_process)\n", + "elements = partition(filename=filename, strategy=\"fast\")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "0aa45e81", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'May 5, 2023'" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "elements[0].text" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "2429f8a5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'filename': 'fake-memo.pdf',\n", + " 'file_directory': '../../example-docs',\n", + " 'filetype': 'application/pdf',\n", + " 'page_number': 1}" +
] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "elements[0].metadata.to_dict()" + ] + }, + { + "cell_type": "markdown", + "id": "1fd556ff", + "metadata": {}, + "source": [ + "# Extract Numbers Using Spacy\n" + ] + }, + { + "cell_type": "markdown", + "id": "bdf2cefe", + "metadata": {}, + "source": [ + "Now let's import `spacy` and create a function to extract noun phrases with numbers. First we'll use a simple example, then we'll use the text extracted by `unstructured`.\n", + "\n", + "The function first runs the text through the Spacy pipeline to create a `Doc`, then iterates over its tokens to find numbers that modify a noun. It collects each number, its noun, and the combined phrase as a tuple in a list." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "bfd20f75", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number: 10, Noun: apples, Context: 10 apples\n", + "Number: 5, Noun: oranges, Context: 5 oranges\n" + ] + } + ], + "source": [ + "import spacy\n", + "\n", + "nlp = spacy.load(\"en_core_web_sm\")\n", + "\n", + "def extract_numbers_with_context(text):\n", + "    doc = nlp(text)\n", + "    numbers = []\n", + "    \n", + "    for token in doc:\n", + "        if token.like_num and token.dep_ == 'nummod' and token.head.pos_ == 'NOUN':\n", + "            number = token.text\n", + "            noun = token.head.text\n", + "            context = ' '.join([number, noun])\n", + "            numbers.append((number, noun, context))\n", + "    \n", + "    return numbers\n", + "\n", + "# Example usage\n", + "text = \"I bought 10 apples and 5 oranges yesterday.\"\n", + "numbers_with_context = extract_numbers_with_context(text)\n", + "\n", + "for number, noun, context in numbers_with_context:\n", + "    print(f\"Number: {number}, Noun: {noun}, Context: {context}\")" + ] + }, + { + "cell_type": "markdown", + "id": "7eae9735", + "metadata": {}, + "source": [ + "### Using the Data Extracted with Unstructured's Library" + ] + }, + { + "cell_type": "code", +
"execution_count": 28, + "id": "7c738f91", + "metadata": {}, + "outputs": [], + "source": [ + "numbers_with_context = extract_numbers_with_context(elements[2].text)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "3459555b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number: 20,000, Noun: bottles, Context: 20,000 bottles\n", + "Number: 10,000, Noun: blankets, Context: 10,000 blankets\n", + "Number: 200, Noun: laptops, Context: 200 laptops\n", + "Number: 3, Noun: trucks, Context: 3 trucks\n", + "Number: 15, Noun: hours, Context: 15 hours\n" + ] + } + ], + "source": [ + "for number, noun, context in numbers_with_context:\n", + " print(f\"Number: {number}, Noun: {noun}, Context: {context}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dadd055a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/spacy/requirements.txt b/examples/spacy/requirements.txt new file mode 100644 index 000000000..87974f1d0 --- /dev/null +++ b/examples/spacy/requirements.txt @@ -0,0 +1,2 @@ +unstructured[local-inference] +spacy \ No newline at end of file diff --git a/test_unstructured_ingest/test-ingest-against-api.sh b/test_unstructured_ingest/test-ingest-against-api.sh index 4315ed7eb..63d9b09be 100755 --- a/test_unstructured_ingest/test-ingest-against-api.sh +++ b/test_unstructured_ingest/test-ingest-against-api.sh @@ -16,8 +16,8 @@ PYTHONPATH=. ./unstructured/ingest/main.py \ set +e -if [ "$(find 'api-ingest-output' -type f -printf '.' 
| wc -c)" != 4 ]; then +if [ "$(find 'api-ingest-output' -type f -printf '.' | wc -c)" != 5 ]; then echo - echo "4 files should have been created." + echo "5 files should have been created." exit 1 fi