Calling JavaScript from Python to de-CloudFlare scraped content
Yesterday I wrote a script to scrape my own web page because I screwed up the CSV export feature and Product needed the data. One problem was that the CloudFlare CDN obfuscated the email addresses on the page. My solutioncrazy hack: running a Node.js script to de-obfuscate the email from my Python scraping script.
Example obfuscated email stuff from CloudFlare:
<a href="/cdn-cgi/l/email-protection#d4b7b5a6b194a4b1a0e7e2e4fab7bbb9">
<span class="__cf_email__" data-cfemail="0162607364417164753237312f626e6c">[email protected]</span>
<script cf-hash='f9e31' type="text/javascript">
/* <![CDATA[ */!function(){try{var t="currentScript"in document?document.currentScript:function(){for(var t=document.getElementsByTagName("script"),e=t.length;e--;)if(t[e].getAttribute("cf-hash"))return t[e]}();if(t&&t.previousSibling;){var e,r,n,i,c=t.previousSibling,a=c.getAttribute("data-cfemail");if(a){for(e="",r=parseInt(a.substr(0,2),16),n=2;a.length-n;n+=2)i=parseInt(a.substr(n,2),16)^r,e+=String.fromCharCode(i);e=document.createTextNode(e),c.parentNode.replaceChild(e,c)}}}catch(u){}}();/* ]]> */
</script>
</a>
Using jsbeautifier.org, I adapted the JavaScript from above into this Node.js script, decloudflare.js
.
var e, r, n, i, a = process.argv[2];
for (e = "", r = parseInt(a.substr(0, 2), 16), n = 2; a.length - n; n += 2) i = parseInt(a.substr(n, 2), 16) ^ r, e += String.fromCharCode(i);
console.log(e);
Example usage:
$ node decloudflare.js 0162607364417164753237312f626e6c [email protected]
I used the Naked library (thanks to Sweetmeat) to call the Node.js script. (Though probably I could've just used the subprocess module.)
$ pip install Naked
from Naked.toolshed.shell import muterun_js
def decloudflare_email(cfemail):
resp = muterun_js('decloudflare.js', cfemail)
return resp.stdout.rstrip()
cfemail = '0162607364417164753237312f626e6c'
print 'cfemail from python: ' + cfemail
email = decloudflare_email(cfemail)
print 'email from python: ' + email
cfemail from python: 0162607364417164753237312f626e6c email from python: [email protected]
Related posts
- How to use ast-grep with GraphQL — posted 2024-09-24
- Next.js App Router (RSC) projects w/ open source code — posted 2024-07-30
- Next.js Relay GraphQL Pokemon example — posted 2024-05-22
- Example Node.js Passport.js SAML app using OneLogin — posted 2024-05-10
- Aphrodite to CSS Modules codemod — posted 2022-12-09
- Simple codemod example with jscodeshift — posted 2021-05-03
Comments
Is it possible in C# without javascript?
disqus:3586963801