jorgesancha/python_code_test_carto.md

Last active March 21, 2024 00:06

Python code test - CARTO

What follows is a technical test for this job offer at CARTO: https://boards.greenhouse.io/cartodb/jobs/705852#.WSvORxOGPUI

Build the following and make it run as fast as you possibly can using Python 3 (vanilla). The faster it runs, the more you will impress us!

Your code should:

Download this ~2GB file: https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv
Count the lines in the file
Calculate the average value of the tip_amount field.

All of that in the most efficient way you can come up with.

That's it. Make it fly!

blackrez commented Jul 17, 2021 (edited)

Hello,

This was fun to play with. There are many possible solutions, but I like these two.
I think streaming is the future of data, and I hate downloading big files.

Solution 1

import codecs
import csv
import urllib.request

url = "https://s3.amazonaws.com/carto-1000x/data/yellow_tripdata_2016-01.csv"
total = 0.0
count = 0
# Stream the response so the ~2GB file is never written to disk.
with urllib.request.urlopen(url) as stream:
    # Decode the bytes incrementally; DictReader consumes the header row,
    # so count covers data rows only.
    csvfile = csv.DictReader(codecs.iterdecode(stream, 'utf-8'))
    for line in csvfile:
        total += float(line['tip_amount'])
        count += 1

print("final")
print(total)
print(count)
print(total / count)
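As a possible variation on Solution 1, the sketch below does the same aggregation with `csv.reader` and a column index looked up once from the header, which avoids building a dictionary for every row. Whether that is actually faster on the real file has not been measured here. The function name `tip_stats` and the tiny in-memory sample (including its `fare_amount` column and values) are made up for illustration; on the real data you would pass the decoded stream instead.

```python
import csv
import io


def tip_stats(lines):
    """Return (row_count, mean_tip) for an iterable of CSV text lines.

    Hypothetical helper: assumes the first line is a header that
    contains a 'tip_amount' column.
    """
    reader = csv.reader(lines)
    header = next(reader)
    tip_idx = header.index("tip_amount")  # resolve the column position once
    total = 0.0
    count = 0
    for row in reader:
        total += float(row[tip_idx])
        count += 1
    mean = total / count if count else 0.0  # avoid dividing by zero on empty input
    return count, mean


# Invented sample data, standing in for the real stream.
sample = io.StringIO("fare_amount,tip_amount\n10.0,2.0\n8.5,0.0\n12.0,4.0\n")
count, mean = tip_stats(sample)
print(count, mean)
```

Because the function only needs an iterable of text lines, it can be checked on a small sample like this before pointing it at the 2GB download.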

I also think people should be lazy and use someone else's computing capacity, in this case AWS and S3. (I cheated and used boto3, since I don't have the time to rewrite an SDK.)

Solution 2

import boto3

s3 = boto3.client('s3')

# Let S3 Select compute the aggregates server-side, so only the result row is downloaded.
resp = s3.select_object_content(
    Bucket='carto-1000x',
    Key='data/yellow_tripdata_2016-01.csv',
    ExpressionType='SQL',
    Expression="SELECT avg(cast(tip_amount as float)), count(1) FROM s3object s",
    InputSerialization={
        'CSV': {'FileHeaderInfo': 'Use', 'FieldDelimiter': ',', 'RecordDelimiter': '\n'},
        'CompressionType': 'NONE',
    },
    OutputSerialization={'CSV': {}},
)
# The response payload is an event stream; only 'Records' events carry result bytes.
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
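Solution 2 prints the raw CSV text it receives. If the two numbers are needed as values, one option is to join the `Records` payloads before decoding and then parse the single result row. The sketch below assumes the output is one `avg,count` line, as the query above suggests. The helper name `parse_select_result` and the fake event list are invented so the example runs without AWS credentials; the values in it are not real results.

```python
import csv
import io


def parse_select_result(events):
    """Return (average, count) from an S3 Select style event stream.

    Hypothetical helper: assumes the query yields one CSV row
    of the form "avg,count".
    """
    # Join the raw bytes first so a value split across events still decodes cleanly.
    chunks = [e["Records"]["Payload"] for e in events if "Records" in e]
    text = b"".join(chunks).decode("utf-8")
    row = next(csv.reader(io.StringIO(text)))
    return float(row[0]), int(row[1])


# Invented events mimicking the shape of resp['Payload']; not real output.
fake_events = [
    {"Records": {"Payload": b"1.5,"}},
    {"Records": {"Payload": b"42\n"}},
    {"Stats": {"Details": {}}},
    {"End": {}},
]
avg, count = parse_select_result(fake_events)
print(avg, count)
```

With a real response, `resp['Payload']` would be passed in place of `fake_events`.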

I think I will write an article on these solutions.

Edit: I forgot the count in Solution 2.
