15 Pandas Functions to Replicate Basic SQL Queries in Python
This article was published as a part of the Data Science Blogathon.
"As the different streams having their sources in different places all mingle their water in the sea, so, O Lord, the different paths which men take through different tendencies, various though they appear, crooked or straight, all lead to Thee."
Swami Vivekanand
As the wise men like Vivekanand expressed, there are many paths that men and women take to reach their destinations (be it the search for God or for model accuracy). All of those who have learned data science and related streams will be able to relate to this: these different paths start with different languages, different tools, and different expectations. During this long and arduous journey towards excellence, some start from SQL, and others learn Python. But as we all know by now, in the long run knowing only one of these will not suffice.
SQL is necessary for talking to the database, and Python is the de-facto leader in taking it all the way through the Machine Learning journey. This is my second article in this series to make people more conversant in both languages. More specifically, those who are better off with SQL and are learning Python now will be happy to know that almost all the functions and operations performed in SQL can be replicated in Python. Some of them may even be done more efficiently in Python than in SQL.
Before we can start looking at the equivalent SQL functions in Python, let us revise how we linked SQL databases to Python. You may go through my previous article, the first one in this series, "How-to-Access-use-SQL-Database-with-pyodbc-in-Python", for a detailed understanding. The code to start importing SQL Server data is below.
Photo by Author
Import data from SQL to Python
# Let's start with connecting SQL with Python and importing the SQL data as a DataFrame
import pyodbc
import pandas as pd
import numpy as np

connection_string = ("Driver={SQL Server Native Client 11.0};"
                     "Server=Your_Server_Name;"
                     "Database=My_Database_Name;"
                     "UID=Your_User_ID;"
                     "PWD=Your_Password;")
connection = pyodbc.connect(connection_string)

# Using the same query as above to get the output as a DataFrame
# We are importing the top 10 rows and all the columns of the State_Population table
population = pd.read_sql('SELECT TOP(10) * FROM State_Population', connection)

# OR
# write the query and assign it to a variable
query = 'SELECT * FROM STATE_AREAS WHERE [area (sq. mi)] > 100000'
# use the variable name in place of the query string
area = pd.read_sql(query, connection)

The output of the code above is imported into Python as a pandas DataFrame.
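A side note: passing a raw pyodbc connection to pd.read_sql works, but recent pandas versions prefer a SQLAlchemy connectable for databases other than SQLite. Below is a minimal sketch of that route, assuming the same placeholder server, database, and credential names as above.
# Minimal sketch (placeholder names): let SQLAlchemy manage the connection for pandas
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    "mssql+pyodbc://Your_User_ID:Your_Password@Your_Server_Name/My_Database_Name"
    "?driver=SQL+Server+Native+Client+11.0"
)
population = pd.read_sql('SELECT TOP(10) * FROM State_Population', engine)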
Once we have the data in the form of a DataFrame, we can see how to manipulate it using Pandas in Python. In this article, we are going to see how we can replicate SQL constructs in Python. There is no one "best" way, but many good paths. You choose the one you wish for.
Basic SQL Queries
We are going to deconstruct the most basic SQL queries and see how the same results can be achieved in Python. The queries which we will discuss in this article are:
- SELECT column_name(s)
- FROM table_name
- WHERE condition
- GROUP BY column_name(s)
- HAVING condition
- ORDER BY column_name(s)
The methodology we are going to adopt is like this: we will write a SQL query, and then list some possible ways in which the same result can be accomplished in Python. We have three tables in the database which we are going to use, and we have imported two of them as DataFrames in Python already. We will use one of these DataFrames (population) to understand these concepts.
The table State_Population is already imported and the DataFrame is named population.
| | state/region | ages | year | population |
|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 |
| 1 | AL | total | 2012 | 4817528.0 |
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 4 | AL | under18 | 2011 | 1125763.0 |
Let us see how to replicate the SQL functions in Python and get the same or similar results.
Note: The main headings are the broad SQL query names. Under those headings, the actual SQL query being replicated is written in bold and note format. Below it, all the Python ways to replicate it are listed as numbered methods, one after the other.
SELECT column_name(s)
FROM table_name
SELECT * FROM State_Population;
This SQL query will fetch all the columns (and all the rows as well) from the State_Population table. The same result can be accomplished by just calling the DataFrame in Python.
1. Call the DataFrame in Python
# Call the DataFrame by its name
population
Out[3]:
| | state/region | ages | year | population |
|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 |
| 1 | AL | total | 2012 | 4817528.0 |
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 4 | AL | under18 | 2011 | 1125763.0 |
| … | … | … | … | … |
| 2539 | USA | total | 2010 | 309326295.0 |
| 2540 | USA | under18 | 2011 | 73902222.0 |
| 2541 | USA | total | 2011 | 311582564.0 |
| 2542 | USA | under18 | 2012 | 73708179.0 |
| 2543 | USA | total | 2012 | 313873685.0 |
2544 rows × 4 columns
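Related: the TOP(10) used in the import code earlier also has a direct pandas analogue. DataFrame.head(n) returns the first n rows, so the snippet below (using the population DataFrame from above) mirrors SELECT TOP(10) *.
# SELECT TOP(10) * FROM State_Population  ~  take the first 10 rows of the DataFrame
population.head(10)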
SELECT year FROM State_Population;
This SQL query will fetch one column (year) and all the rows from the State_Population table. In Python, it can be achieved in the following ways. One thing to notice here is that when we select just one column, it gets converted from a pandas DataFrame object to a pandas Series object. We convert it back to a DataFrame by using the DataFrame function.
2. Call the DataFrame.ColumnName
# By calling the dataframe.column
pd.DataFrame(population.year)
Out[4]:
| | year |
|---|---|
| 0 | 2012 |
| 1 | 2012 |
| 2 | 2010 |
| 3 | 2010 |
| 4 | 2011 |
| … | … |
| 2539 | 2010 |
| 2540 | 2011 |
| 2541 | 2011 |
| 2542 | 2012 |
| 2543 | 2012 |
2544 rows × 1 columns
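As an aside, wrapping the column name in a second pair of brackets keeps the result as a DataFrame directly, so the pd.DataFrame() conversion above becomes optional.
population['year']     # returns a pandas Series
population[['year']]   # returns a one-column DataFrame, same as the output above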
SELECT population, year FROM State_Population;
This query will fetch two columns (population and year) and all the rows from the State_Population table. In Python, it can be achieved in the following ways.
3. Call the DataFrame with column names (Selecting)
Notice the names of the columns passed as a list, inside the indexing brackets [].
population[['population', 'year']]
Out[5]:
| | population | year |
|---|---|---|
| 0 | 1117489.0 | 2012 |
| 1 | 4817528.0 | 2012 |
| 2 | 1130966.0 | 2010 |
| 3 | 4785570.0 | 2010 |
| 4 | 1125763.0 | 2011 |
| … | … | … |
| 2539 | 309326295.0 | 2010 |
| 2540 | 73902222.0 | 2011 |
| 2541 | 311582564.0 | 2011 |
| 2542 | 73708179.0 | 2012 |
| 2543 | 313873685.0 | 2012 |
2544 rows × 2 columns
4. Use the pandas .loc method
The syntax for .loc is df.loc[row labels, column labels]. If instead of a list of names only ":" is passed, it means consider all. So df.loc[:, [column names]] means fetch all rows for the given column names.
population.loc[:, ['population', 'year']]
Out[6]:
| | population | year |
|---|---|---|
| 0 | 1117489.0 | 2012 |
| 1 | 4817528.0 | 2012 |
| 2 | 1130966.0 | 2010 |
| 3 | 4785570.0 | 2010 |
| 4 | 1125763.0 | 2011 |
| … | … | … |
| 2539 | 309326295.0 | 2010 |
| 2540 | 73902222.0 | 2011 |
| 2541 | 311582564.0 | 2011 |
| 2542 | 73708179.0 | 2012 |
| 2543 | 313873685.0 | 2012 |
2544 rows × 2 columns
The DataFrame above is the output from all the above codes. Different methods, same output.
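For completeness, the same selection can also be done positionally with .iloc, which takes integer positions instead of labels. The small sketch below assumes the column order shown earlier (state/region, ages, year, population), so positions 3 and 2 pick population and year.
# select the population and year columns by their integer positions
population.iloc[:, [3, 2]]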
SELECT column_name(s)
FROM table_name
WHERE condition
SELECT * FROM State_Population WHERE year = 2010;
This query will fetch all the columns and only those rows from the State_Population table where the year column has a value equal to 2010. In Python, it can be achieved in the following ways.
5. Use Python's slicing method
population[population.year == 2010]
Out[7]:
| | state/region | ages | year | population |
|---|---|---|---|---|
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 90 | AK | under18 | 2010 | 187902.0 |
| 91 | AK | total | 2010 | 713868.0 |
| 100 | AZ | under18 | 2010 | 1628563.0 |
| … | … | … | … | … |
| 2405 | WY | total | 2010 | 564222.0 |
| 2490 | PR | total | 2010 | 3721208.0 |
| 2491 | PR | under18 | 2010 | 896945.0 |
| 2538 | USA | under18 | 2010 | 74119556.0 |
| 2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
6. Use the pandas .loc method
In [8]:
population.loc[population.year == 2010,:]
| | state/region | ages | year | population |
|---|---|---|---|---|
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 90 | AK | under18 | 2010 | 187902.0 |
| 91 | AK | total | 2010 | 713868.0 |
| 100 | AZ | under18 | 2010 | 1628563.0 |
| … | … | … | … | … |
| 2405 | WY | total | 2010 | 564222.0 |
| 2490 | PR | total | 2010 | 3721208.0 |
| 2491 | PR | under18 | 2010 | 896945.0 |
| 2538 | USA | under18 | 2010 | 74119556.0 |
| 2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
7. Use pandas .query() method
Notice that the input for df.query() is always a string.
population.query('year == 2010')
Out[9]:
| | state/region | ages | year | population |
|---|---|---|---|---|
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 90 | AK | under18 | 2010 | 187902.0 |
| 91 | AK | total | 2010 | 713868.0 |
| 100 | AZ | under18 | 2010 | 1628563.0 |
| … | … | … | … | … |
| 2405 | WY | total | 2010 | 564222.0 |
| 2490 | PR | total | 2010 | 3721208.0 |
| 2491 | PR | under18 | 2010 | 896945.0 |
| 2538 | USA | under18 | 2010 | 74119556.0 |
| 2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
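One handy extra of .query() is that the condition string can reference Python variables with the @ prefix, which keeps the filter readable when the value comes from elsewhere in your code. A small sketch:
# filter on a value held in a Python variable
chosen_year = 2010
population.query('year == @chosen_year')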
8. Use pandas lambda function
Notice that the apply method is used to apply the lambda function to every row (axis=1). Its result is then fed inside the indexing brackets to slice the original DataFrame.
population[population.apply(lambda x: x["year"] == 2010, axis=1)]
Out[10]:
| | state/region | ages | year | population |
|---|---|---|---|---|
| 2 | AL | under18 | 2010 | 1130966.0 |
| 3 | AL | total | 2010 | 4785570.0 |
| 90 | AK | under18 | 2010 | 187902.0 |
| 91 | AK | total | 2010 | 713868.0 |
| 100 | AZ | under18 | 2010 | 1628563.0 |
| … | … | … | … | … |
| 2405 | WY | total | 2010 | 564222.0 |
| 2490 | PR | total | 2010 | 3721208.0 |
| 2491 | PR | under18 | 2010 | 896945.0 |
| 2538 | USA | under18 | 2010 | 74119556.0 |
| 2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
The DataFrame above is the output from all the above codes. Different methods, same output.
SELECT [state/region], population, year FROM State_Population WHERE year IN (2010, 2012) AND ages = 'under18';
This query will fetch three columns (state/region, population, year) and only those rows from the State_Population table where the year column has a value equal to 2010 or 2012 and the ages column has a value equal to "under18". In Python, it can be achieved in the following ways.
9. Use Python's indexing and slicing
# By using Python's indexing and slicing
population[(population.year.isin([2010, 2012])) & (population.ages == "under18")][['state/region', 'population', 'year']]
Out[11]:
| | state/region | population | year |
|---|---|---|---|
| 0 | AL | 1117489.0 | 2012 |
| 2 | AL | 1130966.0 | 2010 |
| 90 | AK | 187902.0 | 2010 |
| 94 | AK | 188162.0 | 2012 |
| 96 | AZ | 1617149.0 | 2012 |
| … | … | … | … |
| 2404 | WY | 135351.0 | 2010 |
| 2491 | PR | 896945.0 | 2010 |
| 2494 | PR | 841740.0 | 2012 |
| 2538 | USA | 74119556.0 | 2010 |
| 2542 | USA | 73708179.0 | 2012 |
106 rows × 3 columns
10. Use the pandas .loc method
population.loc[(population.year.isin([2010, 2012])) & (population.ages == "under18"), ['state/region', 'population', 'year']]
Out[12]:
| | state/region | population | year |
|---|---|---|---|
| 0 | AL | 1117489.0 | 2012 |
| 2 | AL | 1130966.0 | 2010 |
| 90 | AK | 187902.0 | 2010 |
| 94 | AK | 188162.0 | 2012 |
| 96 | AZ | 1617149.0 | 2012 |
| … | … | … | … |
| 2404 | WY | 135351.0 | 2010 |
| 2491 | PR | 896945.0 | 2010 |
| 2494 | PR | 841740.0 | 2012 |
| 2538 | USA | 74119556.0 | 2010 |
| 2542 | USA | 73708179.0 | 2012 |
106 rows × 3 columns
11. Use pandas .query() method
Notice that the input for df.query() is always a string.
population.query('(year == 2010 | year == 2012) & ages == "under18"')[['state/region', 'population', 'year']]
Out[13]:
| | state/region | population | year |
|---|---|---|---|
| 0 | AL | 1117489.0 | 2012 |
| 2 | AL | 1130966.0 | 2010 |
| 90 | AK | 187902.0 | 2010 |
| 94 | AK | 188162.0 | 2012 |
| 96 | AZ | 1617149.0 | 2012 |
| … | … | … | … |
| 2404 | WY | 135351.0 | 2010 |
| 2491 | PR | 896945.0 | 2010 |
| 2494 | PR | 841740.0 | 2012 |
| 2538 | USA | 74119556.0 | 2010 |
| 2542 | USA | 73708179.0 | 2012 |
106 rows × 3 columns
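The same compound condition can also be written with the in and and keywords inside .query(), which reads almost like the SQL WHERE clause. A sketch of that variant:
# SQL-like spelling of the same filter
population.query('year in [2010, 2012] and ages == "under18"')[['state/region', 'population', 'year']]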
12. Use lambda function
population[population.apply(lambda x: (x["year"] in [2010, 2012]) & (x["ages"] == "under18"), axis=1)]
Out[14]:
| | state/region | ages | year | population |
|---|---|---|---|---|
| 0 | AL | under18 | 2012 | 1117489.0 |
| 2 | AL | under18 | 2010 | 1130966.0 |
| 90 | AK | under18 | 2010 | 187902.0 |
| 94 | AK | under18 | 2012 | 188162.0 |
| 96 | AZ | under18 | 2012 | 1617149.0 |
| … | … | … | … | … |
| 2404 | WY | under18 | 2010 | 135351.0 |
| 2491 | PR | under18 | 2010 | 896945.0 |
| 2494 | PR | under18 | 2012 | 841740.0 |
| 2538 | USA | under18 | 2010 | 74119556.0 |
| 2542 | USA | under18 | 2012 | 73708179.0 |
106 rows × 4 columns
The DataFrame above is the output from all the above codes. Different methods, the same rows (note that method 12 returns all four columns, since no column selection was applied).
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
SELECT [state/region], AVG(population) AS population FROM State_Population WHERE ages = 'total' GROUP BY [state/region] HAVING AVG(population) > 10000000;
The GROUP BY functions of SQL and Pandas look the same on the surface, but pandas groupby is far more capable and efficient, especially for more complex operations. To implement the above operation from SQL in Python, let us look at the pandas groupby function up close.
It is possible to group by using one or more columns. For one column, just pass the column name, and for more than one column, pass the names as a list.
# grouped by state/region
population.groupby(by='state/region')
Out[15]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016982CD0408>
# grouped by state/region and year
population.groupby(by=['state/region', 'year'])
Out[16]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016982D7A908>
The output is a groupby object. It shows that the grouping has been done and the groupby function has done its task. But as we have not told it which function to aggregate with, the output is not in the form of a DataFrame. So let's do that now.
population.groupby(by=['state/region', 'year']).count()
Out[17]:
| | | ages | population |
|---|---|---|---|
| state/region | year | | |
| AK | 1990 | 2 | 2 |
| | 1991 | 2 | 2 |
| | 1992 | 2 | 2 |
| | 1993 | 2 | 2 |
| | 1994 | 2 | 2 |
| … | … | … | … |
| WY | 2009 | 2 | 2 |
| | 2010 | 2 | 2 |
| | 2011 | 2 | 2 |
| | 2012 | 2 | 2 |
| | 2013 | 2 | 2 |
1272 rows × 2 columns
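count() is only one choice of aggregation; mean(), sum(), and friends work the same way, and .agg() lets you request several at once. A small sketch using the same grouping:
# mean population per state/region and year
population.groupby(by=['state/region', 'year'])['population'].mean()
# several aggregations of the population column in one go
population.groupby(by=['state/region', 'year']).agg({'population': ['mean', 'min', 'max']})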
We can assign this groupby object to a variable, and then use that variable for further operations.
grouped = population.groupby(by=['state/region', 'year'])
Now let's replicate the SQL query. To add the HAVING clause as well, we need to group by first and then filter on the condition. The Python implementation of the above SQL code is as below.
13. groupby and aggregate in Pandas
df = pd.DataFrame(population.loc[population.ages == 'total', :].groupby(by='state/region')['population'].mean())
df.loc[df.population > 10000000, :]
Out[19]:
| | population |
|---|---|
| state/region | |
| CA | 3.433414e+07 |
| FL | 1.649654e+07 |
| IL | 1.237080e+07 |
| NY | 1.892581e+07 |
| OH | 1.134238e+07 |
| PA | 1.236960e+07 |
| TX | 2.160626e+07 |
| USA | 2.849979e+08 |
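If, instead of the aggregated means, you want to keep the original detail rows of the qualifying states (closer to what HAVING feels like when you then re-select columns), pandas GroupBy.filter is an alternative worth knowing. A sketch under the same condition:
# keep all rows of states whose mean 'total' population exceeds 10 million
totals = population.loc[population.ages == 'total', :]
totals.groupby('state/region').filter(lambda g: g['population'].mean() > 10000000)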
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s)
SELECT [state/region], AVG(population) AS population FROM State_Population WHERE ages = 'total' GROUP BY [state/region] HAVING AVG(population) > 10000000 ORDER BY population;
The ORDER BY in SQL is used to sort the result in the given order. In the above SQL code, the table needs to be ordered in ascending order (the default). This task can be accomplished by using the pandas sort_values() method.
14. Order by using sort_values() in Python
df = pd.DataFrame(population.loc[population.ages == 'total', :].groupby(by='state/region')['population'].mean())
df.loc[df.population > 10000000, :].sort_values(by='population')
Out[20]:
| | population |
|---|---|
| state/region | |
| OH | 1.134238e+07 |
| PA | 1.236960e+07 |
| IL | 1.237080e+07 |
| FL | 1.649654e+07 |
| NY | 1.892581e+07 |
| TX | 2.160626e+07 |
| CA | 3.433414e+07 |
| USA | 2.849979e+08 |
The sort is done in ascending order by default. To change that, ascending=False shall be used.
df.loc[df.population > 10000000, :].sort_values(by='population', ascending=False)
Out[21]:
| | population |
|---|---|
| state/region | |
| USA | 2.849979e+08 |
| CA | 3.433414e+07 |
| TX | 2.160626e+07 |
| NY | 1.892581e+07 |
| FL | 1.649654e+07 |
| IL | 1.237080e+07 |
| PA | 1.236960e+07 |
| OH | 1.134238e+07 |
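When the descending sort is only there to pick out the largest values, DataFrame.nlargest() combines the sort and the limit in one call, much like ORDER BY ... DESC together with TOP(n). A sketch:
# top 3 states by mean population among the filtered rows
df.loc[df.population > 10000000, :].nlargest(3, 'population')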
Bonus: Sort by Multiple columns
Pandas gives the functionality to sort by multiple columns. Not only that, you can choose which columns to sort in ascending order and which in descending order. Let us group and sort our population data set. We will group it by state and year, and sort it by year and population.
15. sort_values() on more than one column
In [22]:
# Group by and view the grouped table
grouped = population.groupby(by=['state/region', 'year']).mean()
grouped
Out[22]:
| | | population |
|---|---|---|
| state/region | year | |
| AK | 1990 | 365396.0 |
| | 1991 | 376186.5 |
| | 1992 | 386807.0 |
| | 1993 | 393312.0 |
| | 1994 | 395373.5 |
| … | … | … |
| WY | 2009 | 347405.5 |
| | 2010 | 349786.5 |
| | 2011 | 351368.0 |
| | 2012 | 356576.0 |
| | 2013 | 360168.5 |
1272 rows × 1 columns
# Sorting the grouped table in
# ascending order of year (increasing year) and
# descending order of population (decreasing population)
grouped.sort_values(by=['year', 'population'], ascending=[True, False])
Out[23]:
| | | population |
|---|---|---|
| state/region | year | |
| USA | 1990 | 156920663.0 |
| CA | 1990 | 18970008.0 |
| NY | 1990 | 11151213.5 |
| TX | 1990 | 10981487.5 |
| FL | 1990 | 8011057.0 |
| … | … | … |
| AK | 2013 | 461632.0 |
| ND | 2013 | 443040.5 |
| DC | 2013 | 378961.5 |
| VT | 2013 | 374665.5 |
| WY | 2013 | 360168.5 |
1272 rows × 1 columns
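One last note: a SQL result set has no MultiIndex, so if you want the sorted output as a flat table with state/region and year as ordinary columns, call reset_index() on it. A sketch:
# flatten the MultiIndex back into regular columns, like a SQL result set
grouped.sort_values(by=['year', 'population'], ascending=[True, False]).reset_index()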
Conclusion
You must have wondered how organizations manage their huge databases. They most certainly do not keep them in Excel or other spreadsheet formats. Real-life business databases are maintained in relational database systems, which are created and accessed most commonly using SQL. Hence, knowing SQL is a necessary tool for any data scientist. But SQL is more powerful than just a data-picking tool. It is capable of many data wrangling and data manipulation tasks. But so is Python.
Now, no single language is sufficient to complete all the tasks with operational efficiency. Hence, a deep understanding of both SQL and Python will help you choose which one to use for which job.
If you want to simply select, filter, and do basic operations on data, you can do that efficiently in SQL. But if there is a need for complex grouping operations and more data manipulation, Pandas in Python would be a more apt option.
There are many benefits to using multiple data analysis languages, as you can customize and use a hybrid approach well suited to your ever-evolving needs.
To see in detail how to connect Python with SQL or SQL Server, read How-to-Access-use-SQL-Database-with-pyodbc-in-Python.
The implied learning in this article was that you can use Python to do things that you thought were only possible using SQL. There may or may not be a straightforward solution, but if you are inclined to find it, there are plenty of resources at your disposal to find a way out. You can mix and match the learnings from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.
About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and writing about them. Check out more at https://www.authornilabh.com/. My endeavour to make Python easy and accessible to all is "Python Made Easy".
The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.
Source: https://www.analyticsvidhya.com/blog/2021/06/15-pandas-functions-to-replicate-basic-sql-queries-in-python/